* [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Vladimir Sementsov-Ogievskiy @ 2016-11-25 11:28 UTC (permalink / raw)
  To: nbd-general, qemu-devel
  Cc: kwolf, pbonzini, pborzenkov, stefanha, den, w, eblake, alex,
	vsementsov, mpa

With the availability of sparse storage formats, it is often needed
to query the status of a particular range and read only those blocks
of data that are actually present on the block device.

To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.

There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors).  Future NBD extensions may add commands
to control dirtiness through NBD.

Since NBD protocol has no notion of block size, and to mimic SCSI
"GET LBA STATUS" command more closely, it has been chosen to return
a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
instead of a bitmap.

CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

v3:

Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
removing the restriction that the sum of status descriptor lengths must be
equal to the requested length; i.e., the server may now reply with less
data than requested if it wants.

Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as bit 8 is
now NBD_FLAG_CAN_MULTI_CONN in the master branch.

And, finally, I've rebased this onto the current state of the
extension-structured-reply branch (which itself should be rebased on
master, IMHO).

With this resend I just want to continue the discussion started about half
a year ago. Here is a summary of some questions and ideas from the v2
discussion:

1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
   A: This all is for read-only disks, so the data is static and unchangeable.

2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
   A: 1: the server replies with status descriptors of any size; the
         granularity is hidden from the client
      2: dirty/allocated requests are separate and unrelated to each
         other, so their granularities do not intersect

3. Q: Selecting the dirty bitmap to export
   A: several variants:
      1: the bitmap id is in the flags field of the request
          pros: - simple
          cons: - it's a hack; the flags field is meant for other uses
                - we would have to map bitmap names to these "ids"
      2: introduce extended NBD requests with variable length and exploit this
         feature for the BLOCK_STATUS command, specifying a bitmap identifier.
         pros: - looks like the proper way
         cons: - we have to create an additional extension
               - possibly we have to create a map,
                 {<QEMU bitmap name> <=> <NBD bitmap id>}
      3: an external tool selects which bitmap to export. So, in the case of
         QEMU it would be something like a QMP command block-export-dirty-bitmap.
         pros: - simple
               - we can extend it to behave like (2) later
         cons: - an additional QMP command to implement (possibly the lesser evil)
         note: Hmm, the external tool can choose between allocated/dirty data
               too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag altogether.

4. Q: Shouldn't get_{allocated,dirty} be separate commands?
   cons: two commands with almost the same semantics and similar mechanics.
   pros: there is a good point in separating the clearly defined,
         block-device-native GET_BLOCK_STATUS from user-driven and
         effectively undefined data called 'dirtiness'.

5. The number of status descriptors sent by the server should be restricted.
   variants:
   1: just allow the server to restrict this as it wants (which was done in v3)
   2: (not excluding 1) the client somehow specifies the maximum number
      of descriptors.
      2.1: add a command flag that requests only one descriptor
           (otherwise, no restriction from the client)
      2.2: again, introduce extended NBD requests, and add a field to
           specify this maximum

6. Q: What to do with unspecified flags (in request/reply)?
   I think the normal variant is to make them reserved. (The server should
   return EINVAL if it finds unknown bits; the client should consider a
   reply with unknown bits an error.)
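
   A minimal sketch of the server side of that rule (the flag name is from
   this proposal; its bit position and the helper are my assumptions):

       #include <errno.h>
       #include <stdint.h>

       #define NBD_FLAG_STATUS_DIRTY (1 << 2)   /* assumed bit position */

       /* Treat all undefined request flag bits as reserved and reject a
        * request that carries any of them with EINVAL. */
       static int check_block_status_flags(uint16_t flags)
       {
           const uint16_t known = NBD_FLAG_STATUS_DIRTY;

           return (flags & ~known) ? -EINVAL : 0;
       }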

======

Also, an idea on 2-4:

    Since we say that dirtiness is unknown to NBD, and that an external
    tool should specify, manage and understand which data is actually
    transmitted, why not just call it user_data and leave the status field
    of the reply chunk unspecified in this case?

    So, I propose one flag for NBD_CMD_BLOCK_STATUS:
    NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined by
    Eric's 'Block provisioning status' paragraph.  If it is set, we just
    leave the status field to some external... protocol? Who knows what
    this user data is.

    Note: I'm not sure that I like this (my own) proposal. It's just an
    idea; maybe someone will like it.  And, I think, it represents what we
    are trying to do more honestly.

    Note 2: the next step of generalization would be NBD_CMD_USER, with
    variable request size, structured reply and no definition :)


Another idea, about backups themselves:

    Why do we need allocated/zero status for backup? IMHO we don't.

    Full backup: just do a structured read: it will show us which chunks
    may be treated as zeroes.

    Incremental backup: get the dirty bitmap (somehow, for example through
    the user-defined part of the proposed command), then, for dirty blocks,
    read them through structured read, so the information about
    zero/unallocated areas comes along with it.
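
    A rough client-side sketch of that incremental flow (the nbd_* helpers
    are assumed wrappers around NBD_CMD_BLOCK_STATUS in dirty mode and
    structured-reply NBD_CMD_READ; they are not defined by this proposal):

        /* Walk the export, copying only extents whose NBD_STATE_CLEAN
         * bit is clear (i.e. dirty); the structured reads then deliver
         * the zero/hole information for those extents for free. */
        for (uint64_t off = 0; off < export_size; ) {
            uint32_t len, status;

            nbd_block_status_one(off, export_size - off, &len, &status);
            if (!(status & NBD_STATE_CLEAN))
                nbd_structured_read(off, len);
            off += len;
        }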

For me all the variants above are OK. Let's finally choose something.

v2:
v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html

Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather large rebase; I've also renamed the command
to 'NBD_CMD_BLOCK_STATUS', changed the request modes to be
determined by boolean flags (rather than by fixed values of the
16-bit flags field), changed the reply status fields to be
bitwise-or values (with a default of 0 always being sane), and
changed the descriptor layout to drop an offset but to include
a 32-bit status so that the descriptor is nicely 8-byte aligned
without padding.
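
(For reference, a minimal C sketch of that descriptor layout; the struct
and field names are mine, not part of the proposal:)

    #include <stdint.h>

    /* One block status descriptor as carried in an
     * NBD_REPLY_TYPE_BLOCK_STATUS chunk: two 32-bit fields sent in
     * network byte order, so each descriptor is 8 bytes and a chunk
     * payload is simply an array of these. */
    struct nbd_block_descriptor {
        uint32_t length;  /* bytes covered by this extent; never zero */
        uint32_t status;  /* bitwise-or of NBD_STATE_* flags, or 0 */
    };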

 doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index 1c2fa5b..253b6f1 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -755,6 +755,8 @@ The field has the following format:
   MUST leave this flag clear if structured replies have not been
   negotiated. Clients MUST NOT set the `NBD_CMD_FLAG_DF` request
   flag unless this transmission flag is set.
+- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
+  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -1036,6 +1038,10 @@ interpret the "length" bytes of payload.
   64 bits: offset (unsigned)  
   32 bits: hole size (unsigned, MUST be nonzero)  
 
+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+  Defined by the experimental extension `BLOCK_STATUS`; see below.
+
 All error chunk types have bit 15 set, and begin with the same
 *error*, *message length*, and optional *message* fields as
 `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
@@ -1070,7 +1076,7 @@ remaining structured fields at the end.
   were sent earlier in the structured reply, the server SHOULD NOT
   send multiple distinct offsets that lie within the bounds of a
   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
 
   The payload is structured as:
 
@@ -1247,6 +1253,11 @@ The following request types exist:
 
     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md).
 
+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    Defined by the experimental `BLOCK_STATUS` extension; see below.
+
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -1331,6 +1342,148 @@ written as branches which can be merged into master if and
 when those extensions are promoted to the normative version
 of the document in the master branch.
 
+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often needed to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup, or any other operation
+with a concept of data dirtiness, they all share a need for a list
+of the ranges that the particular operation treats as dirty.
+
+To provide such class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states.  This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.
+
+* `NBD_FLAG_SEND_BLOCK_STATUS`
+
+    The server SHOULD set this transmission flag to 1 if structured
+    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
+    request is supported.
+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be a positive integer multiple of 8.  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST NOT be greater than the
+    length of the original request.  This chunk type MUST appear at most
+    once in a structured reply. Valid as a reply to
+    `NBD_CMD_BLOCK_STATUS`.
+
+    The payload is structured as a list of one or more descriptors,
+    each with this layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    The definition of the status flags is determined based on the
+    flags present in the original request.
+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range
+    of interest. Clients SHOULD NOT use this request unless the server
+    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
+    in turn requires the client to first negotiate structured replies.
+    For a successful return, the server MUST use a structured reply,
+    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represents consecutive portions
+    of the file starting from the specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST NOT be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than requested. However the server MUST return at
+    least one status descriptor.  The server SHOULD use different
+    *status* values between consecutive descriptors, and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception). The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk.  However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.
+
+    The type of information requested by the client is determined by
+    the request flags, as follows:
+
+    1. Block provisioning status
+
+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
+    provisioning status of the device, where the status field of each
+    descriptor is determined by the following bits (all four
+    combinations of these two bits are possible):
+
+      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report allocation even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1); if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.
+
+    The client SHOULD NOT read from an area that has both
+    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
+
+    2. Block dirtiness status
+
+    This command is meant to operate in tandem with other (non-NBD)
+    channels to the server.  Generally, a "dirty" block is a block
+    that has been written to by someone, but the exact meaning of "has
+    been written" is left to the implementation.  For example, a
+    virtual machine monitor could provide a (non-NBD) command to start
+    tracking blocks written by the virtual machine.  A backup client
+    can then connect to an NBD server provided by the virtual machine
+    monitor and use `NBD_CMD_BLOCK_STATUS` with the
+    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
+    blocks that the virtual machine has changed.
+
+    An implementation that doesn't track the "dirtiness" state of
+    blocks MUST either fail this command with `EINVAL`, or mark all
+    blocks as dirty in the descriptor that it returns.  Upon receiving
+    an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
+    status of the device, where the status field of each descriptor is
+    determined by the following bit:
+
+      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status.
+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunk (such as descriptor lengths in the
+`NBD_REPLY_TYPE_BLOCK_STATUS` chunk summing to more than the requested
+length).
+The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
+request including one or more sectors beyond the size of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request dirtiness status
+  rather than provisioning status.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently
-- 
1.8.3.1


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Stefan Hajnoczi @ 2016-11-25 14:02 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, qemu-devel, kwolf, pbonzini, pborzenkov, den, w,
	eblake, alex, mpa


On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> [...]

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>



* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Wouter Verhelst @ 2016-11-27 19:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, qemu-devel, kwolf, den, pborzenkov, stefanha, mpa, pbonzini

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right.
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> [...]
>
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of status descriptor lengths must be
> equal to the requested length; i.e., the server may now reply with less
> data than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as bit 8 is
> now NBD_FLAG_CAN_MULTI_CONN in the master branch.

Right.

> And, finally, I've rebased this onto the current state of the
> extension-structured-reply branch (which itself should be rebased on
> master, IMHO).

Probably a good idea, given the above.

> With this resend I just want to continue the discussion started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>    A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: the server replies with status descriptors of any size; the
>          granularity is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities do not intersect

Not entirely sure anymore what this is about?

> 3. Q: Selecting the dirty bitmap to export
>    A: several variants:
>       1: the bitmap id is in the flags field of the request
>           pros: - simple
>           cons: - it's a hack; the flags field is meant for other uses
>                 - we would have to map bitmap names to these "ids"
>       2: introduce extended NBD requests with variable length and exploit this
>          feature for the BLOCK_STATUS command, specifying a bitmap identifier.
>          pros: - looks like the proper way
>          cons: - we have to create an additional extension
>                - possibly we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: an external tool selects which bitmap to export. So, in the case of
>          QEMU it would be something like a QMP command block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - an additional QMP command to implement (possibly the lesser evil)
>          note: Hmm, the external tool can choose between allocated/dirty data
>                too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag altogether.

Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on
the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means
*by default*, but then we can add a note that clients and servers can
possibly define other meanings out of band if they want to.

> 4. Q: Shouldn't get_{allocated,dirty} be separate commands?
>    cons: two commands with almost the same semantics and similar mechanics.
>    pros: there is a good point in separating the clearly defined,
>          block-device-native GET_BLOCK_STATUS from user-driven and
>          effectively undefined data called 'dirtiness'.

Yeah, having them separate commands might be a bad idea indeed.

> 5. The number of status descriptors sent by the server should be restricted.
>    variants:
>    1: just allow the server to restrict this as it wants (which was done in v3)
>    2: (not excluding 1) the client somehow specifies the maximum number
>       of descriptors.
>       2.1: add a command flag that requests only one descriptor
>            (otherwise, no restriction from the client)
>       2.2: again, introduce extended NBD requests, and add a field to
>            specify this maximum

I think having a flag which requests just one descriptor can be useful,
but I'm hesitant to add it unless it's actually going to be used; so in
other words, I'll leave the decision on that bit to you.

> 6. Q: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (The server should
>    return EINVAL if it finds unknown bits; the client should consider a
>    reply with unknown bits an error.)

Right, probably best to do that, yes.

> ======
> 
> Also, an idea on 2-4:
> 
>     Since we say that dirtiness is unknown to NBD, and that an external
>     tool should specify, manage and understand which data is actually
>     transmitted, why not just call it user_data and leave the status field
>     of the reply chunk unspecified in this case?
> 
>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>     NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined by
>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>     leave the status field to some external... protocol? Who knows what
>     this user data is.

Yes, this sounds like a reasonable approach.

>     Note: I'm not sure that I like this (my own) proposal. It's just an
>     idea; maybe someone will like it.  And, I think, it represents what we
>     are trying to do more honestly.

Indeed.

>     Note 2: the next step of generalization would be NBD_CMD_USER, with
>     variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
> 
>     Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what
it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the
"dirty" thing seems to want to be. Whether something is allocated is
just a special case of that.

Actually, come to think of that. What is the exact use case for this
thing? I understand you're trying to create incremental backups of
things, which would imply you don't write from the client that is
getting the block status thingies, right? If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the
definition of what a snapshot is back into the NBD protocol, without
having to say "this could be anything", and without requiring
connectivity over two ports for this to be useful (e.g., you could store
the machine-readable form of the snapshot description into your backup
program and match what they mean with what you're interested in at
restore time, etc).

This wouldn't work if you're interested in new snapshots that get
created once we've already moved into transmission, but hey.

Thoughts?
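
(To make the shape of that idea concrete, a purely speculative C sketch;
neither the option nor this layout exists in the protocol today:)

    /* Hypothetical per-snapshot record that an NBD_OPT_GET_SNAPSHOTS
     * reply could carry during negotiation. */
    struct nbd_snapshot_info {
        uint32_t nbd_id;       /* NBD-local id, usable in later commands */
        uint32_t foreign_len;  /* length of the foreign identifier below */
        /* ...followed by foreign_len bytes of implementation-defined
         * ("foreign") identifier, e.g. a qemu JSON description. */
    };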

>     Full backup: just do a structured read: it will show us which chunks
>     may be treated as zeroes.

Right.

[...]

I'll commit this in a minute into a separate branch called
"extension-blockstatus", under the understanding that changes are still
required, as per above (i.e., don't assume that just because there's a
branch I'm happy with the current result ;-)

Regards

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Stefan Hajnoczi @ 2016-11-28 11:19 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Vladimir Sementsov-Ogievskiy, nbd-general, qemu-devel, kwolf,
	den, pborzenkov, mpa, pbonzini


On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right.
>
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
> 
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.

The out-of-band operations you are referring to are for dirty bitmap
management.  (The goal is to read out blocks that changed since the last
backup.)

The client does not access the live disk, instead it accesses a
read-only snapshot and the dirty information (so that it can copy out
only blocks that were written).  The client is allowed to read blocks
that are not dirty too.

If you want to implement the whole incremental backup workflow in NBD
then the client would first have to connect to the live disk, set up
dirty tracking, create a snapshot export, and then connect to that
snapshot.

That sounds like a big feature set and I'd argue it's for the control
plane (storage API) and not the data plane (NBD).  There were
discussions about transferring the dirty information via the control
plane, but it seems more appropriate to do it in the data plane since it is
block-level information.

I'm arguing that the NBD protocol doesn't need to support the
incremental backup workflow since it's a complex control plane concept.

Being able to read dirty information via NBD is useful for other block
backup applications, not just QEMU.  It could be used for syncing LVM
volumes across machines, for example, if someone implements an NBD+LVM
server.
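
(As a sketch, the wire request such a backup client would send; the
command value is from the proposal, the header packing is standard NBD,
and the dirty flag's bit position is an assumption:)

    #include <stdint.h>

    /* Standard 28-byte NBD request header (all fields big-endian on the
     * wire), here asking for the dirtiness status of
     * [offset, offset + length). */
    struct nbd_request {
        uint32_t magic;   /* NBD_REQUEST_MAGIC, 0x25609513 */
        uint16_t flags;   /* NBD_FLAG_STATUS_DIRTY (assumed bit) */
        uint16_t type;    /* NBD_CMD_BLOCK_STATUS (7) */
        uint64_t handle;  /* echoed back in the reply chunks */
        uint64_t offset;
        uint32_t length;  /* range of interest */
    } __attribute__((packed));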

Another issue with adding control plane operations is that you need to
begin considering privilege separation.  Should all NBD clients be able
to initiate snapshots, dirty tracking, etc or is some kind of access
control required to limit certain commands?  Not all clients require the
same privileges and so they shouldn't have access to the same set of
operations.

Stefan



* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Wouter Verhelst @ 2016-11-28 17:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, pbonzini, mpa, den

Hi Stefan,

On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote:
> On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > Quickly: the reason I haven't merged this yet is twofold:
> > - I wasn't thrilled with the proposal at the time. It felt a bit
> >   hackish, and bolted onto NBD so you could use it, but without defining
> >   everything in the NBD protocol. "We're reading some data, but it's not
> >   about you". That didn't feel right.
> >
> > - There were a number of questions still unanswered (you're answering a
> >   few below, so that's good).
> > 
> > For clarity, I have no objection whatsoever to adding more commands if
> > they're useful, but I would prefer that they're also useful with NBD on
> > its own, i.e., without requiring an initiation or correlation of some
> > state through another protocol or network connection or whatever. If
> > that's needed, that feels like I didn't do my job properly, if you get
> > my point.
> 
> The out-of-band operations you are referring to are for dirty bitmap
> management.  (The goal is to read out blocks that changed since the last
> backup.)
> 
> The client does not access the live disk, instead it accesses a
> read-only snapshot and the dirty information (so that it can copy out
> only blocks that were written).  The client is allowed to read blocks
> that are not dirty too.

I understood as much, yes.

> If you want to implement the whole incremental backup workflow in NBD
> then the client would first have to connect to the live disk, set up
> dirty tracking, create a snapshot export, and then connect to that
> snapshot.
> 
> That sounds like a big feature set and I'd argue it's for the control
> plane (storage API) and not the data plane (NBD).  There were
> discussions about transferring the dirty information via the control
> plane but it seems more appropriate to it in the data plane since it is
> block-level information.

I agree that creating and managing snapshots is out of scope for NBD. The
protocol is not set up for that.

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

My proposal also makes it impossible to get updates of newly created
snapshots without disconnecting and reconnecting (due to the fact that
you can't go from transmission back to negotiation), but I'm not sure
that's a problem.

Doing so has two advantages:
- If a client is accidentally (due to misconfiguration or implementation
  bugs or whatnot) connecting to the wrong server after having created a
  snapshot through a management protocol, we have an opportunity to
  detect this error, due to the fact that the "foreign" identifiers
  passed to the client during negotiation will not match with what the
  client was expecting.
- A future version of the protocol could possibly include an extended
  version of the read command, allowing a client to read information
  from multiple storage snapshots without requiring a reconnect, and
  allowing current clients information about allocation status across
  various snapshots (although a first implementation could very well
  limit itself to only having one snapshot).

[...]
> I'm arguing that the NBD protocol doesn't need to support the
> incremental backup workflow since it's a complex control plane concept.
> 
> Being able to read dirty information via NBD is useful for other block
> backup applications, not just QEMU.  It could be used for syncing LVM
> volumes across machines, for example, if someone implements an NBD+LVM
> server.

Indeed, and I was considering adding a basic implementation to go with
the copy-on-write support in stock nbd-server, too.

> Another issue with adding control plane operations is that you need to
> begin considering privilege separation.  Should all NBD clients be able
> to initiate snapshots, dirty tracking, etc or is some kind of access
> control required to limit certain commands?  Not all clients require the
> same privileges and so they shouldn't have access to the same set of
> operations.

Sure, which is why I wasn't suggesting anything of the sorts :-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: John Snow @ 2016-11-28 23:15 UTC (permalink / raw)
  To: Wouter Verhelst, Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, stefanha, pbonzini, mpa, den

Hi Wouter,

Some of this mess may be partially my fault, but I have not been 
following the NBD extension proposals up until this point.

Are you familiar with the genesis behind this idea and what we are 
trying to accomplish in general?

We had the thought to propose an extension roughly similar to SCSI's 
'get lba status', but the existing status bits there did not correlate 
semantically to what we are hoping to convey. There was some debate over 
whether it would be an abuse of the protocol to attempt to use them as such.

For a quick recap, get lba status appears to offer three canonical statuses:

0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"

My interpretation of mapped was simply that it was physically allocated, 
and 'deallocated' was simply unallocated.

(I uh, am not actually clear on what anchored means exactly.)

Either way, we felt at the time that it would be wrong to propose an 
analogue command for NBD and then immediately abuse the existing 
semantics, hence a new command like -- but not identical to -- the SCSI one.
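
(The same three values as a C enum, simply restating the recap above;
this is SCSI, not NBD, content:)

    /* Provisioning states reported by SCSI GET LBA STATUS. */
    enum scsi_lba_status {
        SCSI_LBA_MAPPED      = 0x0,  /* mapped, or unknown state */
        SCSI_LBA_DEALLOCATED = 0x1,
        SCSI_LBA_ANCHORED    = 0x2,
    };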


On 11/27/2016 02:17 PM, Wouter Verhelst wrote:
> Hi Vladimir,
>
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right.
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
>
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.
>
> On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
>>    A: This all is for read-only disks, so the data is static and unchangeable.
>
> I think we should declare that it's up to the client to ensure no other
> writes happen without its knowledge. This may be because the client and
> server communicate out of band about state changes, or because the
> client somehow knows that it's the only writer, or whatever.
>
> We can easily do that by declaring that the result of that command only
> talks about *current* state, and that concurrent writes by different
> clients may invalidate the state. This is true for NBD in general (i.e.,
> concurrent read or write commands from other clients may confuse file
> systems on top of NBD), so it doesn't change expectations in any way.
>

Agree with you here. "This was correct when I sent it, but it's up to 
you to ensure that nothing would have invalidated that in the meantime" 
is fine semantically to me.

Of course in our implementation, we intend to only export essentially 
read-only snapshots of data, so we don't personally expect to run into 
any of these kinds of semantic problems.

>> 2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: the server replies with status descriptors of any size; the
>>          granularity is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities do not intersect
>
> Not entirely sure anymore what this is about?
>

We have a concept of granularity for dirty tracking; consider it a block 
size.

I don't think it's necessarily relevant to NBD, except for the case of 
false positives. Consider the case of a 512 byte block that is assumed 
dirty simply because it's adjacent to actually dirty data.

That may have some meaning for how we write the NBD spec; i.e. the 
meaning of the status bit changes from "This is definitely modified" to 
"This was possibly modified," due to limitations in the management of 
this information by the server.

If we expose the granularity information via NBD, it at least makes it 
clear how fuzzy the results presented may be. Otherwise, it's not really 
required.

[Again, as seen below, a user-defined bit would be plenty sufficient...!]
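
(A small sketch of where those false positives come from; the granularity
value and the helper are illustrative only:)

    #include <stdint.h>

    #define GRANULARITY 65536  /* illustrative tracking block size */

    /* A write of [off, off + len), len > 0, marks whole tracking blocks
     * dirty, so up to GRANULARITY - 1 untouched bytes on either side may
     * later be reported as dirty. */
    static void mark_dirty(uint64_t off, uint64_t len, uint8_t *bitmap)
    {
        uint64_t first = off / GRANULARITY;
        uint64_t last  = (off + len - 1) / GRANULARITY;

        for (uint64_t b = first; b <= last; b++)
            bitmap[b / 8] |= 1u << (b % 8);
    }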

>> 3. Q: Selecting the dirty bitmap to export
>>    A: several variants:
>>       1: the bitmap id is in the flags field of the request
>>           pros: - simple
>>           cons: - it's a hack; the flags field is meant for other uses
>>                 - we would have to map bitmap names to these "ids"
>>       2: introduce extended NBD requests with variable length and exploit this
>>          feature for the BLOCK_STATUS command, specifying a bitmap identifier.
>>          pros: - looks like the proper way
>>          cons: - we have to create an additional extension
>>                - possibly we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: an external tool selects which bitmap to export. So, in the case of
>>          QEMU it would be something like a QMP command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - an additional QMP command to implement (possibly the lesser evil)
>>          note: Hmm, the external tool can choose between allocated/dirty data
>>                too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag altogether.
>
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).
>
> To avoid this, we should have a clear definition of what the reply means
> *by default*, but then we can add a note that clients and servers can
> possibly define other meanings out of band if they want to.
>

I had personally only ever considered #3, where a command would be 
issued to QEMU to begin offering NBD data for some point in time as 
associated with a particular reference/snapshot/backup/etc. This leaves 
it up to the out-of-band client to order up the right data.

That does appear to make choosing a meaningful name for the status bits 
a bit more difficult...

[...but reading ahead, a 'user defined' bit would fit the bill just fine.]

When Vladimir authored a "persistence" feature for bitmaps to be stored 
alongside QCOW2 files, we had difficulty describing exactly what a dirty 
bit meant for the data -- ultimately it is reliant on external 
information. The meaning is simply but ambiguously, "The data associated 
with this bit has been changed since the last time the bit was reset."

We don't record or stipulate the conditions for a reset.

(for our purposes, a reset would occur during a full or incremental backup.)

>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined and native
>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>          undefined data, called 'dirtyness'.
>
> Yeah, having them separate commands might be a bad idea indeed.
>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
>
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.
>
>> 6. Q: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (The server should
>>    return EINVAL if it finds unknown bits; the client should consider a reply
>>    with unknown bits an error.)
>
> Right, probably best to do that, yes.
>
>> ======
>>
>> Also, an idea on 2-4:
>>
>>     As we say that dirtiness is unknown to NBD, and an external tool
>>     should specify, manage and understand which data is actually
>>     transmitted, why not just call it user_data and leave the status field
>>     of the reply chunk unspecified in this case?
>>
>>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>>     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>>     leave the status field to some external... protocol? Who knows what
>>     this user data is.
>
> Yes, this sounds like a reasonable approach.
>

I'd be pretty happy (personally) with some user-defined bits. It keeps a 
lot of the ambiguity about exactly what we're trying to convey out of 
the NBD spec, which is nice.

>>     Note: I'm not sure that I like this (my) proposal. It's just an
>>     idea; maybe someone will like it.  And, I think, it represents what we
>>     are trying to do more honestly.
>
> Indeed.
>
>>     Note2: the next step of generalization will be NBD_CMD_USER, with
>>     variable request size, structured reply and no definition :)
>
> Well, er, no please, if we can avoid it :-)
>
>> Another idea, about backups themselves:
>>
>>     Why do we need allocated/zero status for backup? IMHO we don't.
>
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)
>
> I can understand a "has this changed since time X" request, which the
> "dirty" thing seems to want to be. Whether something is allocated is
> just a special case of that.
>
> Actually, come to think of that. What is the exact use case for this
> thing? I understand you're trying to create incremental backups of
> things, which would imply you don't write from the client that is
> getting the block status thingies, right? If so, how about:
>

Essentially you can create a bitmap object in-memory in QEMU and then 
associate it with a drive. It records changes to the drive, and QEMU can 
be instructed to write out any changes since the last backup to disk, 
while clearing the bits of the bitmap.

It doesn't record a specific point in time; the point is just implicitly 
the last time you cleared the bitmap -- usually the last incremental or 
full backup you've made.

So it does describe "blocks changed since time X"; it's just that we 
don't really know exactly when time X was.
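
As a very rough illustration of those semantics (not QEMU's actual code; 
granularity, sizes and names here are invented):

    #include <stdint.h>

    #define GRANULARITY 65536ULL      /* bytes tracked per bit (assumed) */
    #define NBITS       4096ULL       /* disk size / GRANULARITY */

    static uint8_t dirty[NBITS / 8];  /* the bitmap itself */

    /* A guest write dirties every granule it touches (length > 0). */
    static void mark_dirty(uint64_t offset, uint64_t length)
    {
        for (uint64_t b = offset / GRANULARITY;
             b <= (offset + length - 1) / GRANULARITY; b++) {
            dirty[b / 8] |= 1u << (b % 8);
        }
    }

    /* An incremental backup copies only dirty granules, clearing bits
     * as it goes; "time X" is implicitly the previous such pass. */
    static void incremental_backup(void (*copy_out)(uint64_t off,
                                                    uint64_t len))
    {
        for (uint64_t b = 0; b < NBITS; b++) {
            if (dirty[b / 8] & (1u << (b % 8))) {
                copy_out(b * GRANULARITY, GRANULARITY);
                dirty[b / 8] &= ~(1u << (b % 8));
            }
        }
    }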

This is all well and dandy, except there is some desire from third 
parties to be able to ask QEMU about this dirty block information -- to 
be able to see this bitmap, more or less. We already use NBD for 
exporting data, and instead of inventing a new side-band, we decided 
that we wanted to use NBD to let external users get this information.

What exactly they do with that info is beyond QEMU's scope.

> - NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
>   snapshots. Not required, optional, includes a machine-readable form,
>   not defined by NBD, which explains what the snapshot is about (e.g., a
>   qemu json file). The "base" version of that is just "allocation
>   status", and is implied (i.e., you don't need to run
>   NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
>   allocation status).

We don't necessarily have /snapshots/ per se, but we do have some block 
device and one or more bitmaps describing deltas to one or more backups 
that we do not necessarily have access to.

e.g. a given bitmap may describe a delta to an off-site backup. We know 
the delta, but do not maintain any meaningful handle to the given 
off-site backup, including name, URI, or date.

QEMU leaves this association up to upper management, and provides only 
IDs to facilitate any additional correlation.

We could offer up descriptions of these bitmaps in response to such a 
command, but they're not ... quite snapshots. It is more the case that 
we use them to create snapshots.

> - NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
>   which tell you what the status of a block of data is for each of the
>   relevant snapshots that we know about.
>
> Perhaps this is somewhat overengineered, but it does bring most of the
> definition of what a snapshot is back into the NBD protocol, without
> having to say "this could be anything", and without requiring
> connectivity over two ports for this to be useful (e.g., you could store
> the machine-readable form of the snapshot description into your backup
> program and match what they mean with what you're interested in at
> restore time, etc).
>
> This wouldn't work if you're interested in new snapshots that get
> created once we've already moved into transmission, but hey.
>
> Thoughts?
>
>>     Full backup: just do a structured read - it will show us which chunks
>>     may be treated as zeroes.
>
> Right.
>
>>     Incremental backup: get the dirty bitmap (somehow, for example through
>>     the user-defined part of the proposed command), then, for dirty blocks,
>>     read them through structured read, so information about zero/unallocated
>>     areas is included.
>>
>> For me all the variants above are OK. Let's finally choose something.
>>
>> v2:
>> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
>>
>> Since then, we've added the STRUCTURED_REPLY extension, which
>> necessitates a rather larger rebase; I've also changed things
>> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
>> modes to be determined by boolean flags (rather than by fixed
>> values of the 16-bit flags field), changed the reply status fields
>> to be bitwise-or values (with a default of 0 always being sane),
>> and changed the descriptor layout to drop an offset but to include
>> a 32-bit status so that the descriptor is nicely 8-byte aligned
>> without padding.
>>
>>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 154 insertions(+), 1 deletion(-)
>
> [...]
>
> I'll commit this in a minute into a separate branch called
> "extension-blockstatus", under the understanding that changes are still
> required, as per above (i.e., don't assume that just because there's a
> branch I'm happy with the current result ;-)
>
> Regards
>

Err, I hope I haven't confused everything to heck and back now....

Thanks,
--js

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-28 17:33     ` Wouter Verhelst
@ 2016-11-29  9:17       ` Stefan Hajnoczi
  2016-11-29 10:50       ` Wouter Verhelst
  1 sibling, 0 replies; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-29  9:17 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, pbonzini, mpa, den

[-- Attachment #1: Type: text/plain, Size: 4281 bytes --]

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> Hi Stefan,
> 
> On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote:
> > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > > Quickly: the reason I haven't merged this yet is twofold:
> > > - I wasn't thrilled with the proposal at the time. It felt a bit
> > >   hackish, and bolted onto NBD so you could use it, but without defining
> > >   everything in the NBD protocol. "We're reading some data, but it's not
> > >   about you". That didn't feel right
> > >
> > > - There were a number of questions still unanswered (you're answering a
> > >   few below, so that's good).
> > > 
> > > For clarity, I have no objection whatsoever to adding more commands if
> > > they're useful, but I would prefer that they're also useful with NBD on
> > > its own, i.e., without requiring an initiation or correlation of some
> > > state through another protocol or network connection or whatever. If
> > > that's needed, that feels like I didn't do my job properly, if you get
> > > my point.
> > 
> > The out-of-band operations you are referring to are for dirty bitmap
> > management.  (The goal is to read out blocks that changed since the last
> > backup.)
> > 
> > The client does not access the live disk, instead it accesses a
> > read-only snapshot and the dirty information (so that it can copy out
> > only blocks that were written).  The client is allowed to read blocks
> > that are not dirty too.
> 
> I understood as much, yes.
> 
> > If you want to implement the whole incremental backup workflow in NBD
> > then the client would first have to connect to the live disk, set up
> > dirty tracking, create a snapshot export, and then connect to that
> > snapshot.
> > 
> > That sounds like a big feature set and I'd argue it's for the control
> > plane (storage API) and not the data plane (NBD).  There were
> > discussions about transferring the dirty information via the control
> > plane, but it seems more appropriate to do it in the data plane since it is
> > block-level information.
> 
> I agree that creating and managing snapshots is out of scope for NBD. The
> protocol is not set up for that.
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> My proposal also makes it impossible to get updates of newly created
> snapshots without disconnecting and reconnecting (due to the fact that
> you can't go from transmission back to negotiation), but I'm not sure
> that's a problem.
> 
> Doing so has two advantages:
> - If a client is accidentally (due to misconfiguration or implementation
>   bugs or whatnot) connecting to the wrong server after having created a
>   snapshot through a management protocol, we have an opportunity to
>   detect this error, due to the fact that the "foreign" identifiers
>   passed to the client during negotiation will not match with what the
>   client was expecting.
> - A future version of the protocol could possibly include an extended
>   version of the read command, allowing a client to read information
>   from multiple storage snapshots without requiring a reconnect, and
>   giving current clients information about allocation status across
>   various snapshots (although a first implementation could very well
>   limit itself to only having one snapshot).

Sorry, I misunderstood you.

Snapshots are not very different from NBD exports, especially if the
storage system supports writable snapshots (aka cloning).  Should we
just use named exports?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-11-28 11:19   ` Stefan Hajnoczi
  2016-11-28 23:15   ` John Snow
@ 2016-11-29 10:18   ` Kevin Wolf
  2016-11-29 11:34     ` Vladimir Sementsov-Ogievskiy
  2016-11-30 10:41   ` Sergey Talantov
  3 siblings, 1 reply; 37+ messages in thread
From: Kevin Wolf @ 2016-11-29 10:18 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Vladimir Sementsov-Ogievskiy, nbd-general, qemu-devel, den,
	pborzenkov, stefanha, mpa, pbonzini

On 27.11.2016 at 20:17, Wouter Verhelst wrote:
> > 3. Q: selecting of dirty bitmap to export
> >    A: several variants:
> >       1: id of bitmap is in flags field of request
> >           pros: - simple
> >           cons: - it's a hack. flags field is for other uses.
> >                 - we'll have to map bitmap names to these "ids"
> >       2: introduce extended nbd requests with variable length and exploit this
> >          feature for BLOCK_STATUS command, specifying bitmap identifier.
> >          pros: - looks like a true way
> >          cons: - we have to create additional extension
> >                - possibly we have to create a map,
> >                  {<QEMU bitmap name> <=> <NBD bitmap id>}
> >       3: external tool should select which bitmap to export. So, in the case of Qemu
> >          it should be something like qmp command block-export-dirty-bitmap.
> >          pros: - simple
> >                - we can extend it to behave like (2) later
> >          cons: - additional qmp command to implement (possibly, the lesser evil)
> >          note: Hmm, the external tool can choose between allocated/dirty data too,
> >                so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
> 
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).

Another point to consider is that option 3 doesn't allow you to access
two (or more) different bitmaps from the client without using the side
channel all the time to switch back and forth between them and having to
drain the request queue each time to avoid races.

In general, if we have something that "looks like the true way", I'd
advocate choosing this option. Experience tells us that we'd regret
anything simpler in a year or two, and then we'll have to do the real
thing anyway, but still need to support the quick hack for
compatibility.

> > 5. The number of status descriptors sent by the server should be restricted
> >    variants:
> >    1: just allow server to restrict this as it wants (which was done in v3)
> >    2: (not excluding 1). Client specifies somehow the maximum for number
> >       of descriptors.
> >       2.1: add command flag, which will request only one descriptor
> >            (otherwise, no restrictions from the client)
> >       2.2: again, introduce extended nbd requests, and add field to
> >            specify this maximum
> 
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.

The native qemu block layer interface returns the status of only one
contiguous chunk, so the easiest way to implement the NBD block driver
in qemu would be to always use that bit.

On the other hand, it would be possible for the NBD block driver in qemu
to cache the rest of the response internally and answer the next request
from the cache instead of sending a new request over the network. Maybe
that's what it should be doing anyway for good performance.
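
Something like this on the client side, perhaps (untested sketch; the
32-bit length plus 32-bit status descriptor layout is the one from this
proposal, everything else is invented):

    #include <stdbool.h>
    #include <stdint.h>

    struct extent { uint32_t length; uint32_t status; };

    /* Descriptors left over from the last NBD_CMD_BLOCK_STATUS reply;
     * filled in whenever a reply with multiple descriptors arrives. */
    static struct extent cache[16];
    static unsigned cache_n;       /* valid entries in cache[] */
    static uint64_t cache_start;   /* offset described by cache[0] */

    /* qemu's block layer asks about one contiguous chunk at a time;
     * try to answer from the tail of the previous reply first. */
    static bool cached_block_status(uint64_t offset, struct extent *out)
    {
        uint64_t pos = cache_start;

        for (unsigned i = 0; i < cache_n; i++) {
            if (offset >= pos && offset < pos + cache[i].length) {
                out->length = (uint32_t)(pos + cache[i].length - offset);
                out->status = cache[i].status;
                return true;       /* hit: no network round trip */
            }
            pos += cache[i].length;
        }
        return false;              /* miss: send a new request */
    }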

> > Also, an idea on 2-4:
> > 
> >     As we say that dirtiness is unknown to NBD, and an external tool
> >     should specify, manage and understand which data is actually
> >     transmitted, why not just call it user_data and leave the status field
> >     of the reply chunk unspecified in this case?
> > 
> >     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> >     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> >     Eric's 'Block provisioning status' paragraph.  If it is set, we just
> >     leave the status field to some external... protocol? Who knows what
> >     this user data is.
> 
> Yes, this sounds like a reasonable approach.

Makes sense to me, too.

However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
have use for a second one. If we go with one of the options where the
bitmap is selected per command, we're fine because you can simply move
the second bit to a different bitmap and do two requests. If we have
only a single active bitmap at a time, though, this feels like an actual
problem.

> > Another idea, about backups themselves:
> > 
> >     Why do we need allocated/zero status for backup? IMHO we don't.
> 
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)

I think we do. A block can be dirtied by discarding/write_zero-ing it.
Then we want the dirty bit so that we know to include this block in
the incremental backup, but we also want to know that we don't actually
have to transfer the data in it.
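
In other words, something like this (sketch only; bit values as in the
current extension text, with a clear NBD_STATE_CLEAN meaning "dirty"):

    #include <stdbool.h>
    #include <stdint.h>

    #define NBD_STATE_ZERO  (1u << 1)   /* bit 1 */
    #define NBD_STATE_CLEAN (1u << 2)   /* bit 2 */

    /* Decide what an incremental backup does with one extent. */
    static bool extent_needs_data_transfer(uint32_t status)
    {
        bool dirty = !(status & NBD_STATE_CLEAN);   /* clear == dirty */
        bool zero  = status & NBD_STATE_ZERO;

        /* Dirty extents must appear in the backup, but an extent known
         * to read as all zeroes can be recorded without its bytes. */
        return dirty && !zero;
    }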

Kevin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-28 17:33     ` Wouter Verhelst
  2016-11-29  9:17       ` Stefan Hajnoczi
@ 2016-11-29 10:50       ` Wouter Verhelst
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
                           ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-11-29 10:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, den, mpa, pbonzini

Hi,

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.

To make this a bit more concrete, I've changed the proposal like so:

From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
From: Wouter Verhelst <w@uter.be>
Date: Tue, 29 Nov 2016 11:46:04 +0100
Subject: [PATCH] Update with allocation contexts etc

Signed-off-by: Wouter Verhelst <w@uter.be>
---
 doc/proto.md | 210 +++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 145 insertions(+), 65 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index dfdd06d..fe7ae53 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -768,8 +768,6 @@ The field has the following format:
   to that command to the client. In the absense of this flag, clients
   SHOULD NOT multiplex their commands over more than one connection to
   the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
-  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -871,6 +869,46 @@ of the newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_OPT_ALLOC_CONTEXT` (10)
+
+    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+    followed by an `NBD_REP_ACK`. If a server replies to such a request
+    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+    commands during the transmission phase.
+
+    If the query string is syntactically invalid, the server SHOULD send
+    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+    but finds no allocation contexts, the server MUST send a single
+    reply of type `NBD_REP_ACK`.
+
+    This option MUST NOT be requested unless structured replies have
+    been negotiated first. If a client attempts to do so, a server
+    SHOULD send `NBD_REP_ERR_INVALID`.
+
+    Data:
+    - 32 bits, type
+    - String, query to select a subset of the available allocation
+      contexts. If this is not specified (i.e., length is 4 and no
+      command is sent), then the server MUST send all the allocation
+      contexts it knows about. If specified, this query string MUST
+      start with a name that uniquely identifies a server
+      implementation; e.g., the reference implementation that
+      accompanies this document would support query strings starting
+      with 'nbd-server:'
+
+    The type may be one of:
+    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+      selected by the query string is returned to the client without
+      changing any state (i.e., this does not add allocation contexts
+      for further usage).
+    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+      selected by the query string is added to the list of existing
+      allocation contexts.
+    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+      selected by the query string is removed from the list of used
+      allocation contexts. Servers SHOULD NOT reuse existing allocation
+      context IDs.
+
 #### Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_REP_ALLOC_CONTEXT` (4)
+
+    A description of an allocation context. Data:
+
+    - 32 bits, NBD allocation context ID. If the request was NOT of type
+      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+    - String, name of the allocation context. This is not required to be
+      a human-readable string, but it MUST be valid UTF-8 data.
+
+    Allocation context ID 0 is implied, and always exists. It cannot be
+    removed.
+
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
 case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for display to the user.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
+* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
 
     The server is unwilling to continue negotiation as it is in the
     process of being shut down.
 
-* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
+* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+##### Allocation contexts
+
+Allocation context 0 is the basic "exists at all" allocation context. If
+an extent is not allocated at allocation context 0, it MUST NOT be
+listed as allocated at another allocation context. This supports sparse
+file semantics on the server side. If a server has only one allocation
+context (the default), then writing to an extent which is allocated in
+that allocation context 0 MUST NOT fail with ENOSPC.
+
+For all other cases, this specification requires no specific semantics
+of allocation contexts. Implementations could support allocation
+contexts with semantics like the following:
+
+- Incremental snapshots; if a block is allocated in one allocation
+  context, that implies that it is also allocated in the next level up.
+- Various bits of data about the backend of the storage; e.g., if the
+  storage is written on a RAID array, an allocation context could
+  return information about the redundancy level of a given extent
+- If the backend implements a write-through cache of some sort, or
+  synchronises with other servers, an allocation context could state
+  that an extent is "allocated" once it has reached permanent storage
+  and/or is synchronized with other servers.
+
+The only requirement of an allocation context is that it MUST be
+representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
+
+Likewise, the syntax of query strings is not specified by this document.
+
+Server implementations SHOULD document their syntax for query strings
+and semantics for resulting allocation contexts in a document like this
+one.
+
 ### Transmission phase
 
 #### Flag fields
@@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake phase.
    content chunk in reply.  MUST NOT be set unless the transmission
    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
    `EOVERFLOW` error chunk, if the request length is too large.
+- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
+  set, the client is interested in only one extent per allocation
+  context.
 
 ##### Structured reply flags
 
@@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
 ranges with their respective states.  This extension is not available
 unless the client also negotiates the `STRUCTURED_REPLY` extension.
 
-* `NBD_FLAG_SEND_BLOCK_STATUS`
-
-    The server SHOULD set this transmission flag to 1 if structured
-    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
-    request is supported.
-
 * `NBD_REPLY_TYPE_BLOCK_STATUS`
 
-    *length* MUST be a positive integer multiple of 8.  This reply
+    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
     represents a series of consecutive block descriptors where the sum
     of the lengths of the descriptors MUST not be greater than the
-    length of the original request.  This chunk type MUST appear at most
-    once in a structured reply. Valid as a reply to
+    length of the original request. This chunk type MUST appear at most
+    once per allocation ID in a structured reply. Valid as a reply to
     `NBD_CMD_BLOCK_STATUS`.
 
-    The payload is structured as a list of one or more descriptors,
-    each with this layout:
+    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
+    allocation context ID, except if the semantics of particular
+    allocation contexts mean that the information for one allocation
+    context is implied by the information for another.
+
+    The payload starts with:
+
+        * 32 bits, allocation context ID
+
+    and is followed by a list of one or more descriptors, each with this
+    layout:
 
         * 32 bits, length (unsigned, MUST NOT be zero)
         * 32 bits, status flags
 
-    The definition of the status flags is determined based on the
-    flags present in the original request.
+    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
+    then every reply chunk MUST NOT contain more than one descriptor.
+
+    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
+    its request, the server MAY return fewer descriptors in the reply
+    than would be required to fully specify the whole range of requested
+    information to the client, if the number of descriptors would be
+    over 16 otherwise and looking up the information would be too
+    resource-intensive for the server.
 
 * `NBD_CMD_BLOCK_STATUS`
 
-    A block status query request. Length and offset define the range
-    of interest. Clients SHOULD NOT use this request unless the server
-    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
-    in turn requires the client to first negotiate structured replies.
-    For a successful return, the server MUST use a structured reply,
-    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+    A block status query request. Length and offset define the range of
+    interest. Clients SHOULD NOT use this request unless allocation
+    contexts have been negotiated, which in turn requires the client to
+    first negotiate structured replies. For a successful return, the
+    server MUST use a structured reply, containing at least one chunk of
+    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
 
     The list of block status descriptors within the
     `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
@@ -1427,18 +1522,12 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
     requested length, and *status* of 0 rather than reporting the
     error.
 
-    The type of information requested by the client is determined by
-    the request flags, as follows:
-
-    1. Block provisioning status
+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
+    return the status of the device, where the status field of each
+    descriptor is determined by the following bits (all combinations of
+    these bits are possible):
 
-    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
-    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
-    provisioning status of the device, where the status field of each
-    descriptor is determined by the following bits (all four
-    combinations of these two bits are possible):
-
-      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
         (and future writes to that area may cause fragmentation or
         encounter an `ENOSPC` error); if clear, the block is allocated
         or the server could not otherwise determine its status.  Note
@@ -1446,43 +1535,34 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
         that the server MAY report a hole even where trim has not been
         requested, and also that a server MAY report allocation even
         where a trim has been requested.
-      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
+      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
         all zeroes; if clear, the block contents are not known.  Note
         that the use of `NBD_CMD_WRITE_ZEROES` is related to this
         status, but that the server MAY report zeroes even where write
         zeroes has not been requested, and also that a server MAY
         report unknown content even where write zeroes has been
         requested.
-
-    The client SHOULD NOT read from an area that has both
-    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
-
-    2. Block dirtiness status
-
-    This command is meant to operate in tandem with other (non-NBD)
-    channels to the server.  Generally, a "dirty" block is a block
-    that has been written to by someone, but the exact meaning of "has
-    been written" is left to the implementation.  For example, a
-    virtual machine monitor could provide a (non-NBD) command to start
-    tracking blocks written by the virtual machine.  A backup client
-    can then connect to an NBD server provided by the virtual machine
-    monitor and use `NBD_CMD_BLOCK_STATUS` with the
-    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
-    blocks that the virtual machine has changed.
-
-    An implementation that doesn't track the "dirtiness" state of
-    blocks MUST either fail this command with `EINVAL`, or mark all
-    blocks as dirty in the descriptor that it returns.  Upon receiving
-    an `NBD_CMD_BLOCK_STATUS` command with the flag
-    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
-    status of the device, where the status field of each descriptor is
-    determined by the following bit:
-
-      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
         portion of the file that is still clean because it has not
         been written; if clear, the block represents a portion of the
         file that is dirty, or where the server could not otherwise
-        determine its status.
+        determine its status. The server MUST NOT set this bit for
+        allocation context 0, where it has no meaning.
+      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
+	portion of the file that is "active" in the given allocation
+	context. The server MUST NOT set this bit for allocation context
+	0, where it has no meaning.
+
+    The exact semantics of what it means for a block to be "clean" or
+    "active" at a given allocation context is not defined by this
+    specification, except that the default in both cases should be to
+    clear the bit. That is, when the allocation context does not have
+    knowledge of the relevant status for the given extent, or when the
+    allocation context does not assign any meaning to it, the bits
+    should be cleared.
+
+    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
+    set and `NBD_STATE_ZERO` clear.
 
 A client MAY close the connection if it detects that the server has
 sent an invalid chunks (such as lengths in the
@@ -1492,9 +1572,9 @@ request including one or more sectors beyond the size of the device.
 
 The extension adds the following new command flag:
 
-- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
-  SHOULD be set to 1 if the client wants to request dirtiness status
-  rather than provisioning status.
+- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request information for only
+  one extent per allocation context.
 
 ## About this file

I've pushed this to the extension-blockstatus branch, too.

Thoughts?
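
For clarity, a client would walk the proposed chunk payload roughly like
this (illustrative C only; big-endian wire format, error handling
elided):

    #include <arpa/inet.h>   /* ntohl() */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* One NBD_REPLY_TYPE_BLOCK_STATUS payload: a 32-bit allocation
     * context ID followed by (length, status) descriptor pairs. */
    static int parse_block_status(const uint8_t *payload, size_t len)
    {
        uint32_t ctx_id, extent_len, status;

        if (len < 4 + 8 || (len - 4) % 8 != 0) {
            return -1;                     /* malformed chunk */
        }
        memcpy(&ctx_id, payload, 4);
        ctx_id = ntohl(ctx_id);

        for (size_t off = 4; off < len; off += 8) {
            memcpy(&extent_len, payload + off, 4);
            memcpy(&status, payload + off + 4, 4);
            printf("context %u: %u bytes, status 0x%x\n",
                   ctx_id, ntohl(extent_len), ntohl(status));
        }
        return 0;
    }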
 
-- 
2.10.2


-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:18   ` Kevin Wolf
@ 2016-11-29 11:34     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 11:34 UTC (permalink / raw)
  To: Kevin Wolf, Wouter Verhelst
  Cc: nbd-general, qemu-devel, den, pborzenkov, stefanha, mpa, pbonzini

29.11.2016 13:18, Kevin Wolf wrote:
> Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
>>> 3. Q: selecting of dirty bitmap to export
>>>     A: several variants:
>>>        1: id of bitmap is in flags field of request
>>>            pros: - simple
>>>            cons: - it's a hack. flags field is for other uses.
>>>                  - we'll have to map bitmap names to these "ids"
>>>        2: introduce extended nbd requests with variable length and exploit this
>>>           feature for BLOCK_STATUS command, specifying bitmap identifier.
>>>           pros: - looks like a true way
>>>           cons: - we have to create additional extension
>>>                 - possibly we have to create a map,
>>>                   {<QEMU bitmap name> <=> <NBD bitmap id>}
>>>        3: external tool should select which bitmap to export. So, in the case of Qemu
>>>           it should be something like qmp command block-export-dirty-bitmap.
>>>           pros: - simple
>>>                 - we can extend it to behave like (2) later
>>>           cons: - additional qmp command to implement (possibly, the lesser evil)
>>>           note: Hmm, the external tool can choose between allocated/dirty data too,
>>>                 so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
>> Downside of 3, though, is that it moves the definition of what the
>> different states mean outside of the NBD protocol (i.e., the protocol
>> messages are not entirely defined anymore, and their meaning depends on
>> the clients and servers in use).
> Another point to consider is that option 3 doesn't allow you to access
> two (or more) different bitmaps from the client without using the side
> channel all the time to switch back and forth between them and having to
> drain the request queue each time to avoid races.
>
> In general, if we have something that "looks like the true way", I'd
> advocate to choose this option. Experience tells that we'd regret
> anything simpler in a year or two, and then we'll have to do the real
> thing anyway, but still need to support the quick hack for
> compatibility.
>
>>> 5. The number of status descriptors sent by the server should be restricted
>>>     variants:
>>>     1: just allow server to restrict this as it wants (which was done in v3)
>>>     2: (not excluding 1). Client specifies somehow the maximum for number
>>>        of descriptors.
>>>        2.1: add command flag, which will request only one descriptor
>>>             (otherwise, no restrictions from the client)
>>>        2.2: again, introduce extended nbd requests, and add field to
>>>             specify this maximum
>> I think having a flag which requests just one descriptor can be useful,
>> but I'm hesitant to add it unless it's actually going to be used; so in
>> other words, I'll leave the decision on that bit to you.
> The native qemu block layer interface returns the status of only one
> contiguous chunk, so the easiest way to implement the NBD block driver
> in qemu would be to always use that bit.
>
> On the other hand, it would be possible for the NBD block driver in qemu
> to cache the rest of the response internally and answer the next request
> from the cache instead of sending a new request over the network. Maybe
> that's what it should be doing anyway for good performance.
>
>>> Also, an idea on 2-4:
>>>
> >>      As we say that dirtiness is unknown to NBD, and an external tool
> >>      should specify, manage and understand which data is actually
> >>      transmitted, why not just call it user_data and leave the status field
> >>      of the reply chunk unspecified in this case?
>>>
>>>      So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> >>      NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> >>      Eric's 'Block provisioning status' paragraph.  If it is set, we just
> >>      leave the status field to some external... protocol? Who knows what
> >>      this user data is.
>> Yes, this sounds like a reasonable approach.
> Makes sense to me, too.
>
> However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
> have use for a second one. If we go with one of the options where the
> bitmap is selected per command, we're fine because you can simply move
> the second bit to a different bitmap and do two requests. If we have
> only a single active bitmap at a time, though, this feels like an actual
> problem.
>
>>> Another idea, about backups themselves:
>>>
>>>      Why do we need allocated/zero status for backup? IMHO we don't.
>> Well, I've been thinking so all along, but then I don't really know what
>> it is, in detail, that you want to do :-)
> I think we do. A block can be dirtied by discarding/write_zero-ing it.
> Then we want the dirty bit so that we know to include this block in
> the incremental backup, but we also want to know that we don't actually
> have to transfer the data in it.

And we will know that automatically by using the structured read command, 
so a separate call to get_block_status is not needed.
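
I.e., roughly (sketch only; the chunk type values are the ones I believe
the experimental STRUCTURED_REPLY draft assigns, and the backup_*
helpers are hypothetical):

    #include <stdint.h>

    #define NBD_REPLY_TYPE_OFFSET_DATA 1   /* assumed draft values */
    #define NBD_REPLY_TYPE_OFFSET_HOLE 2

    /* Hypothetical backup sinks. */
    void backup_write(uint64_t offset, const void *data, uint32_t length);
    void backup_mark_zero(uint64_t offset, uint32_t length);

    /* While reading a dirty extent with a structured NBD_CMD_READ,
     * hole chunks already tell us which ranges read as zeroes, so no
     * separate block-status query is needed for that. */
    static void handle_read_chunk(uint16_t type, uint64_t offset,
                                  uint32_t length, const void *data)
    {
        switch (type) {
        case NBD_REPLY_TYPE_OFFSET_DATA:
            backup_write(offset, data, length);    /* payload bytes */
            break;
        case NBD_REPLY_TYPE_OFFSET_HOLE:
            backup_mark_zero(offset, length);      /* no payload */
            break;
        }
    }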

>
> Kevin


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:50       ` Wouter Verhelst
@ 2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
  2016-11-29 13:08           ` Wouter Verhelst
  2016-11-29 13:07         ` Alex Bligh
  2016-12-01 10:14         ` Wouter Verhelst
  2 siblings, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 12:41 UTC (permalink / raw)
  To: Wouter Verhelst, Stefan Hajnoczi
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, den, mpa, pbonzini

Hi,

29.11.2016 13:50, Wouter Verhelst wrote:
> Hi,
>
> On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
>> However, I'm arguing that if we're going to provide information about
>> snapshots, we should be able to properly refer to these snapshots from
>> within an NBD context. My previous mail suggested adding a negotiation
>> message that would essentially ask the server "tell me about the
>> snapshots you know about", giving them an NBD identifier in the process
>> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
>> identifier and that could be used to match the NBD identifier to
>> something implementation-defined). This would be read-only information;
>> the client cannot ask the server to create new snapshots. We can then
>> later in the protocol refer to these snapshots by way of that NBD
>> identifier.
> To make this a bit more concrete, I've changed the proposal like so:
>
>  From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
> From: Wouter Verhelst <w@uter.be>
> Date: Tue, 29 Nov 2016 11:46:04 +0100
> Subject: [PATCH] Update with allocation contexts etc
>
> Signed-off-by: Wouter Verhelst <w@uter.be>
> ---
>   doc/proto.md | 210 +++++++++++++++++++++++++++++++++++++++++------------------
>   1 file changed, 145 insertions(+), 65 deletions(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index dfdd06d..fe7ae53 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -768,8 +768,6 @@ The field has the following format:
>     to that command to the client. In the absense of this flag, clients
>     SHOULD NOT multiplex their commands over more than one connection to
>     the export.
> -- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
> -  `BLOCK_STATUS` extension; see below.
>   
>   Clients SHOULD ignore unknown flags.
>   
> @@ -871,6 +869,46 @@ of the newstyle negotiation.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.
> +
> +    If the query string is syntactically invalid, the server SHOULD send
> +    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> +    but finds no allocation contexts, the server MUST send a single
> +    reply of type `NBD_REP_ACK`.
> +
> +    This option MUST NOT be requested unless structured replies have
> +    been negotiated first. If a client attempts to do so, a server
> +    SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +    Data:
> +    - 32 bits, type
> +    - String, query to select a subset of the available allocation
> +      contexts. If this is not specified (i.e., length is 4 and no
> +      command is sent), then the server MUST send all the allocation
> +      contexts it knows about. If specified, this query string MUST
> +      start with a name that uniquely identifies a server
> +      implementation; e.g., the reference implementation that
> +      accompanies this document would support query strings starting
> +      with 'nbd-server:'
> +
> +    The type may be one of:
> +    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
> +      selected by the query string is returned to the client without
> +      changing any state (i.e., this does not add allocation contexts
> +      for further usage).
> +    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> +      selected by the query string is added to the list of existing
> +      allocation contexts.
> +    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
> +      selected by the query string is removed from the list of used
> +      allocation contexts. Servers SHOULD NOT reuse existing allocation
> +      context IDs.
> +
>   #### Option reply types
>   
>   These values are used in the "reply type" field, sent by the server
> @@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> +- `NBD_REP_ALLOC_CONTEXT` (4)
> +
> +    A description of an allocation context. Data:
> +
> +    - 32 bits, NBD allocation context ID. If the request was NOT of type
> +      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
> +    - String, name of the allocation context. This is not required to be
> +      a human-readable string, but it MUST be valid UTF-8 data.
> +
> +    Allocation context ID 0 is implied, and always exists. It cannot be
> +    removed.
> +
>   There are a number of error reply types, all of which are denoted by
>   having bit 31 set. All error replies MAY have some data set, in which
>   case that data is an error message string suitable for display to the user.
> @@ -938,15 +988,47 @@ case that data is an error message string suitable for display to the user.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
> +* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
>   
>       The server is unwilling to continue negotiation as it is in the
>       process of being shut down.
>   
> -* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
> +* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> +##### Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse

Is "allocated" here a range with the NBD_STATE_HOLE bit unset?


> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.

IMHO, this paragraph is redundant for this spec.

> +
>   ### Transmission phase
>   
>   #### Flag fields
> @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake phase.
>      content chunk in reply.  MUST NOT be set unless the transmission
>      flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>      `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per allocation
> +  context.
>   
>   ##### Structured reply flags
>   
> @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
>   ranges with their respective states.  This extension is not available
>   unless the client also negotiates the `STRUCTURED_REPLY` extension.
>   
> -* `NBD_FLAG_SEND_BLOCK_STATUS`
> -
> -    The server SHOULD set this transmission flag to 1 if structured
> -    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
> -    request is supported.
> -
>   * `NBD_REPLY_TYPE_BLOCK_STATUS`
>   
> -    *length* MUST be a positive integer multiple of 8.  This reply
> +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
>       represents a series of consecutive block descriptors where the sum
>       of the lengths of the descriptors MUST not be greater than the
> -    length of the original request.  This chunk type MUST appear at most
> -    once in a structured reply. Valid as a reply to
> +    length of the original request. This chunk type MUST appear at most
> +    once per allocation ID in a structured reply. Valid as a reply to
>       `NBD_CMD_BLOCK_STATUS`.
>   
> -    The payload is structured as a list of one or more descriptors,
> -    each with this layout:
> +    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> +    allocation context ID, except if the semantics of particular
> +    allocation contexts mean that the information for one allocation
> +    context is implied by the information for another.

So, actually, instead of selecting which bitmap we want to access with an 
nbd_cmd or with an external tool, we just reply with all bitmaps 
(negotiated at the beginning). Personally, I dislike this. With such an 
approach, we will always have to export the allocation bitmap, even if we 
need only dirtiness. Consider that requesting the allocation bitmap may be 
much more expensive in time than requesting dirtiness. Or is the 
allocation bitmap information implied by dirtiness?

Furthermore, as allocation context semantics are defined externally, 
'semantics mean that the information is implied' states nothing, and we 
actually return to option 3, where an external tool decides which bitmap 
to export.

> +
> +    The payload starts with:
> +
> +        * 32 bits, allocation context ID
> +
> +    and is followed by a list of one or more descriptors, each with this
> +    layout:
>   
>           * 32 bits, length (unsigned, MUST NOT be zero)
>           * 32 bits, status flags
>   
> -    The definition of the status flags is determined based on the
> -    flags present in the original request.
> +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
> +    then every reply chunk MUST NOT contain more than one descriptor.
> +
> +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> +    its request, the server MAY return fewer descriptors in the reply
> +    than would be required to fully specify the whole range of requested
> +    information to the client, if the number of descriptors would be
> +    over 16 otherwise and looking up the information would be too
> +    resource-intensive for the server.

So, if there are <= 16 extents, they all MUST be present in the reply. 
(Just a note.)

>   
>   * `NBD_CMD_BLOCK_STATUS`
>   
> -    A block status query request. Length and offset define the range
> -    of interest. Clients SHOULD NOT use this request unless the server
> -    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
> -    in turn requires the client to first negotiate structured replies.
> -    For a successful return, the server MUST use a structured reply,
> -    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +    A block status query request. Length and offset define the range of
> +    interest. Clients SHOULD NOT use this request unless allocation
> +    contexts have been negotiated, which in turn requires the client to
> +    first negotiate structured replies. For a successful return, the
> +    server MUST use a structured reply, containing at least one chunk of
> +    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
>   
>       The list of block status descriptors within the
>       `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> @@ -1427,18 +1522,12 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>       requested length, and *status* of 0 rather than reporting the
>       error.
>   
> -    The type of information requested by the client is determined by
> -    the request flags, as follows:
> -
> -    1. Block provisioning status
> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
> +    return the status of the device, where the status field of each
> +    descriptor is determined by the following bits (all combinations of
> +    these bits are possible):
>   
> -    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
> -    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
> -    provisioning status of the device, where the status field of each
> -    descriptor is determined by the following bits (all four
> -    combinations of these two bits are possible):
> -
> -      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
> +      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
>           (and future writes to that area may cause fragmentation or
>           encounter an `ENOSPC` error); if clear, the block is allocated
>           or the server could not otherwise determine its status.  Note
> @@ -1446,43 +1535,34 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>           that the server MAY report a hole even where trim has not been
>           requested, and also that a server MAY report allocation even
>           where a trim has been requested.
> -      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
> +      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
>           all zeroes; if clear, the block contents are not known.  Note
>           that the use of `NBD_CMD_WRITE_ZEROES` is related to this
>           status, but that the server MAY report zeroes even where write
>           zeroes has not been requested, and also that a server MAY
>           report unknown content even where write zeroes has been
>           requested.
> -
> -    The client SHOULD NOT read from an area that has both
> -    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> -
> -    2. Block dirtiness status
> -
> -    This command is meant to operate in tandem with other (non-NBD)
> -    channels to the server.  Generally, a "dirty" block is a block
> -    that has been written to by someone, but the exact meaning of "has
> -    been written" is left to the implementation.  For example, a
> -    virtual machine monitor could provide a (non-NBD) command to start
> -    tracking blocks written by the virtual machine.  A backup client
> -    can then connect to an NBD server provided by the virtual machine
> -    monitor and use `NBD_CMD_BLOCK_STATUS` with the
> -    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
> -    blocks that the virtual machine has changed.
> -
> -    An implementation that doesn't track the "dirtiness" state of
> -    blocks MUST either fail this command with `EINVAL`, or mark all
> -    blocks as dirty in the descriptor that it returns.  Upon receiving
> -    an `NBD_CMD_BLOCK_STATUS` command with the flag
> -    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
> -    status of the device, where the status field of each descriptor is
> -    determined by the following bit:
> -
> -      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
> +      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
>           portion of the file that is still clean because it has not
>           been written; if clear, the block represents a portion of the
>           file that is dirty, or where the server could not otherwise
> -        determine its status.
> +        determine its status. The server MUST NOT set this bit for
> +        allocation context 0, where it has no meaning.
> +      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
> +	portion of the file that is "active" in the given allocation
> +	context. The server MUST NOT set this bit for allocation context
> +	0, where it has no meaning.
> +
> +    The exact semantics of what it means for a block to be "clean" or
> +    "active" at a given allocation context is not defined by this
> +    specification, except that the default in both cases should be to
> +    clear the bit. That is, when the allocation context does not have
> +    knowledge of the relevant status for the given extent, or when the
> +    allocation context does not assign any meaning to it, the bits
> +    should be cleared.
> +
> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
> +    set and `NBD_STATE_ZERO` clear.
>   
>   A client MAY close the connection if it detects that the server has
>   sent an invalid chunk (such as lengths in the
> @@ -1492,9 +1572,9 @@ request including one or more sectors beyond the size of the device.
>   
>   The extension adds the following new command flag:
>   
> -- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
> -  SHOULD be set to 1 if the client wants to request dirtiness status
> -  rather than provisioning status.
> +- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request information for only
> +  one extent per allocation context.
>   
>   ## About this file
>
> I've pushed this to the extension-blockstatus branch, too.
>
> Thoughts?
>   

To me, treating dirtiness as just another allocation context sounds a
bit weird, as dirtiness is not allocation. But it is not so important.
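
To make the bit semantics above concrete for myself, a rough sketch of
decoding one descriptor's status field (Python; bit values as in the
proposal above, everything else illustrative):

NBD_STATE_HOLE   = 1 << 0  # unallocated; writes may fragment or ENOSPC
NBD_STATE_ZERO   = 1 << 1  # reads as all zeroes
NBD_STATE_CLEAN  = 1 << 2  # meaningless for allocation context 0
NBD_STATE_ACTIVE = 1 << 3  # meaningless for allocation context 0

def describe(flags, context_id):
    parts = ["hole" if flags & NBD_STATE_HOLE else "allocated-or-unknown",
             "zero" if flags & NBD_STATE_ZERO else "content-unknown"]
    if context_id != 0:
        parts.append("clean" if flags & NBD_STATE_CLEAN else "dirty-or-unknown")
        parts.append("active" if flags & NBD_STATE_ACTIVE else "inactive-or-unknown")
    # per the draft, HOLE set with ZERO clear means "don't read this area"
    return ", ".join(parts)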


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-25 11:28 [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension Vladimir Sementsov-Ogievskiy
  2016-11-25 14:02 ` Stefan Hajnoczi
  2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-11-29 12:57 ` Alex Bligh
  2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
  2 siblings, 2 replies; 37+ messages in thread
From: Alex Bligh @ 2016-11-29 12:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

Vladimir,

I went back to April to reread the previous train of conversation,
and found you had helpfully summarised some of it. Comments
below.

Rather than comment on many of the points individually: the root
of my confusion, and to some extent discomfort, about this
proposal is 'who owns the meaning of the bitmaps'.

Some of this is my own confusion (sorry) about the use to which
this is being put, which is I think at root a documentation issue.
To illustrate this, you write in the FAQ section that this is for
read-only disks, but the text talks about:

+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)

I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.

In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets the bitmap, which is outside
the scope of this protocol; the protocol merely transmits that state.

However, we have the uncomfortable (to me) situation where the protocol
describes a flag 'dirty', with implications as to what it does, but
no actual strict definition of how it's set. So any 'other' user has
no real idea how to use the information, or how to implement a server
that provides a 'dirty' bit, because the semantics of that aren't within
the protocol. This does not sit happily with me.

So I'm wondering whether we should simplify and generalise this spec. You
say that for the dirty flag, there's no specification of when it is
set and cleared - that's implementation defined. Would it not be better
then to say 'that whole thing is private to Qemu - even the name'?

Rather you could read the list of bitmaps a server has, with a textual
name, each having an index (and perhaps a granularity). You could then
ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
and some (e.g. ones beginning with 'X-') could be reserved for user
usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
need a protocol change to support it.
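
To sketch what I mean (purely illustrative: nothing below is in the
spec, and the wire layout is invented for the example), each advertised
bitmap could carry an index, a granularity and a name:

import struct

def parse_bitmap_entry(payload):
    # hypothetical entry: u32 index, u32 granularity, u32 name length, name
    index, granularity, name_len = struct.unpack_from(">III", payload, 0)
    name = payload[12:12 + name_len].decode("utf-8")
    return index, granularity, name

entry = struct.pack(">III", 1, 65536, 12) + b"X-QEMU-DIRTY"
print(parse_bitmap_entry(entry))   # (1, 65536, 'X-QEMU-DIRTY')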

IE rather than looking at 'a way of reading the dirty bit', we could
have this as a generic way of reading opaque bitmaps. Only one (allocation)
might be given meaning to start off with, and it wouldn't be necessary
for all servers to support that - i.e. you could support bitmap reading
without having an ALLOCATION bitmap available.

This spec would then be limited to the transmission of the bitmaps
(remove the word 'dirty' entirely, except perhaps as an illustrative
use case), and include only the definition of the allocation bitmap.

Some more nits:

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in master branch.
> 
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Each documentation branch should normally be branched off master unless
it depends on another extension (in which case it will be branched from that).
I haven't been rebasing them frequently as it can disrupt those working
on the branches. There's only really an issue around rebasing where you
depend on another branch.


> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>   A: 1: server replies with status descriptors of any size, granularity
>         is hidden from the client
>      2: dirty/allocated requests are separate and unrelated to each
>         other, so their granularities are not intersecting

I'm OK with this, but note that you do actually mention a granularity
of sorts in the spec (512 bytes) - I think you should replace that
with the minimum block size.

> 3. Q: selecting of dirty bitmap to export
>   A: several variants:
>      1: id of bitmap is in flags field of request
>          pros: - simple
>          cons: - it's a hack. flags field is for other uses.
>                - we'll have to map bitmap names to these "ids"
>      2: introduce extended nbd requests with variable length and exploit this
>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>         pros: - looks like a true way
>         cons: - we have to create additional extension
>               - possible we have to create a map,
>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>      3: external tool should select which bitmap to export. So, in case of Qemu
>         it should be something like qmp command block-export-dirty-bitmap.
>         pros: - simple
>               - we can extend it to behave like (2) later
>         cons: - additional qmp command to implement (possibly, the lesser evil)
>        note: Hmm, external tool can choose between allocated/dirty data too,
>               so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Yes, this is all pretty horrible. I suspect we want to do something like (2),
and permit extra data to be sent across (in my proposal, it would only be one byte to select
the index). I suppose one could ask for a list of bitmaps.
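
Illustrative only: the fixed 28-byte request layout below is the
existing one; the trailing index byte and the command number are
hypothetical.

import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_BLOCK_STATUS = 7      # hypothetical value, for the example only

def block_status_request(handle, offset, length, bitmap_index):
    req = struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, 0,
                      NBD_CMD_BLOCK_STATUS, handle, offset, length)
    return req + struct.pack(">B", bitmap_index)   # the one extra byte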

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>   cons: Two commands with almost same semantic and similar means?
>   pros: However here is a good point of separating clearly defined and native
>         for block devices GET_BLOCK_STATUS from user-driven and actually
>         undefined data, called 'dirtyness'.

I'm suggesting one generic 'read bitmap' command like you.

> 5. Number of status descriptors, sent by server, should be restricted
>   variants:
>   1: just allow server to restrict this as it wants (which was done in v3)
>   2: (not excluding 1). Client specifies somehow the maximum for number
>      of descriptors.
>      2.1: add command flag, which will request only one descriptor
>           (otherwise, no restrictions from the client)
>      2.2: again, introduce extended nbd requests, and add field to
>           specify this maximum

I think some form of extended request is the way to go, but out of
interest, what's the issue with as many descriptors being sent as it
takes to encode the reply? The client can just consume the remainder
(without buffering) and reissue the request at a later point for
the areas it discarded.
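
Roughly what I have in mind on the client side; a sketch, with
block_status() standing in for a hypothetical helper that yields
(length, flags) pairs from the reply:

def read_some_extents(conn, offset, length, keep_at_most):
    # consume every descriptor, but buffer only the first keep_at_most
    kept, pos = [], offset
    for ext_len, flags in conn.block_status(offset, length):
        pos += ext_len
        if len(kept) < keep_at_most:
            kept.append((pos - ext_len, ext_len, flags))
    resume = kept[-1][0] + kept[-1][1] if kept else offset
    return kept, resume   # re-issue from 'resume' for the discarded area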

> 
> 6. A: What to do with unspecified flags (in request/reply)?
>   I think the normal variant is to make them reserved. (Server should
>   return EINVAL if it finds unknown bits, client should consider a reply
>   with unknown bits as an error)

Yeah.

> 
> +
> +* `NBD_CMD_BLOCK_STATUS`
> +
> +    A block status query request. Length and offset define the range
> +    of interest. Clients SHOULD NOT use this request unless the server

MUST NOT is what we say elsewhere I believe.

> +    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
> +    in turn requires the client to first negotiate structured replies.
> +    For a successful return, the server MUST use a structured reply,
> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.

Nit: are you saying that non-structured error replies are permissible?
You're always/often going to get a non-structured (simple) error reply
if the server doesn't support the command, but I think it would be fair to say the
server MUST use a structured reply to NBD_CMD_BLOCK_STATUS if
it supports the command. This is effectively what we say re NBD_CMD_READ.

> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file starting from specified *offset*, and the sum of the
> +    *length* fields of each descriptor MUST NOT be greater than the
> +    overall *length* of the request. This means that the server MAY
> +    return less data than requested. However the server MUST return at
> +    least one status descriptor

I'm not sure I understand why that's useful. What should the client
infer from the server refusing to provide information? We don't
permit short reads etc.

> .  The server SHOULD use different
> +    *status* values between consecutive descriptors, and SHOULD use
> +    descriptor lengths that are an integer multiple of 512 bytes where
> +    possible (the first and last descriptor of an unaligned query being
> +    the most obvious places for an exception).

Surely better would be an integer multiple of the minimum block
size. Being able to offer bitmap support at finer granularity than
the absolute minimum block size helps no one, and if it were possible
to support a 256 byte block size (I think some floppy disks had that)
I see no reason not to support that as a granularity.

-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:50       ` Wouter Verhelst
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 13:07         ` Alex Bligh
  2016-12-01 10:14         ` Wouter Verhelst
  2 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-11-29 13:07 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Alex Bligh, Stefan Hajnoczi, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, qemu-devel, Pavel Borzenkov,
	Paolo Bonzini, Markus Pargmann, den


> On 29 Nov 2016, at 10:50, Wouter Verhelst <w@uter.be> wrote:
> 
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.

I haven't read this in detail but this seems to me to be similar to
the idea I just posted (sorry - kept getting interrupted whilst writing
the email) re multiple bitmaps?

But the name 'ALLOC_CONTEXT' is a bit weird. Why not call it 'metadata
bitmaps' or 'metadata extents' or something? Metadata seems right as
it's data about data.

> +##### Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> +

But this seems strange. Whether something is 'allocated' seems orthogonal
to me to whether it needs backing up. Even the fact that it's been zeroed
(now a hole) might need backing up.

So don't we need multiple independent lists of extents? Of course a server
might *implement* them under the hood with separate bitmaps or one big
bitmap, or no bitmap at all (for instance using file extents on POSIX).
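
For instance, a server backed by a sparse file could synthesise the
allocation extent list with nothing but lseek. A rough sketch, assuming
the platform supports SEEK_DATA/SEEK_HOLE:

import errno, os

def allocation_extents(fd, offset, length):
    end, extents = offset + length, []
    while offset < end:
        try:
            data = os.lseek(fd, offset, os.SEEK_DATA)  # next data at/after offset
        except OSError as e:
            if e.errno != errno.ENXIO:
                raise
            data = end                                 # only a hole remains
        if data > offset:
            extents.append((offset, min(data, end) - offset, "hole"))
        if data >= end:
            break
        hole = os.lseek(fd, data, os.SEEK_HOLE)        # end of this data run
        extents.append((data, min(hole, end) - data, "data"))
        offset = hole
    return extents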



-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 13:08           ` Wouter Verhelst
  0 siblings, 0 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-11-29 13:08 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Stefan Hajnoczi, nbd-general, kwolf, qemu-devel, pborzenkov,
	pbonzini, mpa, den

Hi Vladimir,

On Tue, Nov 29, 2016 at 03:41:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi,
> 
> 29.11.2016 13:50, Wouter Verhelst wrote:
> 
>     Hi,
> 
>     On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> 
>         However, I'm arguing that if we're going to provide information about
>         snapshots, we should be able to properly refer to these snapshots from
>         within an NBD context. My previous mail suggested adding a negotiation
>         message that would essentially ask the server "tell me about the
>         snapshots you know about", giving them an NBD identifier in the process
>         (accompanied by a "foreign" identifier that is decidedly *not* an NBD
>         identifier and that could be used to match the NBD identifier to
>         something implementation-defined). This would be read-only information;
>         the client cannot ask the server to create new snapshots. We can then
>         later in the protocol refer to these snapshots by way of that NBD
>         identifier.
> 
>     To make this a bit more concrete, I've changed the proposal like so:
> 
[...]
>     +##### Allocation contexts
>     +
>     +Allocation context 0 is the basic "exists at all" allocation context. If
>     +an extent is not allocated at allocation context 0, it MUST NOT be
>     +listed as allocated at another allocation context. This supports sparse
> 
> 
> 'allocated' here means a range with the NBD_STATE_HOLE bit unset?

Eh, yes. I should clarify that a bit further (no time right now though,
but patches are certainly welcome).

>     +file semantics on the server side. If a server has only one allocation
>     +context (the default), then writing to an extent which is allocated in
>     +that allocation context 0 MUST NOT fail with ENOSPC.
>     +
>     +For all other cases, this specification requires no specific semantics
>     +of allocation contexts. Implementations could support allocation
>     +contexts with semantics like the following:
>     +
>     +- Incremental snapshots; if a block is allocated in one allocation
>     +  context, that implies that it is also allocated in the next level up.
>     +- Various bits of data about the backend of the storage; e.g., if the
>     +  storage is written on a RAID array, an allocation context could
>     +  return information about the redundancy level of a given extent
>     +- If the backend implements a write-through cache of some sort, or
>     +  synchronises with other servers, an allocation context could state
>     +  that an extent is "allocated" once it has reached permanent storage
>     +  and/or is synchronized with other servers.
>     +
>     +The only requirement of an allocation context is that it MUST be
>     +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
>     +
>     +Likewise, the syntax of query strings is not specified by this document.
>     +
>     +Server implementations SHOULD document their syntax for query strings
>     +and semantics for resulting allocation contexts in a document like this
>     +one.
> 
> 
> IMHO, redundant paragraph for this spec.

The "SHOULD document" one? I want that there at any rate, just to make
clear that it's probably a good thing to have it (but no, it's not part
of the formal protocol spec).

>     +
>      ### Transmission phase
> 
>      #### Flag fields
>     @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake phase.
>         content chunk in reply.  MUST NOT be set unless the transmission
>         flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>         `EOVERFLOW` error chunk, if the request length is too large.
>     +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
>     +  set, the client is interested in only one extent per allocation
>     +  context.
> 
>      ##### Structured reply flags
> 
>     @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
>      ranges with their respective states.  This extension is not available
>      unless the client also negotiates the `STRUCTURED_REPLY` extension.
> 
>     -* `NBD_FLAG_SEND_BLOCK_STATUS`
>     -
>     -    The server SHOULD set this transmission flag to 1 if structured
>     -    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
>     -    request is supported.
>     -
>      * `NBD_REPLY_TYPE_BLOCK_STATUS`
> 
>     -    *length* MUST be a positive integer multiple of 8.  This reply
>     +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
>          represents a series of consecutive block descriptors where the sum
>          of the lengths of the descriptors MUST NOT be greater than the
>     -    length of the original request.  This chunk type MUST appear at most
>     -    once in a structured reply. Valid as a reply to
>     +    length of the original request. This chunk type MUST appear at most
>     +    once per allocation ID in a structured reply. Valid as a reply to
>          `NBD_CMD_BLOCK_STATUS`.
> 
>     -    The payload is structured as a list of one or more descriptors,
>     -    each with this layout:
>     +    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
>     +    allocation context ID, except if the semantics of particular
>     +    allocation contexts mean that the information for one allocation
>     +    context is implied by the information for another.
> 
> 
> So, actually, instead of selecting which bitmap we want to access with an
> nbd_cmd or with an external tool, we just reply with all bitmaps (negotiated
> at the beginning). Personally I dislike this. With such an approach, we will
> always have to export the allocation bitmap, even if we need only dirtiness.
> Consider that requesting the allocation bitmap may be much more expensive in
> time than requesting dirtiness.

Ah, yes, didn't consider that. I suppose more updates will be required,
then.

What would you suggest instead, then?

> Or is allocation bitmap information implied by dirtiness?
> 
> Furthermore, as allocation context semantics are defined externally, 'semantics
> mean that the information is implied' states nothing, and we actually return to
> approach #3, where an external tool decides which bitmap to export.
> 
> 
>     +
>     +    The payload starts with:
>     +
>     +        * 32 bits, allocation context ID
>     +
>     +    and is followed by a list of one or more descriptors, each with this
>     +    layout:
> 
>              * 32 bits, length (unsigned, MUST NOT be zero)
>              * 32 bits, status flags
> 
>     -    The definition of the status flags is determined based on the
>     -    flags present in the original request.
>     +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
>     +    then every reply chunk MUST NOT contain more than one descriptor.
>     +
>     +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
>     +    its request, the server MAY return fewer descriptors in the reply
>     +    than would be required to fully specify the whole range of requested
>     +    information to the client, if the number of descriptors would be
>     +    over 16 otherwise and looking up the information would be too
>     +    resource-intensive for the server.
> 
> 
> So, if there are <=16 extents, they all MUST be present in the reply (just a note)

That's the proposal, yes. I think it makes sense to have a minimum that
MUST be present (unless the client asked for REQ_ONE), although the
exact count can be different from 16, if need be.
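
For concreteness, parsing one such chunk would look roughly like this
(a sketch; big-endian, as everywhere in NBD):

import struct

def parse_block_status_chunk(payload):
    # u32 allocation context ID, then 8-byte (u32 length, u32 flags) descriptors
    if len(payload) < 12 or (len(payload) - 4) % 8:
        raise ValueError("length must be 4 + a positive multiple of 8")
    (context_id,) = struct.unpack_from(">I", payload, 0)
    descriptors = [struct.unpack_from(">II", payload, off)
                   for off in range(4, len(payload), 8)]
    return context_id, descriptors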

[...]
>     Thoughts?
> 
> For me considering dirtyness like one of allocation contexts sounds a bit
> weird, as dirtyness is not allocation.. But it is not so important.

I would certainly be willing to change the name, if that helps. The idea
is that you have various types of information that you can query the
server about. I called these "allocation contexts", but I'm certainly
not of the opinion that it *has* to be called that. Allocation is one of
the information types, but there can be more; my proposed spec
explicitly does not go into detail about the others.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 12:57 ` [Qemu-devel] " Alex Bligh
@ 2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
  2016-11-29 14:52     ` Alex Bligh
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
  1 sibling, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 14:36 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

29.11.2016 15:57, Alex Bligh wrote:
> Vladimir,
>
> I went back to April to reread the previous train of conversation,
> and found you had helpfully summarised some of it. Comments
> below.
>
> Rather than comment on many of the points individually: the root
> of my confusion, and to some extent discomfort, about this
> proposal is 'who owns the meaning of the bitmaps'.
>
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read-only disks, but the text talks about:
>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>
> I now think what you are talking about is backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.

Yes, something like this.

Note: in Qemu it may not be a snapshot (actually, I didn't see a way in
Qemu to export snapshots without switching to them, except opening an
external snapshot as a separate block dev), but just any read-only
drive, or temporary drive, used in the image fleecing scheme:

driveA is online normal drive
driveB is empty nbd export

- start backup driveA->driveB with sync=none, which means that the only
copying done is of old data from A to B before every write to A
- and set driveA as backing for driveB (on a read from B, if the data is
not allocated there, it will be read from A)

after that, driveB is something like a snapshot for backup through NBD.
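
Roughly, over QMP the setup looks like this (a sketch from memory;
command arguments vary between Qemu versions, and creating driveB with
driveA as its backing file is elided):

import json, socket

def qmp(sock, cmd, **args):
    sock.sendall(json.dumps({"execute": cmd, "arguments": args}).encode())
    return json.loads(sock.recv(65536))

s = socket.socket(socket.AF_UNIX)
s.connect("/var/run/qmp.sock")           # hypothetical socket path
s.recv(65536)                            # QMP greeting
qmp(s, "qmp_capabilities")
qmp(s, "blockdev-backup", device="driveA", target="driveB", sync="none")
qmp(s, "nbd-server-start",
    addr={"type": "inet", "data": {"host": "0.0.0.0", "port": "10809"}})
qmp(s, "nbd-server-add", device="driveB")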

It's all just to say: calling the backed-up point-in-time state just a
'snapshot' confuses me and I hope not to see this word in the spec (in
this context).

>
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap, which is outside
> the scope of this protocol; the protocol merely transmits that state.
>
> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'?
>
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
>
> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.

Good point. For Qcow2 we finally come to just bitmaps, not "dirty 
bitmaps", to make it more general. There is a problem with allocation if 
we want to make it a subcase of bitmap: allocation natively has two
bits per block: zero and allocated. We can of course separate this into
two bitmaps, but this will not be similar to the classic get_block_status.

> Some more nits:
>
>> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
>
>
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: server replies with status descriptors of any size, granularity
>>          is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities are not intersecting
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
>
>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>           pros: - simple
>>           cons: - it's a hack. flags field is for other uses.
>>                 - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and exploit this
>>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possible we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: external tool should select which bitmap to export. So, in case of Qemu
>>          it should be something like qmp command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the lesser evil)
>>          note: Hmm, external tool can choose between allocated/dirty data too,
>>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data to be sent across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
>
>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined and native
>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>          undefined data, called 'dirtyness'.
> I'm suggesting one generic 'read bitmap' command like you.

To support get_block_status in this general read_bitmap, we will need to 
define something like 'multibitmap', which allows several bits per 
chunk, as allocation data has two: zero and allocated.

>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.

the issue is: too many descriptors are possible. So, (1) solves it. (2) is
optional, just to simplify/optimize the client side.

>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if it finds unknown bits, client should consider a reply
>>    with unknown bits as an error)
> Yeah.
>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> MUST NOT is what we say elsewhere I believe.
>
>> +    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.

I agree.

>
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST NOT be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than requested. However the server MUST return at
>> +    least one status descriptor
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.

if the bitmap is 010101010101 we will have too many descriptors. For
example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of descriptor payload.
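
(The arithmetic, for the record: alternating bits give one 8-byte
descriptor per 64 KiB block.)

extents = (16 * 2**40) // (64 * 2**10)   # 268435456 descriptors
payload = extents * 8                     # 2147483648 bytes = 2 GiB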

>
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> Surely better would be an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
>

Agreed. Anyway, it is just a strong recommendation, not a requirement.


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 14:52     ` Alex Bligh
  2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Bligh @ 2016-11-29 14:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>   cons: Two commands with almost same semantic and similar means?
>>>   pros: However here is a good point of separating clearly defined and native
>>>         for block devices GET_BLOCK_STATUS from user-driven and actually
>>>         undefined data, called 'dirtyness'.
>> I'm suggesting one generic 'read bitmap' command like you.
> 
> To support get_block_status in this general read_bitmap, we will need to define something like 'multibitmap', which allows several bits per chunk, as allocation data has two: zero and allocated.

I think you are saying that for an arbitrary 'bitmap' there might be more than one state. For instance, one might (in an allocation 'bitmap') have a hole, a non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. Each extent has a status. For the bitmap to be useful, at least two statuses need to be possible, but the above would have three. This could be internally implemented by the server as (a) a bitmap (with two bits per entry), (b) two bitmaps (possibly with different granularity), (c) something else (e.g. reading file extents, then if the data is allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents (as you suggested) is a good idea, and there need not be an actual bitmap.
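
(I.e. collapsing the two allocation bits into the three statuses above,
a trivial sketch:)

def extent_status(hole, zero):
    if hole:
        return "hole"
    return "non-hole-zero" if zero else "non-hole-non-zero"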

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>   variants:
>>>   1: just allow server to restrict this as it wants (which was done in v3)
>>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>>      of descriptors.
>>>      2.1: add command flag, which will request only one descriptor
>>>           (otherwise, no restrictions from the client)
>>>      2.2: again, introduce extended nbd requests, and add field to
>>>           specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
> 
> the issue is: too many descriptors possible. So, (1) solves it. (2) is optional, just to simplify/optimize client side.

I think I'd prefer the server to return what it was asked for, and the client to deal with it. So either the client should be able to specify a maximum number of extents (and if we are extending the command structure, that's possible) or we deal with the client consuming and retrying unwanted extents. The reason for this is that it's unlikely the server can know a priori the number of extents which is the appropriate maximum for the client.

>>> +    The list of block status descriptors within the
>>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> +    of the file starting from specified *offset*, and the sum of the
>>> +    *length* fields of each descriptor MUST NOT be greater than the
>>> +    overall *length* of the request. This means that the server MAY
>>> +    return less data than requested. However the server MUST return at
>>> +    least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
> 
> if the bitmap is 010101010101 we will have too many descriptors. For example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of descriptor payload.

Yep. And the cost of consuming and retrying is quite high. One option would be for the client to realise this is a possibility, and not request the entire extent map for a 16TB disk, as it might be very large! Even if the client worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per reply), this isn't going to noticeably increase the round trip timing. One issue here is that to determine a 'reasonable' size, the client needs to know the minimum length of any extent.
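
A sketch of that windowed approach, with block_status() again standing
in for a hypothetical helper yielding (length, flags) pairs; at a
64 KiB minimum extent size, a 64 MiB window caps the reply at 1024
descriptors:

WINDOW = 64 * 2**20

def scan(conn, disk_size):
    offset = 0
    while offset < disk_size:
        length = min(WINDOW, disk_size - offset)
        for ext_len, flags in conn.block_status(offset, length):
            yield offset, ext_len, flags
            offset += ext_len
        # a short reply simply makes the next window start earlier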

I think the answer is probably a 'maximum number of extents' in the request packet.

Of course with statuses in extent, the final extent could be represented as 'I don't know, break this bit into a separate request' status.

-- 
Alex Bligh


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 14:52     ` Alex Bligh
@ 2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
  2016-11-29 15:17         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  0 siblings, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 15:07 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

29.11.2016 17:52, Alex Bligh wrote:
> Vladimir,
>
>>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>>    cons: Two commands with almost same semantic and similar means?
>>>>    pros: However here is a good point of separating clearly defined and native
>>>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>>>          undefined data, called 'dirtyness'.
>>> I'm suggesting one generic 'read bitmap' command like you.
>> To support get_block_status in this general read_bitmap, we will need to define something like 'multibitmap', which allows several bits per chunk, as allocation data has two: zero and allocated.
> I think you are saying that for an arbitrary 'bitmap' there might be more than one state. For instance, one might (in an allocation 'bitmap') have a hole, a non-hole-zero, or a non-hole-non-zero.
>
> In the spec I'd suggest, for one 'bitmap', we represent the output as extents. Each extent has a status. For the bitmap to be useful, at least two statuses need to be possible, but the above would have three. This could be internally implemented by the server as (a) a bitmap (with two bits per entry), (b) two bitmaps (possibly with different granularity), (c) something else (e.g. reading file extents, then if the data is allocated manually comparing it against zero).
>
> I should have put 'bitmap' in quotes in what I wrote because returning extents (as you suggested) is a good idea, and there need not be an actual bitmap.
>
>>>> 5. Number of status descriptors, sent by server, should be restricted
>>>>    variants:
>>>>    1: just allow server to restrict this as it wants (which was done in v3)
>>>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>>>       of descriptors.
>>>>       2.1: add command flag, which will request only one descriptor
>>>>            (otherwise, no restrictions from the client)
>>>>       2.2: again, introduce extended nbd requests, and add field to
>>>>            specify this maximum
>>> I think some form of extended request is the way to go, but out of
>>> interest, what's the issue with as many descriptors being sent as it
>>> takes to encode the reply? The client can just consume the remainder
>>> (without buffering) and reissue the request at a later point for
>>> the areas it discarded.
>> the issue is: too many descriptors are possible. So, (1) solves it. (2) is optional, just to simplify/optimize the client side.
> I think I'd prefer the server to return what it was asked for, and the client to deal with it. So either the client should be able to specify a maximum number of extents (and if we are extending the command structure, that's possible) or we deal with the client consuming and retrying unwanted extents. The reason for this is that it's unlikely the server can know a priori the number of extents which is the appropriate maximum for the client.
>
>>>> +    The list of block status descriptors within the
>>>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>>> +    of the file starting from specified *offset*, and the sum of the
>>>> +    *length* fields of each descriptor MUST NOT be greater than the
>>>> +    overall *length* of the request. This means that the server MAY
>>>> +    return less data than requested. However the server MUST return at
>>>> +    least one status descriptor
>>> I'm not sure I understand why that's useful. What should the client
>>> infer from the server refusing to provide information? We don't
>>> permit short reads etc.
>> if the bitmap is 010101010101 we will have too many descriptors. For example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of descriptor payload.
> Yep. And the cost of consuming and retrying is quite high. One option would be for the client to realise this is a possibility, and not request the entire extent map for a 16TB disk, as it might be very large! Even if the client worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per reply), this isn't going to noticeably increase the round trip timing. One issue here is that to determine a 'reasonable' size, the client needs to know the minimum length of any extent.

and with this approach we will in turn have the overhead of too many
requests for 00000000 or 11111111 bitmaps.

>
> I think the answer is probably a 'maximum number of extents' in the request packet.
>
> Of course with statuses in extent, the final extent could be represented as 'I don't know, break this bit into a separate request' status.
>

With such a predefined status, we can postpone creating extended requests,
have the number of extents restricted by the server, and have the sum of
extent lengths be equal to the request length.


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 15:17         ` Wouter Verhelst
  0 siblings, 0 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-11-29 15:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, nbd-general, Kevin Wolf, qemu-devel, mpa,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Paolo Bonzini

On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 29.11.2016 17:52, Alex Bligh wrote:
> > Vladimir,
> >> if the bitmap is 010101010101 we will have too many descriptors.
> >> For example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of
> >> descriptor payload.
> > Yep. And the cost of consuming and retrying is quite high. One
> > option would be for the client to realise this is a possibility, and
> > not request the entire extent map for a 16TB disk, as it might be
> > very large! Even if the client worked at e.g. a 64MB level (where
> > they'd get a maximum of 1024 extents per reply), this isn't going to
> > noticeably increase the round trip timing. One issue here is that to
> > determine a 'reasonable' size, the client needs to know the minimum
> > length of any extent.
> 
> and with this approach we will in turn have overhead of too many 
> requests for 00000000 or 11111111 bitmaps.

This is why my proposal suggests the server may stop sending extents
after X extents (sixteen, but that number can certainly change) have
been sent. After all, the server will have a better view on what's going
to be costly in terms of "number of extents".

> > I think the answer is probably a 'maximum number of extents' in the request packet.
> >
> > Of course with statuses in extent, the final extent could be
> > represented as 'I don't know, break this bit into a separate
> > request' status.
> >
> 
> With such predefined status, we can postpone creating extended requests, 
> have number of extents restricted by server and have sum of extents 
> lengths be equal to request length.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
                     ` (2 preceding siblings ...)
  2016-11-29 10:18   ` Kevin Wolf
@ 2016-11-30 10:41   ` Sergey Talantov
  3 siblings, 0 replies; 37+ messages in thread
From: Sergey Talantov @ 2016-11-30 10:41 UTC (permalink / raw)
  To: Wouter Verhelst, Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, stefanha, pbonzini, mpa, den

Hi, Wouter!

> Actually, come to think of that. What is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the
> block status thingies, right?

Overall, the most desired use case for this NBD extension is to allow third-party software to make incremental backups.
Acronis (a vendor of backup solutions) would support qemu backup if block status is provided.


-----Original Message-----
From: Wouter Verhelst [mailto:w@uter.be] 
Sent: Sunday, November 27, 2016 22:17
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: nbd-general@lists.sourceforge.net; kwolf@redhat.com; qemu-devel@nongnu.org; pborzenkov@virtuozzo.com; stefanha@redhat.com; pbonzini@redhat.com; mpa@pengutronix.de; den@openvz.org
Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if they're useful, but I would prefer that they're also useful with NBD on its own, i.e., without requiring an initiation or correlation of some state through another protocol or network connection or whatever. If that's needed, that feels like I didn't do my job properly, if you get my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed to 
> query status of a particular range and read only those blocks of data 
> that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS extension 
> with one new NBD_CMD_BLOCK_STATUS command, a new structured reply 
> chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required during, 
> for example, incremental block device backup. To express this concept 
> via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS 
> to request dirtiness information rather than provisioning information; 
> however, with the current proposal, data dirtiness is only useful with 
> additional coordination outside of the NBD protocol (such as a way to 
> start and stop the server from tracking dirty sectors).  Future NBD 
> extensions may add commands to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET 
> LBA STATUS" command more closely, it has been chosen to return a list 
> of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of 
> a bitmap.
> 
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake); the only change
> is removing the restriction that the sum of status descriptor lengths
> must be equal to the requested length. I.e., let's permit the server to
> reply with less data than required if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.

> And, finally, I've rebased this onto current state of 
> extension-structured-reply branch (which itself should be rebased on 
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the discussion, started about
> half a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>    A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state. This is true for NBD in general (i.e., concurrent read or write commands from other clients may confuse file systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: server replies with status descriptors of any size, granularity
>          is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities are not intersecting

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>    A: several variants:
>       1: id of bitmap is in flags field of request
>           pros: - simple
>           cons: - it's a hack. flags field is for other uses.
>                 - we'll have to map bitmap names to these "ids"
>       2: introduce extended nbd requests with variable length and exploit this
>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>          pros: - looks like a true way
>          cons: - we have to create additional extension
>                - possible we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: external tool should select which bitmap to export. So, in case of Qemu
>          it should be something like qmp command block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - additional qmp command to implement (possibly, the lesser evil)
>          note: Hmm, external tool can choose between allocated/dirty data too,
>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Downside of 3, though, is that it moves the definition of what the different states mean outside of the NBD protocol (i.e., the protocol messages are not entirely defined anymore, and their meaning depends on the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means *by default*, but then we can add a note that clients and servers can possibly define other meanings out of band if they want to.

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>    cons: Two commands with almost same semantic and similar means?
>    pros: However here is a good point of separating clearly defined and native
>          for block devices GET_BLOCK_STATUS from user-driven and actually
>          undefined data, called 'dirtyness'.

Yeah, having them separate commands might be a bad idea indeed.

> 5. The number of status descriptors sent by the server should be restricted
>    variants:
>    1: just allow the server to restrict this as it wants (which was done in v3)
>    2: (not excluding 1). The client somehow specifies the maximum number
>       of descriptors.
>       2.1: add a command flag which will request only one descriptor
>            (otherwise, no restrictions from the client)
>       2.2: again, introduce extended nbd requests, and add a field to
>            specify this maximum

I think having a flag which requests just one descriptor can be useful, but I'm hesitant to add it unless it's actually going to be used; so in other words, I'll leave the decision on that bit to you.
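
To make the "server may reply with fewer descriptors" behaviour concrete, here is a minimal client-side sketch, in C, of consuming one NBD_REPLY_TYPE_BLOCK_STATUS chunk under the draft layout (a 32-bit context ID followed by {32-bit length, 32-bit status} descriptors). read_full() and parse_block_status_chunk() are invented helpers, not part of any existing library:

#include <arpa/inet.h>   /* ntohl() */
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Read exactly 'len' bytes from fd, or fail. */
static bool read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return false;
        p += n;
        len -= (size_t)n;
    }
    return true;
}

/*
 * Consume one NBD_REPLY_TYPE_BLOCK_STATUS chunk of 'payload_len' bytes.
 * Sets *covered to the number of bytes the descriptors account for; as
 * the v3 draft lets the server cover less than the requested length,
 * the caller reissues the query at (offset + *covered) if it needs the
 * rest.
 */
static bool parse_block_status_chunk(int fd, uint32_t payload_len,
                                     uint64_t requested_len,
                                     uint64_t *covered)
{
    uint32_t context_id;

    /* context ID plus at least one 8-byte descriptor */
    if (payload_len < 12 || (payload_len - 4) % 8 != 0)
        return false;
    if (!read_full(fd, &context_id, sizeof(context_id)))
        return false;
    context_id = ntohl(context_id);
    (void)context_id;            /* per-context dispatch would go here */

    *covered = 0;
    for (uint32_t left = payload_len - 4; left != 0; left -= 8) {
        uint32_t desc[2];        /* { length, status }, big-endian */
        if (!read_full(fd, desc, sizeof(desc)))
            return false;
        *covered += ntohl(desc[0]);
        /* hand (ntohl(desc[0]), ntohl(desc[1])) to the consumer here */
    }
    return *covered <= requested_len;
}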

> 6. Q: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (The server should
>    return EINVAL if it finds unknown bits; the client should consider a reply
>    with unknown bits an error.)

Right, probably best to do that, yes.
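
For illustration, the server side of that rule is tiny; a sketch, assuming the single flag this draft defines and an errno-style return (the names are mine, not from any implementation):

#include <errno.h>
#include <stdint.h>

#define NBD_CMD_FLAG_REQ_ONE  (1 << 3)   /* the only flag defined here */

/* Returns 0, or -EINVAL when the client set a reserved bit. */
static int validate_block_status_flags(uint16_t flags)
{
    if (flags & ~(uint16_t)NBD_CMD_FLAG_REQ_ONE)
        return -EINVAL;
    return 0;
}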

> ======
> 
> Also, an idea on 2-4:
> 
>     As we say, that dirtiness is unknown for NBD, and external tool
>     should specify, manage and understand, which data is actually
>     transmitted, why not just call it user_data and leave status field
>     of reply chunk unspecified in this case?
> 
>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>     leave the status field to some external... protocol? Who knows what
>     this user data is.

Yes, this sounds like a reasonable approach.

>     Note: I'm not sure that I like this (my own) proposal. It's just an
>     idea; maybe someone likes it. And I think it represents more honestly
>     what we are trying to do.

Indeed.

>     Note2: the next step of generalization will be NBD_CMD_USER, with
>     variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
> 
>     Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the "dirty" thing seems to want to be. Whether something is allocated is just a special case of that.

Actually, come to think of it: what is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the block status thingies, right? If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the definition of what a snapshot is back into the NBD protocol, without having to say "this could be anything", and without requiring connectivity over two ports for this to be useful (e.g., you could store the machine-readable form of the snapshot description into your backup program and match what they mean with what you're interested in at restore time, etc).

This wouldn't work if you're interested in new snapshots that get created once we've already moved into transmission, but hey.
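
Purely to make the shape concrete, here is one possible layout for a single snapshot entry in the reply; it is entirely invented, since the description format is deliberately left open above:

#include <stdint.h>

/*
 * Hypothetical reply entry for NBD_OPT_GET_SNAPSHOTS; nothing here is
 * in any draft, it only pins down the idea sketched above.
 */
struct nbd_rep_snapshot {
    uint32_t snapshot_id;   /* later referenced by NBD_CMD_BLOCK_STATUS */
    uint32_t desc_len;      /* length of the description that follows */
    /* followed by desc_len bytes of machine-readable description,
     * not defined by NBD (e.g., a qemu json blob) */
};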

Thoughts?

>     Full backup: just do a structured read - it will show us which chunks
>     may be treated as zeroes.

Right.

>     Incremental backup: get the dirty bitmap (somehow, for example through
>     the user-defined part of the proposed command), then, for dirty blocks,
>     read them through structured read, so information about zero/unallocated
>     areas is included.
> 
> For me all the variants above are OK. Let's finally choose something.
> 
> v2:
> v1 was: 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
> 
> Since then, we've added the STRUCTURED_REPLY extension, which 
> necessitates a rather larger rebase; I've also changed things to 
> rename the command 'NBD_CMD_BLOCK_STATUS', changed the request modes 
> to be determined by boolean flags (rather than by fixed values of the 
> 16-bit flags field), changed the reply status fields to be bitwise-or 
> values (with a default of 0 always being sane), and changed the 
> descriptor layout to drop an offset but to include a 32-bit status so 
> that the descriptor is nicely 8-byte aligned without padding.
> 
>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 154 insertions(+), 1 deletion(-)

[...]

I'll commit this in a minute into a separate branch called "extension-blockstatus", on the understanding that changes are still required, as per the above (i.e., don't assume that just because there's a branch, I'm happy with the current result ;-)

Regards

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:50       ` Wouter Verhelst
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
  2016-11-29 13:07         ` Alex Bligh
@ 2016-12-01 10:14         ` Wouter Verhelst
  2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-01 10:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, pbonzini, mpa, den

Here's another update.

Changes since previous version:
- Rename "allocation context" to "metadata context"
- Stop making metadata context 0 be special; instead, name it
  "BASE:allocation" and allow it to be selected like all other contexts.
- Clarify in a bit more detail when a server MAY omit metadata
  information on a particular metadata context (i.e., only if another
  metadata context that we actually got information for implies it can't
  have meaning). This was always meant that way, but the spec could have
  been a bit more explicit about it.
- Change one SHOULD to a MUST, where it should not have been a SHOULD
  in the first place.

(This applies on top of my previous patch)

Applied and visible at the usual place.

diff --git a/doc/proto.md b/doc/proto.md
index fe7ae53..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -869,16 +869,16 @@ of the newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_OPT_ALLOC_CONTEXT` (10)
+- `NBD_OPT_META_CONTEXT` (10)
 
-    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
     followed by an `NBD_REP_ACK`. If a server replies to such a request
     with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
     commands during the transmission phase.
 
     If the query string is syntactically invalid, the server SHOULD send
     `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
-    but finds no allocation contexts, the server MUST send a single
+    but finds no metadata contexts, the server MUST send a single
     reply of type `NBD_REP_ACK`.
 
     This option MUST NOT be requested unless structured replies have
@@ -887,9 +887,9 @@ of the newstyle negotiation.
 
     Data:
     - 32 bits, type
-    - String, query to select a subset of the available allocation
+    - String, query to select a subset of the available metadata
       contexts. If this is not specified (i.e., length is 4 and no
-      command is sent), then the server MUST send all the allocation
+      command is sent), then the server MUST send all the metadata
       contexts it knows about. If specified, this query string MUST
       start with a name that uniquely identifies a server
       implementation; e.g., the reference implementation that
@@ -897,18 +897,22 @@ of the newstyle negotiation.
       with 'nbd-server:'
 
     The type may be one of:
-    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
       selected by the query string is returned to the client without
-      changing any state (i.e., this does not add allocation contexts
+      changing any state (i.e., this does not add metadata contexts
       for further usage).
-    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
       selected by the query string is added to the list of existing
-      allocation contexts.
-    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+      metadata contexts.
+    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
       selected by the query string is removed from the list of used
-      allocation contexts. Servers SHOULD NOT reuse existing allocation
+      metadata contexts. Servers SHOULD NOT reuse existing metadata
       context IDs.
 
+    The syntax of the query string is not specified, except that
+    implementations MUST support adding and removing individual metadata
+    contexts by simply listing their names.
+
 #### Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
     information is available, or when sending data related to the option
     (in the case of `NBD_OPT_LIST`) has finished. No data.
 
-* `NBD_REP_SERVER` (2)
+- `NBD_REP_SERVER` (2)
 
     A description of an export. Data:
 
@@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.
       particular client request, this field is defined to be a string
       suitable for direct display to a human being.
 
-* `NBD_REP_INFO` (3)
+- `NBD_REP_INFO` (3)
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_REP_ALLOC_CONTEXT` (4)
+- `NBD_REP_META_CONTEXT` (4)
 
-    A description of an allocation context. Data:
+    A description of a metadata context. Data:
 
-    - 32 bits, NBD allocation context ID. If the request was NOT of type
-      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
-    - String, name of the allocation context. This is not required to be
+    - 32 bits, NBD metadata context ID.
+    - String, name of the metadata context. This is not required to be
       a human-readable string, but it MUST be valid UTF-8 data.
 
-    Allocation context ID 0 is implied, and always exists. It cannot be
-    removed.
+    This specification declares one metadata context. It is called
+    "BASE:allocation" and contains the basic "exists at all" context.
 
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
@@ -997,36 +1000,37 @@ case that data is an error message string suitable for display to the user.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-##### Allocation contexts
+##### Metadata contexts
 
-Allocation context 0 is the basic "exists at all" allocation context. If
-an extent is not allocated at allocation context 0, it MUST NOT be
-listed as allocated at another allocation context. This supports sparse
-file semantics on the server side. If a server has only one allocation
-context (the default), then writing to an extent which is allocated in
-that allocation context 0 MUST NOT fail with ENOSPC.
+The "BASE:allocation" metadata context is the basic "exists at all"
+metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
+context, this means that the given extent is not allocated in the
+backend storage, and that writing to the extent MAY result in the ENOSPC
+error. This supports sparse file semantics on the server side. If a
+server has only one metadata context (the default), then writing to an
+extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
 
 For all other cases, this specification requires no specific semantics
-of allocation contexts. Implementations could support allocation
+of metadata contexts. Implementations could support metadata
 contexts with semantics like the following:
 
-- Incremental snapshots; if a block is allocated in one allocation
+- Incremental snapshots; if a block is allocated in one metadata
   context, that implies that it is also allocated in the next level up.
 - Various bits of data about the backend of the storage; e.g., if the
-  storage is written on a RAID array, an allocation context could
+  storage is written on a RAID array, a metadata context could
   return information about the redundancy level of a given extent
 - If the backend implements a write-through cache of some sort, or
-  synchronises with other servers, an allocation context could state
-  that an extent is "allocated" once it has reached permanent storage
+  synchronises with other servers, a metadata context could state
+  that an extent is "active" once it has reached permanent storage
   and/or is synchronized with other servers.
 
-The only requirement of an allocation context is that it MUST be
+The only requirement of a metadata context is that it MUST be
 representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
 
 Likewise, the syntax of query strings is not specified by this document.
 
 Server implementations SHOULD document their syntax for query strings
-and semantics for resulting allocation contexts in a document like this
+and semantics for resulting metadata contexts in a document like this
 one.
 
 ### Transmission phase
@@ -1066,7 +1070,7 @@ valid may depend on negotiation during the handshake phase.
    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
    `EOVERFLOW` error chunk, if the request length is too large.
 - bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
-  set, the client is interested in only one extent per allocation
+  set, the client is interested in only one extent per metadata
   context.
 
 ##### Structured reply flags
@@ -1462,17 +1466,22 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
     represents a series of consecutive block descriptors where the sum
     of the lengths of the descriptors MUST not be greater than the
     length of the original request. This chunk type MUST appear at most
-    once per allocation ID in a structured reply. Valid as a reply to
+    once per metadata ID in a structured reply. Valid as a reply to
     `NBD_CMD_BLOCK_STATUS`.
 
     Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
-    allocation context ID, except if the semantics of particular
-    allocation contexts mean that the information for one allocation
-    context is implied by the information for another.
+    metadata context ID, except if the semantics of particular
+    metadata contexts mean that the information for one active metadata
+    context is implied by the information for another; e.g., if a
+    particular metadata context can only have meaning for extents where
+    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
+    context, servers MAY omit the relevant chunks for that context if
+    they already sent an extent with the `NBD_STATE_HOLE` flag set in
+    reply to the same `NBD_CMD_BLOCK_STATUS` command.
 
     The payload starts with:
 
-        * 32 bits, allocation context ID
+        * 32 bits, metadata context ID
 
     and is followed by a list of one or more descriptors, each with this
     layout:
@@ -1493,7 +1502,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
 * `NBD_CMD_BLOCK_STATUS`
 
     A block status query request. Length and offset define the range of
-    interest. Clients SHOULD NOT use this request unless allocation
+    interest. Clients MUST NOT use this request unless metadata
     contexts have been negotiated, which in turn requires the client to
     first negotiate structured replies. For a successful return, the
     server MUST use a structured reply, containing at least one chunk of
@@ -1533,7 +1542,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
         or the server could not otherwise determine its status.  Note
         that the use of `NBD_CMD_TRIM` is related to this status, but
         that the server MAY report a hole even where trim has not been
-        requested, and also that a server MAY report allocation even
+        requested, and also that a server MAY report metadata even
         where a trim has been requested.
       - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
         all zeroes; if clear, the block contents are not known.  Note
@@ -1547,25 +1556,25 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
         been written; if clear, the block represents a portion of the
         file that is dirty, or where the server could not otherwise
         determine its status. The server MUST NOT set this bit for
-        allocation context 0, where it has no meaning.
+        the "BASE:allocation" context, where it has no meaning.
       - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
-	portion of the file that is "active" in the given allocation
-	context. The server MUST NOT set this bit for allocation context
-	0, where it has no meaning.
+        portion of the file that is "active" in the given metadata
+        context. The server MUST NOT set this bit for the
+        "BASE:allocation" context, where it has no meaning.
 
     The exact semantics of what it means for a block to be "clean" or
-    "active" at a given allocation context is not defined by this
+    "active" at a given metadata context is not defined by this
     specification, except that the default in both cases should be to
-    clear the bit. That is, when the allocation context does not have
+    clear the bit. That is, when the metadata context does not have
     knowledge of the relevant status for the given extent, or when the
-    allocation context does not assign any meaning to it, the bits
+    metadata context does not assign any meaning to it, the bits
     should be cleared.
 
     A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
     set and `NBD_STATE_ZERO` clear.
 
 A client MAY close the connection if it detects that the server has
-sent an invalid chunks (such as lengths in the
+sent an invalid chunk (such as lengths in the
 `NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
 The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
 request including one or more sectors beyond the size of the device.
@@ -1574,7 +1583,7 @@ The extension adds the following new command flag:
 
 - `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
   SHOULD be set to 1 if the client wants to request information for only
-  one extent per allocation context.
+  one extent per metadata context.
 
 ## About this file
 

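As an aside, the NBD_OPT_META_CONTEXT payload above is simple enough to pin down in a few lines. A hedged sketch: the constants mirror this draft, and build_meta_context_payload() is a local helper, not an existing API.

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define NBD_META_LIST_CONTEXT 1
#define NBD_META_ADD_CONTEXT  2
#define NBD_META_DEL_CONTEXT  3

/*
 * Fill 'buf' (at least 4 + strlen(query) bytes) with an
 * NBD_OPT_META_CONTEXT payload: a 32-bit type, then the query string.
 * Passing query == NULL gives the length-4 payload that asks the
 * server for every metadata context it knows about.
 */
static size_t build_meta_context_payload(uint8_t *buf, uint32_t type,
                                         const char *query)
{
    uint32_t type_be = htonl(type);
    size_t qlen = query ? strlen(query) : 0;

    memcpy(buf, &type_be, sizeof(type_be));
    if (qlen)
        memcpy(buf + 4, query, qlen);   /* no NUL terminator on the wire */
    return 4 + qlen;
}

Selecting the base context is then just build_meta_context_payload(buf, NBD_META_ADD_CONTEXT, "BASE:allocation"), relying on the rule above that listing a context by its plain name must work.
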
-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 10:14         ` Wouter Verhelst
@ 2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
  2016-12-02  9:25             ` Wouter Verhelst
  0 siblings, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-01 11:26 UTC (permalink / raw)
  To: Wouter Verhelst, Stefan Hajnoczi
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, pbonzini, mpa, den

01.12.2016 13:14, Wouter Verhelst wrote:
> Here's another update.
>
> Changes since previous version:
> - Rename "allocation context" to "metadata context"
> - Stop making metadata context 0 be special; instead, name it
>    "BASE:allocation" and allow it to be selected like all other contexts.
> - Clarify in a bit more detail when a server MAY omit metadata
>    information on a particular metadata context (i.e., only if another
>    metadata context that we actually got information for implies it can't
>    have meaning). This was always meant that way, but the spec could have
>    been a bit more explicit about it.
> - Change one SHOULD to a MUST, where it should not have been a SHOULD
>    in the first place.
>
> (This applies on top of my previous patch)
>
> Applied and visible at the usual place.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index fe7ae53..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -869,16 +869,16 @@ of the newstyle negotiation.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -- `NBD_OPT_ALLOC_CONTEXT` (10)
> +- `NBD_OPT_META_CONTEXT` (10)
>   
> -    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>       followed by an `NBD_REP_ACK`. If a server replies to such a request
>       with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
>       commands during the transmission phase.
>   
>       If the query string is syntactically invalid, the server SHOULD send
>       `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> -    but finds no allocation contexts, the server MUST send a single
> +    but finds no metadata contexts, the server MUST send a single
>       reply of type `NBD_REP_ACK`.
>   
>       This option MUST NOT be requested unless structured replies have
> @@ -887,9 +887,9 @@ of the newstyle negotiation.
>   
>       Data:
>       - 32 bits, type
> -    - String, query to select a subset of the available allocation
> +    - String, query to select a subset of the available metadata
>         contexts. If this is not specified (i.e., length is 4 and no
> -      command is sent), then the server MUST send all the allocation
> +      command is sent), then the server MUST send all the metadata
>         contexts it knows about. If specified, this query string MUST
>         start with a name that uniquely identifies a server
>         implementation; e.g., the reference implementation that
> @@ -897,18 +897,22 @@ of the newstyle negotiation.
>         with 'nbd-server:'
>   
>       The type may be one of:
> -    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
> +    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
>         selected by the query string is returned to the client without
> -      changing any state (i.e., this does not add allocation contexts
> +      changing any state (i.e., this does not add metadata contexts
>         for further usage).
> -    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> +    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
>         selected by the query string is added to the list of existing

If I understand correctly, it should not be 'existing' but 'exporting'.
There are several contexts the server knows about; they definitely exist.
Some of them may be selected (by the client) for export (to the client,
through get_block_status).

So, what about 'list of metadata contexts to export', or something like this?

> -      allocation contexts.
> -    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
> +      metadata contexts.
> +    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
>         selected by the query string is removed from the list of used
> -      allocation contexts. Servers SHOULD NOT reuse existing allocation
> +      metadata contexts. Servers SHOULD NOT reuse existing metadata
>         context IDs.
>   
> +    The syntax of the query string is not specified, except that
> +    implementations MUST support adding and removing individual metadata
> +    contexts by simply listing their names.
> +
>   #### Option reply types
>   
>   These values are used in the "reply type" field, sent by the server
> @@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
>       information is available, or when sending data related to the option
>       (in the case of `NBD_OPT_LIST`) has finished. No data.
>   
> -* `NBD_REP_SERVER` (2)
> +- `NBD_REP_SERVER` (2)
>   
>       A description of an export. Data:
>   
> @@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.
>         particular client request, this field is defined to be a string
>         suitable for direct display to a human being.
>   
> -* `NBD_REP_INFO` (3)
> +- `NBD_REP_INFO` (3)
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -- `NBD_REP_ALLOC_CONTEXT` (4)
> +- `NBD_REP_META_CONTEXT` (4)
>   
> -    A description of an allocation context. Data:
> +    A description of a metadata context. Data:
>   
> -    - 32 bits, NBD allocation context ID. If the request was NOT of type
> -      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
> -    - String, name of the allocation context. This is not required to be
> +    - 32 bits, NBD metadata context ID.
> +    - String, name of the metadata context. This is not required to be
>         a human-readable string, but it MUST be valid UTF-8 data.
>   
> -    Allocation context ID 0 is implied, and always exists. It cannot be
> -    removed.
> +    This specification declares one metadata context. It is called
> +    "BASE:allocation" and contains the basic "exists at all" context.
>   
>   There are a number of error reply types, all of which are denoted by
>   having bit 31 set. All error replies MAY have some data set, in which
> @@ -997,36 +1000,37 @@ case that data is an error message string suitable for display to the user.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -##### Allocation contexts
> +##### Metadata contexts
>   
> -Allocation context 0 is the basic "exists at all" allocation context. If
> -an extent is not allocated at allocation context 0, it MUST NOT be
> -listed as allocated at another allocation context. This supports sparse
> -file semantics on the server side. If a server has only one allocation
> -context (the default), then writing to an extent which is allocated in
> -that allocation context 0 MUST NOT fail with ENOSPC.
> +The "BASE:allocation" metadata context is the basic "exists at all"
> +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> +context, this means that the given extent is not allocated in the
> +backend storage, and that writing to the extent MAY result in the ENOSPC
> +error. This supports sparse file semantics on the server side. If a
> +server has only one metadata context (the default), then writing to an
> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.

this dependence looks strange. user defined metadata, why it affects 
allocation? At least, 'only one' is not descriptive, it would be better 
to mention 'BASE:allocation' name. (I hope, I can ask server to export 
only dirty_bitmap context, not exporting allocation? In that case this 
'only one' would be dirty_bitmap)

>   
>   For all other cases, this specification requires no specific semantics

What are the 'other cases'? All other metadata contexts, or all cases
where we have more than one context?

> -of allocation contexts. Implementations could support allocation
> +of metadata contexts. Implementations could support metadata
>   contexts with semantics like the following:
>   
> -- Incremental snapshots; if a block is allocated in one allocation
> +- Incremental snapshots; if a block is allocated in one metadata
>     context, that implies that it is also allocated in the next level up.
>   - Various bits of data about the backend of the storage; e.g., if the
> -  storage is written on a RAID array, an allocation context could
> +  storage is written on a RAID array, a metadata context could
>     return information about the redundancy level of a given extent
>   - If the backend implements a write-through cache of some sort, or
> -  synchronises with other servers, an allocation context could state
> -  that an extent is "allocated" once it has reached permanent storage
> +  synchronises with other servers, a metadata context could state
> +  that an extent is "active" once it has reached permanent storage
>     and/or is synchronized with other servers.

'Incremental snapshots' sounds strange to me. Snapshots are just
snapshots; a backup may be incremental, but that is not about snapshots. I
think this example can safely be deleted from the spec.

>   
> -The only requirement of an allocation context is that it MUST be
> +The only requirement of a metadata context is that it MUST be
>   representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
>   
>   Likewise, the syntax of query strings is not specified by this document.
>   
>   Server implementations SHOULD document their syntax for query strings
> -and semantics for resulting allocation contexts in a document like this
> +and semantics for resulting metadata contexts in a document like this
>   one.
>   
>   ### Transmission phase
> @@ -1066,7 +1070,7 @@ valid may depend on negotiation during the handshake phase.
>      flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>      `EOVERFLOW` error chunk, if the request length is too large.
>   - bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> -  set, the client is interested in only one extent per allocation
> +  set, the client is interested in only one extent per metadata
>     context.
>   
>   ##### Structured reply flags
> @@ -1462,17 +1466,22 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>       represents a series of consecutive block descriptors where the sum
>       of the lengths of the descriptors MUST not be greater than the
>       length of the original request. This chunk type MUST appear at most
> -    once per allocation ID in a structured reply. Valid as a reply to
> +    once per metadata ID in a structured reply. Valid as a reply to
>       `NBD_CMD_BLOCK_STATUS`.
>   
>       Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> -    allocation context ID, except if the semantics of particular
> -    allocation contexts mean that the information for one allocation
> -    context is implied by the information for another.
> +    metadata context ID, except if the semantics of particular
> +    metadata contexts mean that the information for one active metadata
> +    context is implied by the information for another; e.g., if a
> +    particular metadata context can only have meaning for extents where
> +    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
> +    context, servers MAY omit the relevant chunks for that context if
> +    they already sent an extent with the `NBD_STATE_HOLE` flag set in
> +    reply to the same `NBD_CMD_BLOCK_STATUS` command.

Hmm, stop. Are you saying that the server may omit some status
descriptors for some context? But how? We have no 'offset' field in the
status descriptor. Or can we omit a _context_ if _all_ descriptors of
the allocation context are holes?

Anyway, I'm still against this paragraph. These 8 lines actually say "the
server may omit contexts if it wants". I can always explain that the
semantics of some context imply the metadata (indeed, I can simply
introduce such semantics), but that is no different from "I want to omit
this". In this spec there is no way to negotiate context semantics, so the
client does not actually know them. We just hope that both client and
server are managed by some layer which knows what it is all about.


Also: if this is all about "metadata", the name NBD_CMD_BLOCK_STATUS
becomes not very descriptive... Maybe we should move to something like
just NBD_CMD_GET_METADATA )

>   
>       The payload starts with:
>   
> -        * 32 bits, allocation context ID
> +        * 32 bits, metadata context ID
>   
>       and is followed by a list of one or more descriptors, each with this
>       layout:
> @@ -1493,7 +1502,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>   * `NBD_CMD_BLOCK_STATUS`
>   
>       A block status query request. Length and offset define the range of
> -    interest. Clients SHOULD NOT use this request unless allocation
> +    interest. Clients MUST NOT use this request unless metadata
>       contexts have been negotiated, which in turn requires the client to
>       first negotiate structured replies. For a successful return, the
>       server MUST use a structured reply, containing at least one chunk of
> @@ -1533,7 +1542,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>           or the server could not otherwise determine its status.  Note
>           that the use of `NBD_CMD_TRIM` is related to this status, but
>           that the server MAY report a hole even where trim has not been
> -        requested, and also that a server MAY report allocation even
> +        requested, and also that a server MAY report metadata even
>           where a trim has been requested.
>         - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
>           all zeroes; if clear, the block contents are not known.  Note
> @@ -1547,25 +1556,25 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>           been written; if clear, the block represents a portion of the
>           file that is dirty, or where the server could not otherwise
>           determine its status. The server MUST NOT set this bit for
> -        allocation context 0, where it has no meaning.
> +        the "BASE:allocation" context, where it has no meaning.
>         - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
> -	portion of the file that is "active" in the given allocation
> -	context. The server MUST NOT set this bit for allocation context
> -	0, where it has no meaning.
> +        portion of the file that is "active" in the given metadata
> +        context. The server MUST NOT set this bit for the
> +        "BASE:allocation" context, where it has no meaning.
>   
>       The exact semantics of what it means for a block to be "clean" or
> -    "active" at a given allocation context is not defined by this
> +    "active" at a given metadata context is not defined by this
>       specification, except that the default in both cases should be to
> -    clear the bit. That is, when the allocation context does not have
> +    clear the bit. That is, when the metadata context does not have
>       knowledge of the relevant status for the given extent, or when the
> -    allocation context does not assign any meaning to it, the bits
> +    metadata context does not assign any meaning to it, the bits
>       should be cleared.

And again: if we say that this is user-defined metadata, and it is not
defined in this spec, why define undefined flags? I propose to remove from
here all attempts to define the internals of user-defined metadata; then,
when we end up with a clean and simple spec of the feature, we can add a
separate paragraph defining a metadata context for dirty bitmaps, like
this:

====
Metadata contexts for NBD_CMD_...., whose names do not start with
"BASE:", are defined by third-party tools; but, to avoid conflicts and to
have common documentation, it is recommended to publish their names and
short descriptions here.

QEMU_DIRTY_BITMAP:<name>  - a family of dirty-bitmap contexts, defined by
Qemu. A dirty bitmap is ....; it is used for ..., by some-company.
                         status extent flags:
                                  bit 0: 0 - means dirty  [If you want,
but I prefer 0 - clean, 1 - dirty]
                                           1 - means clean
                                  bits 1-31: reserved, always zero.
RANDOM_DATA  - just random data
                         status extent flags:
                                 all 32 bits: random data
...
====

>   
>       A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>       set and `NBD_STATE_ZERO` clear.
>   
>   A client MAY close the connection if it detects that the server has
> -sent an invalid chunks (such as lengths in the
> +sent an invalid chunk (such as lengths in the
>   `NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
>   The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
>   request including one or more sectors beyond the size of the device.
> @@ -1574,7 +1583,7 @@ The extension adds the following new command flag:
>   
>   - `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
>     SHOULD be set to 1 if the client wants to request information for only
> -  one extent per allocation context.
> +  one extent per metadata context.
>   
>   ## About this file
>   
>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 12:57 ` [Qemu-devel] " Alex Bligh
  2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
@ 2016-12-01 23:42   ` John Snow
  2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
  2016-12-02 18:45     ` Alex Bligh
  1 sibling, 2 replies; 37+ messages in thread
From: John Snow @ 2016-12-01 23:42 UTC (permalink / raw)
  To: Alex Bligh, Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	mpa, Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

Hi Alex, let me try my hand at clarifying some points...

On 11/29/2016 07:57 AM, Alex Bligh wrote:
> Vladimir,
> 
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
> 
> Rather than comment on many of the points individually, the root
> of my confusion, and to some extent my discomfort with this
> proposal, is 'who owns the meaning of the bitmaps'.
> 
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read-only disks, but the text talks about:
> 

I sent an earlier email, in response to Wouter, about our exact "goal"
with this spec extension; it is a little more of an overview, but I will
try to address your questions specifically here.

> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> 
> How can data be 'dirty' if it is static and unchangeable? (I thought)
> 

In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What
happens in this scenario would be that a snapshot of the data at the
time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is
RO. While a hypothetical backup client is busy copying data out of this
temporary view, new writes are coming in to the drive, but are not being
exposed through the NBD export.

(This goes into QEMU-specifics, but those new writes are dirtying a
version of the bitmap not intended to be exposed via the NBD channel.
NBD gets effectively a snapshot of both the bitmap AND the data.)

> I now think what you are talking about is backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.
> 
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
> 

You know, this is a fair point. We have not (to my knowledge) yet
carefully considered the exact bitmap management scenario when NBD is
involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not
yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial
backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has
changed and would like to use NBD to do so, in contrast to QEMU's own
facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot
('fleecing') and exports this snapshot over NBD. The bitmap is
associated with the NBD export at this point at NBD server startup. (For
the sake of QEMU discussion, maybe this command is "blockdev-fleece")

(5) At this moment, the snapshot is static and represents the data at
the time the NBD server was started. The bitmap is also forked and
represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task,
whatever it was. (Making a new incremental backup? Just taking a peek at
some changed data? who knows.)

The interesting question here is what we do with the two bitmaps at this
point. The data delta can be discarded (this was after all just a
lightweight read-only point-in-time snapshot) but the bitmap data needs
to be dealt with.

(A) In the case of "User made a new incremental backup," the bitmap that
got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

I don't advise a hybrid "User copied some data, but not all" case, where
we need to partially clear *and* merge, but conceivably this could
happen, because the things we don't want to happen always will.

At this point maybe it's becoming obvious that it would actually be very
prudent to allow the NBD client itself to inform QEMU via the NBD
protocol which extents/blocks/(etc.) it is "done" with.

Maybe it *would* actually be useful if, in NBD allowing us to add a
"dirty" bit to the specification, we allow users to clear those bits.

Then, whether the user was trying to do (A) or (B) or the unspeakable
amalgamation of both things, it's up to the user to clear the bits
desired and QEMU can do the simple task of simply always merging the
bitmap fork upon the conclusion of the NBD fleecing exercise.
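
To pin down what "fork" and "merge" mean here, a toy model; QEMU's real
dirty bitmap code is nothing this naive, and the struct and helpers are
invented for illustration:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct bitmap {
    uint64_t *bits;
    size_t    words;
};

/* Fork at NBD export start: the frozen copy keeps the accumulated
 * dirty state for the export, while the live bitmap restarts clean so
 * new guest writes keep being tracked. */
static int bitmap_fork(struct bitmap *live, struct bitmap *frozen)
{
    frozen->words = live->words;
    frozen->bits = malloc(frozen->words * sizeof(uint64_t));
    if (!frozen->bits)
        return -1;
    memcpy(frozen->bits, live->bits, frozen->words * sizeof(uint64_t));
    memset(live->bits, 0, live->words * sizeof(uint64_t));
    return 0;
}

/* Case (B) above, or a failed backup: fold the frozen bits back in,
 * as if the fork never happened. Case (A) simply frees 'frozen'. */
static void bitmap_merge(struct bitmap *live, const struct bitmap *frozen)
{
    for (size_t i = 0; i < live->words; i++)
        live->bits[i] |= frozen->bits[i];
}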

Maybe this would allow the dirty bit to have a bit more concrete meaning
for the NBD spec: "The bit stays dirty until the user clears it, and is
set when the matching block/extent/etc is written to."

With an exception that external management may cause the bits to clear.
(I.e., someone fiddles with the backing store in a way opaque to NBD,
e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
> 
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
> 
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'). If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
> 

If we can't work out a more precise, semantically meaningful spec
extension then I am very happy with simply the ability to have a user
bit, or X-QEMU bits, etc etc etc.

> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
> 
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.
> 
> Some more nits:
> 
>> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in the master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> 
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
> 
> 
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>   A: 1: server replies with status descriptors of any size, granularity
>>         is hidden from the client
>>      2: dirty/allocated requests are separate and unrelated to each
>>         other, so their granularities are not intersecting
> 
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
> 
>> 3. Q: selecting of dirty bitmap to export
>>   A: several variants:
>>      1: id of bitmap is in flags field of request
>>          pros: - simple
>>          cons: - it's a hack. flags field is for other uses.
>>                - we'll have to map bitmap names to these "ids"
>>      2: introduce extended nbd requests with variable length and exploit this
>>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>>         pros: - looks like a true way
>>         cons: - we have to create additional extension
>>               - possible we have to create a map,
>>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>>      3: external tool should select which bitmap to export. So, in the case of Qemu
>>         it should be something like qmp command block-export-dirty-bitmap.
>>         pros: - simple
>>               - we can extend it to behave like (2) later
>>         cons: - additional qmp command to implement (possibly, the lesser evil)
>>         note: Hmm, the external tool can choose between allocated/dirty data too,
>>               so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
> 
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
> 

Having missed most of the discussion on v1/v2, is it a given that we
want in-band identification of bitmaps?

I guess this might depend very heavily on the nature of the definition
of the "dirty bit" in the NBD spec.

>> 4. Q: Shouldn't get_{allocated,dirty} be separate commands?
>>   cons: Two commands with almost the same semantics and similar means?
>>   pros: However, there is a good point in separating the clearly defined,
>>         native-to-block-devices GET_BLOCK_STATUS from the user-driven and
>>         actually undefined data called 'dirtiness'.
> 
> I'm suggesting one generic 'read bitmap' command like you.
> 
>> 5. The number of status descriptors sent by the server should be restricted
>>   variants:
>>   1: just allow the server to restrict this as it wants (which was done in v3)
>>   2: (not excluding 1). The client somehow specifies the maximum number
>>      of descriptors.
>>      2.1: add a command flag which will request only one descriptor
>>           (otherwise, no restrictions from the client)
>>      2.2: again, introduce extended nbd requests, and add a field to
>>           specify this maximum
> 
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.
> 
>>
>> 6. Q: What to do with unspecified flags (in request/reply)?
>>   I think the normal variant is to make them reserved. (The server should
>>   return EINVAL if it finds unknown bits; the client should consider a reply
>>   with unknown bits an error.)
> 
> Yeah.
> 
>>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> 
> MUST NOT is what we say elsewhere I believe.
> 
>> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> 
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured  (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.
> 
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST not be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than required. However the server MUST return at
>> +    least one status descriptor
> 
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.
> 
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> 
> Surely better would be an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
> 

Anyway, I hope I am being useful and not just more confounding. It seems
to me that we're having difficulty conveying precisely what it is we're
trying to accomplish, so I hope that I am making a good effort in
elaborating on our goals/requirements.

If not, well. You've got my address for hatemail :)

Thanks,
--John


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
@ 2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
  2016-12-02 18:45     ` Alex Bligh
  1 sibling, 0 replies; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-02  9:16 UTC (permalink / raw)
  To: John Snow, Alex Bligh
  Cc: nbd-general, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	mpa, Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

02.12.2016 02:42, John Snow wrote:
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
A currently existing example: "failed incremental backup".


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
@ 2016-12-02  9:25             ` Wouter Verhelst
  0 siblings, 0 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-02  9:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Stefan Hajnoczi, nbd-general, kwolf, qemu-devel, pborzenkov, den,
	mpa, pbonzini

Hi Vladimir,

On Thu, Dec 01, 2016 at 02:26:28PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 01.12.2016 13:14, Wouter Verhelst wrote:
[...]
> > -    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> > +    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> >         selected by the query string is added to the list of existing
> > -      allocation contexts.
> 
> If I understand correctly, it should not be 'existing' but 'exporting'.
> There are several contexts the server knows about; they definitely exist.
> Some of them may be selected (by the client) for export (to the client,
> through get_block_status).
> 
> So, what about 'list of metadata contexts to export', or something like this?

Yes, good idea. Thanks.

[...]
> > +The "BASE:allocation" metadata context is the basic "exists at all"
> > +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> > +context, this means that the given extent is not allocated in the
> > +backend storage, and that writing to the extent MAY result in the ENOSPC
> > +error. This supports sparse file semantics on the server side. If a
> > +server has only one metadata context (the default), then writing to an
> > +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> This dependency looks strange: this is user-defined metadata, so why
> does it affect allocation?

The reference nbd-server implementation (i.e., the one that accompanies
this document) has a "copy-on-write" mode in which modifications are
written to a separate file. This separate file can easily be a sparse
file.

Let's say you have an export which is fully allocated; the
BASE:allocation state would have STATE_HOLE cleared. However, when
writing to a particular block that has not yet been written to during
that session, the copy-on-write mode would have to allocate a new block
in the copy-on-write file, which may fail with ENOSPC.
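
In toy form, with invented names (the real nbd-server code differs; this
only shows where the failure comes from):

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/* 'written' is a caller-provided bitmap of blocks already present in
 * the copy-on-write diff file. */
static int cow_write_block(int diff_fd, uint8_t *written,
                           uint64_t block, const void *data)
{
    off_t off = (off_t)block * BLOCK_SIZE;

    if (!(written[block / 8] & (1u << (block % 8)))) {
        /* First write to this block in the session: the diff file
         * needs a fresh allocation, and this is what can fail with
         * ENOSPC even though the export itself is fully allocated. */
        int err = posix_fallocate(diff_fd, off, BLOCK_SIZE);
        if (err != 0)
            return -err;
        written[block / 8] |= (uint8_t)(1u << (block % 8));
    }
    if (pwrite(diff_fd, data, BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
        return -errno;
    return 0;
}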

Perhaps the above paragraph could be updated in the sense that it SHOULD
NOT fail in other cases, but that it may depend on the semantics of the
other metadata contexts, whether active (i.e., selected with
NBD_OPT_META_CONTEXT) or not.

Side note: that does mean that some metadata contexts may be specific to
particular exports (e.g., you may have a list of named snapshots to
select, and which ones would exist would depend on the specific export
chosen). I guess that means the metadata contexts should have the name
of the export, too.

> At least, 'only one' is not descriptive; it would be better to mention
> the 'BASE:allocation' name.

Yes, that would make things clearer, indeed.

> (I hope I can ask the server to export only the dirty_bitmap context,
> without exporting allocation?

That was the point of naming that context and making it selectable :-)

> In that case this 'only one' would be dirty_bitmap)
> 
> >   For all other cases, this specification requires no specific semantics
> 
> What are 'other cases'? For all other metadata contexts? Or for all
> cases when we have more than one context?

Other metadata contexts. I'll rephrase it in that sense.

> > -of allocation contexts. Implementations could support allocation
> > +of metadata contexts. Implementations could support metadata
> >   contexts with semantics like the following:
> >   
> > -- Incremental snapshots; if a block is allocated in one allocation
> > +- Incremental snapshots; if a block is allocated in one metadata
> >     context, that implies that it is also allocated in the next level up.
> >   - Various bits of data about the backend of the storage; e.g., if the
> > -  storage is written on a RAID array, an allocation context could
> > +  storage is written on a RAID array, a metadata context could
> >     return information about the redundancy level of a given extent
> >   - If the backend implements a write-through cache of some sort, or
> > -  synchronises with other servers, an allocation context could state
> > -  that an extent is "allocated" once it has reached permanent storage
> > +  synchronises with other servers, a metadata context could state
> > +  that an extent is "active" once it has reached permanent storage
> >     and/or is synchronized with other servers.
> 
> Incremental snapshots sound strange to me. Snapshots are just
> snapshots. A backup may be incremental, but it is not about snapshots.
> I think this example can safely be deleted from the spec.

Yeah, I'll just remove all the examples; they're not really critical,
anyway, and might indeed confuse people.

[...]
> >       Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> > -    allocation context ID, except if the semantics of particular
> > -    allocation contexts mean that the information for one allocation
> > -    context is implied by the information for another.
> > +    metadata context ID, except if the semantics of particular
> > +    metadata contexts mean that the information for one active metadata
> > +    context is implied by the information for another; e.g., if a
> > +    particular metadata context can only have meaning for extents where
> > +    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
> > +    context, servers MAY omit the relevant chunks for that context if
> > +    they already sent an extent with the `NBD_STATE_HOLE` flag set in
> > +    reply to the same `NBD_CMD_BLOCK_STATUS` command.
> 
> Hmm, stop. Are you saying that the server may omit some status
> descriptors for some context? But how? We have no 'offset' field in
> the status descriptor.

Ah. Yes. I'm an idiot :-)
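
Right -- in the draft a status descriptor is just a (length, flags)
pair, and extents are implicitly contiguous from the requested offset,
so an omitted descriptor would silently shift everything after it. A
sketch (the struct follows the draft's wire layout; the walker is
merely illustrative):

    #include <inttypes.h>
    #include <stdio.h>

    /* One status descriptor: 32-bit length plus 32-bit status flags,
     * big-endian on the wire. Note: no offset field. */
    struct nbd_block_descriptor {
        uint32_t length;
        uint32_t flags;   /* e.g. NBD_STATE_HOLE, NBD_STATE_ZERO */
    };

    /* The client reconstructs offsets cumulatively, so descriptor N
     * starts exactly where descriptor N-1 ended. */
    static void walk_extents(uint64_t req_offset,
                             const struct nbd_block_descriptor *d,
                             size_t n)
    {
        uint64_t off = req_offset;
        for (size_t i = 0; i < n; i++) {
            printf("extent %" PRIu64 "+%" PRIu32 " flags 0x%" PRIx32 "\n",
                   off, d[i].length, d[i].flags);
            off += d[i].length;
        }
    }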

> Or can we omit a _context_ if _all_ descriptors of that allocation
> context are holes?
> 
> Anyway, I'm still against this paragraph. These 8 lines actually say
> "the server may omit contexts if it wants". I can always claim that
> the semantics of some context mean that its metadata is implied
> (actually, I can just introduce such semantics), but that does not
> differ from "I want to omit this". In this spec there is no way to
> negotiate context semantics, so the client actually doesn't know
> them. We just hope that both client and server are managed by some
> layer which knows what it is all about.

Yes, I suppose you're right. I'll remove it. It never hurts to send too
much information, and otherwise clients might have to become "too smart"
to understand things properly.

> Also: if it is all about "metadata", the name NBD_CMD_BLOCK_STATUS
> becomes not very descriptive... Maybe we should move to something like
> just NBD_CMD_GET_METADATA )

I don't think that's critical (and I'd rather not rename the branch ;-P )

[...]
> >       The exact semantics of what it means for a block to be "clean" or
> > -    "active" at a given allocation context is not defined by this
> > +    "active" at a given metadata context is not defined by this
> >       specification, except that the default in both cases should be to
> > -    clear the bit. That is, when the allocation context does not have
> > +    clear the bit. That is, when the metadata context does not have
> >       knowledge of the relevant status for the given extent, or when the
> > -    allocation context does not assign any meaning to it, the bits
> > +    metadata context does not assign any meaning to it, the bits
> >       should be cleared.
> 
> And again: if we said that it is user-defined metadata, and it is not
> defined in this spec, why define undefined flags? I propose to remove
> all attempts to define the internals of user-defined metadata here,
> and then, when we finish up with a clean and simple spec of the
> feature, we can add a separate paragraph which defines a metadata
> context for dirty bitmaps like this:

Good point. I'll leave the "bits should be cleared" bit in there,
though.

> ====
> Metadata contexts for NBD_CMD_...., with names not starting with
> "BASE:", are defined by third-party tools, but to avoid conflicts and
> to have common documentation it is recommended to publish their names
> and short descriptions here.
> 
> QEMU_DIRTY_BITMAP:<name> - a family of dirty-bitmap contexts, defined
> by Qemu. A dirty bitmap is ...., It is used for ..., by some-company.
>     status extent flags:
>         bit 0: 0 - means dirty  [if you want,
>                                  but I prefer 0 - clean, 1 - dirty]
>                1 - means clean
>         bits 1-31: reserved, always zero.
> RANDOM_DATA - just random data
>     status extent flags:
>         all 32 bits: random data
> ....
> ====

I don't think semantics of third-party implementations' metadata modes
should be part of this document, but I would be happy to add a link to a
document describing such semantics.

Thanks for your review!

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
  2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
@ 2016-12-02 18:45     ` Alex Bligh
  2016-12-02 20:39       ` John Snow
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
  1 sibling, 2 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-02 18:45 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, Vladimir Sementsov-Ogievskiy, nbd-general,
	Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini

John,

>> +Some storage formats and operations over such formats express a
>> +concept of data dirtiness. Whether the operation is block device
>> +mirroring, incremental block device backup or any other operation with
>> +a concept of data dirtiness, they all share a need to provide a list
>> +of ranges that this particular operation treats as dirty.
>> 
>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>> 
> 
> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
> the VM and cause the bitmap that QEMU manages to become dirty.
> 
> We intend to expose the ability to fleece dirty blocks via NBD. What
> happens in this scenario would be that a snapshot of the data at the
> time of the request is exported over NBD in a read-only manner.
> 
> In this way, the drive itself is R/W, but the "view" of it from NBD is
> RO. While a hypothetical backup client is busy copying data out of this
> temporary view, new writes are coming in to the drive, but are not being
> exposed through the NBD export.
> 
> (This goes into QEMU-specifics, but those new writes are dirtying a
> version of the bitmap not intended to be exposed via the NBD channel.
> NBD gets effectively a snapshot of both the bitmap AND the data.)

Thanks. That makes sense - or enough sense for me to carry on commenting!

>> I now think what you are talking about is backing up a *snapshot* of a disk
>> that's running, where the disk itself was not connected using NBD? IE it's
>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>> an opaque state represented in a bitmap, which is binary metadata
>> at some particular level of granularity. It might as well be 'happiness'
>> or 'is coloured blue'. The NBD server would (normally) have no way of
>> manipulating this bitmap.
>> 
>> In previous comments, I said 'how come we can set the dirty bit through
>> writes but can't clear it?'. This (my statement) is now I think wrong,
>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>> state of the bitmap comes from whatever sets the bitmap which is outside
>> the scope of this protocol to transmit it.
>> 
> 
> You know, this is a fair point. We have not (to my knowledge) yet
> carefully considered the exact bitmap management scenario when NBD is
> involved in retrieving dirty blocks.
> 
> Humor me for a moment while I talk about a (completely hypothetical, not
> yet fully discussed) workflow for how I envision this feature.
> 
> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> backup is made, etc.
> 
> (2) As writes come in, QEMU's bitmap is dirtied.
> 
> (3) The user decides they want to root around to see what data has
> changed and would like to use NBD to do so, in contrast to QEMU's own
> facilities for dumping dirty blocks.
> 
> (4) A command is issued that creates a temporary, lightweight snapshot
> ('fleecing') and exports this snapshot over NBD. The bitmap is
> associated with the NBD export at this point at NBD server startup. (For
> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> 
> (5) At this moment, the snapshot is static and represents the data at
> the time the NBD server was started. The bitmap is also forked and
> represents only this snapshot. The live data and bitmap continue to change.
> 
> (6) Dirty blocks are queried and copied out via NBD.
> 
> (7) The user closes the NBD instance upon completion of their task,
> whatever it was. (Making a new incremental backup? Just taking a peek at
> some changed data? who knows.)
> 
> The point that's interesting here is what do we do with the two bitmaps
> at this point? The data delta can be discarded (this was after all just
> a lightweight read-only point-in-time snapshot) but the bitmap data
> needs to be dealt with.
> 
> (A) In the case of "User made a new incremental backup," the bitmap that
> got forked off to serve the NBD read should be discarded.
> 
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
> 
> I don't advise a hybrid "User copied some data, but not all" case,
> where we need to partially clear *and* merge, but conceivably this
> could happen, because the things we don't want to happen always will.
> 
> At this point maybe it's becoming obvious that actually it would be very
> prudent to allow the NBD client itself to inform QEMU via the NBD
> protocol which extents/blocks/(etc) it is "done" with.
> 
> Maybe it *would* actually be useful if, in adding a "dirty" bit to the
> NBD specification, we also allow users to clear those bits.
> 
> Then, whether the user was trying to do (A) or (B) or the unspeakable
> amalgamation of both things, it's up to the user to clear the bits
> desired and QEMU can do the simple task of simply always merging the
> bitmap fork upon the conclusion of the NBD fleecing exercise.
> 
> Maybe this would allow the dirty bit to have a bit more concrete meaning
> for the NBD spec: "The bit stays dirty until the user clears it, and is
> set when the matching block/extent/etc is written to."
> 
> With an exception that external management may cause the bits to clear.
> (I.e., someone fiddles with the backing store in a way opaque to NBD,
> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

There is currently one possible "I'm done with the entire bitmap"
signal, which is closing the connection. This has two obvious
problems. Firstly, if used, it discards the entire bitmap (not
individual bits).
Secondly, it makes recovery from a broken TCP session difficult
(as either you treat a dirty close as meaning the bitmap needs
to hang around, in which case you have a garbage collection issue,
or you treat it as needing to drop the bitmap, in which case you
can't recover).

I think in your plan the block status doesn't change once the bitmap
is forked. In that case, adding some command (optional) to change
the status of the bitmap (or simply to set a given extent to status X)
would be reasonable. Of course whether it's supported could be dependent
on the bitmap.
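
Concretely, such an optional command might carry a payload like this
(purely hypothetical -- nothing of the sort is in the draft, and every
name here is invented):

    #include <stdint.h>

    /* Hypothetical request payload for setting/clearing status bits
     * over one extent of one negotiated metadata context. */
    struct nbd_cmd_set_status {
        uint64_t offset;      /* start of the extent */
        uint32_t length;      /* length of the extent */
        uint32_t context_id;  /* which negotiated metadata context */
        uint32_t flags;       /* new status bits, e.g. "dirty" cleared */
    };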

> Having missed most of the discussion on v1/v2, is it a given that we
> want in-band identification of bitmaps?
> 
> I guess this might depend very heavily on the nature of the definition
> of the "dirty bit" in the NBD spec.

I don't think it's a given. I think Wouter & I came up with it at
the same time as a way to abstract the bitmap/extent concept and
remove the need to specify a dirty bit at all (well, that's my excuse
anyway).

> Anyway, I hope I am being useful and just not more confounding. It seems
> to me that we're having difficulty conveying precisely what it is we're
> trying to accomplish, so I hope that I am making a good effort in
> elaborating on our goals/requirements.

Yes absolutely. I think part of the challenge is that you are quite
reasonably coming at it from the point of view of qemu's particular
need, and I'm coming at it from 'what should the nbd protocol look
like in general' position, having done lots of work on the protocol
docs (though I'm an occasional qemu contributor). So there's necessarily
a gap of approach to be bridged.

I'm overdue on a review of Wouter's latest patch (partly because I need
to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
but I think it's a bridge worth building.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 18:45     ` Alex Bligh
@ 2016-12-02 20:39       ` John Snow
  2016-12-03 11:08         ` Alex Bligh
                           ` (2 more replies)
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
  1 sibling, 3 replies; 37+ messages in thread
From: John Snow @ 2016-12-02 20:39 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Vladimir Sementsov-Ogievskiy, nbd-general, Kevin Wolf,
	Stefan Hajnoczi, qemu-devel, Markus Pargmann,
	Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst, Paolo Bonzini



On 12/02/2016 01:45 PM, Alex Bligh wrote:
> John,
> 
[...]
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!
> 

Whew! I'm glad.

[...]
> 
> There is currently one possible "I'm done with the entire bitmap"
> signal, which is closing the connection. This has two obvious
> problems. Firstly, if used, it discards the entire bitmap (not
> individual bits).
> Secondly, it makes recovery from a broken TCP session difficult
> (as either you treat a dirty close as meaning the bitmap needs
> to hang around, in which case you have a garbage collection issue,
> or you treat it as needing to drop the bitmap, in which case you
> can't recover).
> 

In my mind, I wasn't treating closing the connection as the end of the
point-in-time snapshot; that would be stopping the export.

I wouldn't advocate for a control channel (QEMU, here) clearing the
bitmap just because a client disappeared.

Either:

(A) QEMU clears the bitmap because the NBD export was *stopped*, or
(B) QEMU, acting as the NBD server, clears the bitmap as instructed by
the NBD client, if we admit a provision to clear bits from the NBD
protocol itself.

I don't think there's room for the NBD server (QEMU) deciding to clear
bits based on connection status. It has to be an explicit decision --
either via NBD or QMP.

> I think in your plan the block status doesn't change once the bitmap
> is forked. In that case, adding some command (optional) to change
> the status of the bitmap (or simply to set a given extent to status X)
> would be reasonable. Of course whether it's supported could be dependent
> on the bitmap.
> 

What I describe as "forking" was kind of a bad description. What really
happens when we have a divergence is that the bitmap with data is split
into two bitmaps that are related:

- A new bitmap is created and takes over for the old bitmap. This new
bitmap is empty. It records writes on the live version of the data.
- The old bitmap as it existed remains in a read-only state, and
describes some point-in-time snapshot view of the data.

In the case of an incremental backup, once we've made a backup of the
data, that read-only bitmap can actually be discarded without further
thought.

In the case of a failed incremental backup, or in the case of "I just
wanted to look and see what has changed, but wasn't prepared to reset
the counter yet," this bitmap gets merged back with the live bitmap as
if nothing ever happened.
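
In pseudo-C, the lifecycle I have in mind is roughly this (a sketch
only -- not QEMU's actual bitmap API):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct Bitmap {
        uint64_t *bits;
        size_t    nwords;
        bool      readonly;
    } Bitmap;

    /* Divergence: freeze the old bitmap as the point-in-time view and
     * return a fresh, empty bitmap that records live writes. */
    static Bitmap *bitmap_split(Bitmap *old)
    {
        Bitmap *live = calloc(1, sizeof(*live));
        live->nwords = old->nwords;
        live->bits = calloc(live->nwords, sizeof(uint64_t));
        old->readonly = true;   /* describes only the snapshot now */
        return live;
    }

    /* Failed backup / "just looking": OR the frozen bits back into
     * the live bitmap, as if nothing ever happened. */
    static void bitmap_merge(Bitmap *live, const Bitmap *frozen)
    {
        for (size_t i = 0; i < live->nwords; i++)
            live->bits[i] |= frozen->bits[i];
    }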

ANYWAY, allowing the NBD client to request bits be cleared has an
obvious use case even for QEMU, IMO -- which is, the NBD client itself
gains the ability to, without relying on a control plane to the server,
decide for itself if it is going to "make a backup" or "just look around."

The client gains the ability to leave the bitmap alone (QEMU will
re-merge it later once the snapshot is closed) or the ability to clear
it ("I made my backup, we're done with this.")

That usefulness would allow us to have an explicit dirty bit mechanism
directly in NBD, IMO, because:

(1) A RW NBD server has enough information to mark a bit dirty
(2) Since there exists an in-spec mechanism to reset the bitmap, the
dirty bit is meaningful to the server and the client

>> Having missed most of the discussion on v1/v2, is it a given that we
>> want in-band identification of bitmaps?
>>
>> I guess this might depend very heavily on the nature of the definition
>> of the "dirty bit" in the NBD spec.
> 
> I don't think it's a given. I think Wouter & I came up with it at
> the same time as a way to abstract the bitmap/extent concept and
> remove the need to specify a dirty bit at all (well, that's my excuse
> anyway).
> 

OK. We do certainly support multiple bitmaps being active at a time in
QEMU, but I had personally always envisioned that you'd associate them
one-at-a-time when starting the NBD export of a particular device.

I don't have a use case in my head where two distinct bitmaps being
exposed simultaneously offer any particular benefit, but maybe there is
something. I'm sure there is.

I will leave this aspect of it more to you NBD folks. I think QEMU could
cope with either.

(Vladimir, am I wrong? Do you have thoughts on this in particular? I
haven't thought through this aspect of it very much.)

>> Anyway, I hope I am being useful and just not more confounding. It seems
>> to me that we're having difficulty conveying precisely what it is we're
>> trying to accomplish, so I hope that I am making a good effort in
>> elaborating on our goals/requirements.
> 
> Yes absolutely. I think part of the challenge is that you are quite
> reasonably coming at it from the point of view of qemu's particular
> need, and I'm coming at it from 'what should the nbd protocol look
> like in general' position, having done lots of work on the protocol
> docs (though I'm an occasional qemu contributor). So there's necessarily
> a gap of approach to be bridged.
> 

Yeah, I understand quite well that we need to make sure the NBD spec is
sane and useful in a QEMU-agnostic way, so my goal here is just to help
elucidate our needs to enable you to reach a good consensus.

> I'm overdue on a review of Wouter's latest patch (partly because I need
> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> but I think it's a bridge worth building.
> 

Same. Thank you for your patience!

Cheers,
--js

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 20:39       ` John Snow
@ 2016-12-03 11:08         ` Alex Bligh
  2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
  2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-03 11:08 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, Vladimir Sementsov-Ogievskiy, nbd-general,
	Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini


> On 2 Dec 2016, at 20:39, John Snow <jsnow@redhat.com> wrote:
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The obvious case is an allocation bitmap and a dirty bitmap.

It's possible one might want more than one dirty bitmap at once.
Perhaps two sorts of backup, or perhaps live migration of storage
backed by NBD, or perhaps inspecting the *state* of live migration
via NBD and a bitmap, or perhaps determining which extents in
a QCOW image are in that image itself (as opposed to the image
on which it is based).

I tried to pick some QEMU-like ones, but I am sure there are
examples that would work outside of QEMU.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 20:39       ` John Snow
  2016-12-03 11:08         ` Alex Bligh
@ 2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
  2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2 siblings, 0 replies; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-05  8:36 UTC (permalink / raw)
  To: John Snow, Alex Bligh
  Cc: nbd-general, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini

02.12.2016 23:39, John Snow wrote:
[...]
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
>
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.
>
> I will leave this aspect of it more to you NBD folks. I think QEMU could
> cope with either.
>
> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
> haven't thought through this aspect of it very much.)

I'm OK with either too.

Yes, with online external backup (fleecing), the bitmap is already
selected in Qemu. And this is the most interesting case, anyway.

For offline external backup, when we have an RO disk, it may have
several bitmaps, for example for different backup frequencies, and it
may not be bad for the client to have the ability to choose. But
again, we can select the exported bitmap through QMP (even the same
fleecing scheme would be OK, with the overhead of creating an empty
delta).


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 20:39       ` John Snow
  2016-12-03 11:08         ` Alex Bligh
  2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
@ 2016-12-06 13:32         ` Wouter Verhelst
  2016-12-06 16:39           ` John Snow
  2 siblings, 1 reply; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-06 13:32 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, qemu-devel, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Markus Pargmann,
	Denis V. Lunev

Hi John

Sorry for the late reply; the weekend was busy, and so was Monday.

On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
[...]
> ANYWAY, allowing the NBD client to request bits be cleared has an
> obvious use case even for QEMU, IMO -- which is, the NBD client itself
> gains the ability to, without relying on a control plane to the server,
> decide for itself if it is going to "make a backup" or "just look around."
> 
> The client gains the ability to leave the bitmap alone (QEMU will
> re-merge it later once the snapshot is closed) or the ability to clear
> it ("I made my backup, we're done with this.")
> 
> That usefulness would allow us to have an explicit dirty bit mechanism
> directly in NBD, IMO, because:
> 
> (1) A RW NBD server has enough information to mark a bit dirty
> (2) Since there exists an in-spec mechanism to reset the bitmap, the
> dirty bit is meaningful to the server and the client

While I can see that the ability to manipulate metadata might have
advantages for certain use cases, I don't think that the ability to
*inspect* metadata should require the ability to manipulate it in any
way.

So I'd like to finish the block_status extension before moving on to
manipulation :)

> >> Having missed most of the discussion on v1/v2, is it a given that we
> >> want in-band identification of bitmaps?
> >>
> >> I guess this might depend very heavily on the nature of the definition
> >> of the "dirty bit" in the NBD spec.
> > 
> > I don't think it's a given. I think Wouter & I came up with it at
> > the same time as a way to abstract the bitmap/extent concept and
> > remove the need to specify a dirty bit at all (well, that's my excuse
> > anyway).
> > 
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The ability to do something does not in any way imply the requirement to
do the same :-)

The idea is that the client negotiates one or more forms of metadata
information from the server that it might be interested in, and then
asks the server for that information for a given extent in which it
has an interest.

The protocol spec does not define what that metadata is (beyond the "is
allocated" one that we define in the spec currently, and possibly
something else in the future). So if qemu only cares about just one type
of metadata, there's no reason why it should *have* to export more than
one type.
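
As a toy model, a client that selected two contexts would see one
NBD_REPLY_TYPE_BLOCK_STATUS chunk per context for a single
NBD_CMD_BLOCK_STATUS request -- something like this (the IDs and the
flag meanings are made up; the draft leaves both to negotiation and to
the context's own definition):

    #include <inttypes.h>
    #include <stdio.h>

    /* Context IDs are assigned during negotiation; values invented. */
    enum { CTX_BASE_ALLOCATION = 1, CTX_QEMU_DIRTY_BITMAP = 2 };

    struct chunk {            /* one NBD_REPLY_TYPE_BLOCK_STATUS chunk */
        uint32_t context_id;
        uint32_t length;      /* single-extent chunk, for brevity */
        uint32_t flags;       /* meaning depends on the context */
    };

    int main(void)
    {
        struct chunk reply[] = {
            { CTX_BASE_ALLOCATION,   65536, 0 },  /* allocated */
            { CTX_QEMU_DIRTY_BITMAP, 65536, 1 },  /* dirty, say */
        };
        for (size_t i = 0; i < sizeof(reply) / sizeof(reply[0]); i++)
            printf("ctx %" PRIu32 ": len %" PRIu32 " flags %" PRIu32 "\n",
                   reply[i].context_id, reply[i].length, reply[i].flags);
        return 0;
    }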

[...]
> Yeah, I understand quite well that we need to make sure the NBD spec is
> sane and useful in a QEMU-agnostic way, so my goal here is just to help
> elucidate our needs to enable you to reach a good consensus.

Right, that's why I was reluctant to merge the original spec as it
stood.

> > I'm overdue on a review of Wouter's latest patch (partly because I need
> > to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> > but I think it's a bridge worth building.
> > 
> 
> Same. Thank you for your patience!

I can do some updates given a few of the suggestions that were made on
this list (no guarantee when that will happen), but if people are
interested in reviewing things in the meantime, be my guest...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-12-06 16:39           ` John Snow
  0 siblings, 0 replies; 37+ messages in thread
From: John Snow @ 2016-12-06 16:39 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	Stefan Hajnoczi, qemu-devel, Pavel Borzenkov,
	Alex Bligh, Denis V. Lunev, Markus Pargmann, Paolo Bonzini



On 12/06/2016 08:32 AM, Wouter Verhelst wrote:
> Hi John
> 
> Sorry for the late reply; the weekend was busy, and so was Monday.
> 

No problems.

> On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
>> On 12/02/2016 01:45 PM, Alex Bligh wrote:
>>> John,
>>>
>>>>> +Some storage formats and operations over such formats express a
>>>>> +concept of data dirtiness. Whether the operation is block device
>>>>> +mirroring, incremental block device backup or any other operation with
>>>>> +a concept of data dirtiness, they all share a need to provide a list
>>>>> +of ranges that this particular operation treats as dirty.
>>>>>
>>>>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>>>>>
>>>>
>>>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
>>>> the VM and cause the bitmap that QEMU manages to become dirty.
>>>>
>>>> We intend to expose the ability to fleece dirty blocks via NBD. What
>>>> happens in this scenario would be that a snapshot of the data at the
>>>> time of the request is exported over NBD in a read-only manner.
>>>>
>>>> In this way, the drive itself is R/W, but the "view" of it from NBD is
>>>> RO. While a hypothetical backup client is busy copying data out of this
>>>> temporary view, new writes are coming in to the drive, but are not being
>>>> exposed through the NBD export.
>>>>
>>>> (This goes into QEMU-specifics, but those new writes are dirtying a
>>>> version of the bitmap not intended to be exposed via the NBD channel.
>>>> NBD gets effectively a snapshot of both the bitmap AND the data.)
>>>
>>> Thanks. That makes sense - or enough sense for me to carry on commenting!
>>>
>>
>> Whew! I'm glad.
>>
>>>>> I now think what you are talking about backing up a *snapshot* of a disk
>>>>> that's running, where the disk itself was not connected using NBD? IE it's
>>>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>>>>> an opaque state represented in a bitmap, which is binary metadata
>>>>> at some particular level of granularity. It might as well be 'happiness'
>>>>> or 'is coloured blue'. The NBD server would (normally) have no way of
>>>>> manipulating this bitmap.
>>>>>
>>>>> In previous comments, I said 'how come we can set the dirty bit through
>>>>> writes but can't clear it?'. This (my statement) is now I think wrong,
>>>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>>>>> state of the bitmap comes from whatever sets the bitmap which is outside
>>>>> the scope of this protocol to transmit it.
>>>>>
>>>>
>>>> You know, this is a fair point. We have not (to my knowledge) yet
>>>> carefully considered the exact bitmap management scenario when NBD is
>>>> involved in retrieving dirty blocks.
>>>>
>>>> Humor me for a moment while I talk about a (completely hypothetical, not
>>>> yet fully discussed) workflow for how I envision this feature.
>>>>
>>>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
>>>> backup is made, etc.
>>>>
>>>> (2) As writes come in, QEMU's bitmap is dirtied.
>>>>
>>>> (3) The user decides they want to root around to see what data has
>>>> changed and would like to use NBD to do so, in contrast to QEMU's own
>>>> facilities for dumping dirty blocks.
>>>>
>>>> (4) A command is issued that creates a temporary, lightweight snapshot
>>>> ('fleecing') and exports this snapshot over NBD. The bitmap is
>>>> associated with the NBD export at this point at NBD server startup. (For
>>>> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
>>>>
>>>> (5) At this moment, the snapshot is static and represents the data at
>>>> the time the NBD server was started. The bitmap is also forked and
>>>> represents only this snapshot. The live data and bitmap continue to change.
>>>>
>>>> (6) Dirty blocks are queried and copied out via NBD.
>>>>
>>>> (7) The user closes the NBD instance upon completion of their task,
>>>> whatever it was. (Making a new incremental backup? Just taking a peek at
>>>> some changed data? who knows.)
>>>>
>>>> The point that's interesting here is what do we do with the two bitmaps
>>>> at this point? The data delta can be discarded (this was after all just
>>>> a lightweight read-only point-in-time snapshot) but the bitmap data
>>>> needs to be dealt with.
>>>>
>>>> (A) In the case of "User made a new incremental backup," the bitmap that
>>>> got forked off to serve the NBD read should be discarded.
>>>>
>>>> (B) In the case of "User just wanted to look around," the bitmap should
>>>> be merged back into the bitmap it was forked from.
>>>>
>>>> I don't advise a hybrid where "User copied some data, but not all" where
>>>> we need to partially clear *and* merge, but conceivably this could
>>>> happen, because the things we don't want to happen always will.
>>>>
>>>> At this point maybe it's becoming obvious that actually it would be very
>>>> prudent to allow the NBD client itself to inform QEMU via the NBD
>>>> protocol which extents/blocks/(etc) that it is "done" with.
>>>>
>>>> Maybe it *would* actually be useful if, in adding a "dirty" bit to
>>>> the NBD specification, we allow users to clear those bits.
>>>>
>>>> Then, whether the user was trying to do (A) or (B) or the unspeakable
>>>> amalgamation of both things, it's up to the user to clear the bits
>>>> desired and QEMU can do the simple task of simply always merging the
>>>> bitmap fork upon the conclusion of the NBD fleecing exercise.
>>>>
>>>> Maybe this would allow the dirty bit to have a bit more concrete meaning
>>>> for the NBD spec: "The bit stays dirty until the user clears it, and is
>>>> set when the matching block/extent/etc is written to."
>>>>
>>>> With an exception that external management may cause the bits to clear.
>>>> (I.e., someone fiddles with the backing store in a way opaque to NBD,
>>>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
>>>
>>> There is currently one possible "I've done with the entire bitmap"
>>> signal, which is closing the connection. This has two obvious
>>> problems. Firstly if used, it discards the entire bitmap (not bits).
>>> Secondly, it makes recovery from a broken TCP session difficult
>>> (as either you treat a dirty close as meaning the bitmap needs
>>> to hang around, in which case you have a garbage collection issue,
>>> or you treat it as needing to drop the bitmap, in which case you
>>> can't recover).
>>>
>>
>> In my mind, I wasn't treating closing the connection as the end of the
>> point-in-time snapshot; that would be stopping the export.
>>
>> I wouldn't advocate for a control channel (QEMU, here) clearing the
>> bitmap just because a client disappeared.
>>
>> Either:
>>
>> (A) QEMU clears the bitmap because the NBD export was *stopped*, or
>> (B) QEMU, acting as the NBD server, clears the bitmap as instructed by
>> the NBD client, if we admit a provision to clear bits from the NBD
>> protocol itself.
>>
>> I don't think there's room for the NBD server (QEMU) deciding to clear
>> bits based on connection status. It has to be an explicit decision --
>> either via NBD or QMP.
>>
>>> I think in your plan the block status doesn't change once the bitmap
>>> is forked. In that case, adding some command (optional) to change
>>> the status of the bitmap (or simply to set a given extent to status X)
>>> would be reasonable. Of course whether it's supported could be dependent
>>> on the bitmap.
>>>
>>
>> What I describe as "forking" was kind of a bad description. What really
>> happens when we have a divergence is that the bitmap with data is split
>> into two bitmaps that are related:
>>
>> - A new bitmap is created and takes over for the old bitmap. This new
>> bitmap is empty. It records writes on the live version of the data.
>> - The old bitmap as it existed remains in a read-only state, and
>> describes some point-in-time snapshot view of the data.
>>
>> In the case of an incremental backup, once we've made a backup of the
>> data, that read-only bitmap can actually be discarded without further
>> thought.
>>
>> In the case of a failed incremental backup, or in the case of "I just
>> wanted to look and see what has changed, but wasn't prepared to reset
>> the counter yet," this bitmap gets merged back with the live bitmap as
>> if nothing ever happened.
>>
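
As a rough sketch of that split/merge lifecycle (hypothetical types and
helper names for illustration only; this is not QEMU's actual
dirty-bitmap API):

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model: a 64-block "bitmap" held in one word. */
    typedef uint64_t Bitmap;

    /* At export start: freeze the current bitmap to describe the
     * snapshot, and start an empty one that records writes to the
     * live data. */
    static Bitmap export_start(Bitmap *live)
    {
        Bitmap frozen = *live;
        *live = 0;
        return frozen;
    }

    /* At export end: a completed backup discards the frozen bits; an
     * abandoned one merges them back as if nothing had happened. */
    static void export_end(Bitmap *live, Bitmap frozen, bool backed_up)
    {
        if (!backed_up) {
            *live |= frozen;  /* union of the dirty bits */
        }
    }
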
>> ANYWAY, allowing the NBD client to request bits be cleared has an
>> obvious use case even for QEMU, IMO -- which is, the NBD client itself
>> gains the ability to, without relying on a control plane to the server,
>> decide for itself if it is going to "make a backup" or "just look around."
>>
>> The client gains the ability to leave the bitmap alone (QEMU will
>> re-merge it later once the snapshot is closed) or the ability to clear
>> it ("I made my backup, we're done with this.")
>>
>> That usefulness would allow us to have an explicit dirty bit mechanism
>> directly in NBD, IMO, because:
>>
>> (1) A RW NBD server has enough information to mark a bit dirty
>> (2) Since there exists an in-spec mechanism to reset the bitmap, the
>> dirty bit is meaningful to the server and the client
> 
> While I can see that the ability to manipulate metadata might have
> advantages for certain use cases, I don't think that the ability to
> *inspect* metadata should require the ability to manipulate it in any
> way.
> 
> So I'd like to finish the block_status extension before moving on to
> manipulation :)
> 

Understood. By admitting that the manipulation of bits may have a
purpose in NBD, I was trying to pin down the exact meaning of the
dirty bit.

It had seemed to me that not specifying (or disallowing) the
manipulation of those bits from within NBD necessarily meant that their
meaning existed entirely out-of-spec for NBD, which could be a show-stopper.

So I was attempting to show that by allowing their manipulation in NBD,
they'd have full in-spec meaning. It all depends on what exactly we name
those bits and how you'd like to define their meaning. There are many
ways we can expose this information in a useful manner, so this was just
another option.

>>>> Having missed most of the discussion on v1/v2, is it a given that we
>>>> want in-band identification of bitmaps?
>>>>
>>>> I guess this might depend very heavily on the nature of the definition
>>>> of the "dirty bit" in the NBD spec.
>>>
>>> I don't think it's a given. I think Wouter & I came up with it at
>>> the same time as a way to abstract the bitmap/extent concept and
>>> remove the need to specify a dirty bit at all (well, that's my excuse
>>> anyway).
>>>
>>
>> OK. We do certainly support multiple bitmaps being active at a time in
>> QEMU, but I had personally always envisioned that you'd associate them
>> one-at-a-time when starting the NBD export of a particular device.
>>
>> I don't have a use case in my head where two distinct bitmaps being
>> exposed simultaneously offer any particular benefit, but maybe there is
>> something. I'm sure there is.
> 
> The ability to do something does not in any way imply the requirement to
> do the same :-)
> 

hence the ask. It's not something QEMU currently needs, but I am a bad
psychic.

> The idea is that the client negotiates one or more forms of metadata
> information from the server that it might be interested in, and then
> asks the server that information for a given extent where it has
> interest.
> 

via the NBD protocol, you mean?

> The protocol spec does not define what that metadata is (beyond the "is
> allocated" one that we define in the spec currently, and possibly
> something else in the future). So if qemu only cares about just one type
> of metadata, there's no reason why it should *have* to export more than
> one type.
> 
>> I will leave this aspect of it more to you NBD folks. I think QEMU could
>> cope with either.
>>
>> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
>> haven't thought through this aspect of it very much.)
>>
>>>> Anyway, I hope I am being useful and just not more confounding. It seems
>>>> to me that we're having difficulty conveying precisely what it is we're
>>>> trying to accomplish, so I hope that I am making a good effort in
>>>> elaborating on our goals/requirements.
>>>
>>> Yes absolutely. I think part of the challenge is that you are quite
>>> reasonably coming at it from the point of view of qemu's particular
>>> need, and I'm coming at it from 'what should the nbd protocol look
>>> like in general' position, having done lots of work on the protocol
>>> docs (though I'm an occasional qemu contributor). So there's necessarily
>>> a gap of approach to be bridged.
>>>
>>
>> Yeah, I understand quite well that we need to make sure the NBD spec is
>> sane and useful in a QEMU-agnostic way, so my goal here is just to help
>> elucidate our needs to enable you to reach a good consensus.
> 
> Right, that's why I was reluctant to merge the original spec as it
> stood.
> 
>>> I'm overdue on a review of Wouter's latest patch (partly because I need
>>> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
>>> but I think it's a bridge worth building.
>>>
>>
>> Same. Thank you for your patience!
> 
> I can do some updates given a few of the suggestions that were made on
> this list (no guarantee when that will happen), but if people are
> interested in reviewing things in the mean time, be my guest...
> 

I'll take a look at your revision(s), thanks.

--js


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 18:45     ` Alex Bligh
  2016-12-02 20:39       ` John Snow
@ 2016-12-08  3:39       ` Alex Bligh
  2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
  2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  1 sibling, 2 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-08  3:39 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, Vladimir Sementsov-Ogievskiy, nbd-general,
	Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini


> On 2 Dec 2016, at 18:45, Alex Bligh <alex@alex.org.uk> wrote:
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!


I finally had some time to go through this extension in detail. Rather
than comment on all the individual patches, I squashed them into a single
commit, did a 'git format-patch' on it, and commented on that.

diff --git a/doc/proto.md b/doc/proto.md
index c443494..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -871,6 +869,50 @@ of the newstyle negotiation.

    Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).

+- `NBD_OPT_META_CONTEXT` (10)
+
+    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
+    followed by an `NBD_REP_ACK`. If a server replies to such a request
+    with no error message, clients

"*the* server" / "*the* cient"

Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
client querying the server.

+    MAY send NBD_CMD_BLOCK_STATUS
+    commands during the transmission phase.

Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
transmission phase."

+
+    If the query string is syntactically invalid, the server SHOULD send
+    `NBD_REP_ERR_INVALID`.

'MUST send' else it implies sending nothing is permissible.

+    If the query string is syntactically valid
+    but finds no metadata contexts, the server MUST send a single
+    reply of type `NBD_REP_ACK`.
+
+    This option MUST NOT be requested unless structured replies have

Active voice better:

"The client MUST NOT send this option unless" ...

+    been negotiated first. If a client attempts to do so, a server
+    SHOULD send `NBD_REP_ERR_INVALID`.
+
+    Data:
+    - 32 bits, type
+    - String, query to select a subset of the available metadata
+      contexts. If this is not specified (i.e., length is 4 and no
+      command is sent), then the server MUST send all the metadata
+      contexts it knows about. If specified, this query string MUST
+      start with a name that uniquely identifies a server
+      implementation; e.g., the reference implementation that
+      accompanies this document would support query strings starting
+      with 'nbd-server:'

Why not just define the format of a metadata-context string to be
of the form '<namespace>:<name>' (perhaps this definition goes
elsewhere), and here just say the returned list is a left-match
of all the available metadata-contexts, i.e. all those metadata
contexts whose names either start with or consist entirely of the
specified string. If an empty string is specified, all metadata
contexts are returned.
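
In code, such a left-match is plain prefix matching; a minimal sketch
(illustrative only, not from any implementation; query_matches is a
made-up name):

    #include <stdbool.h>
    #include <string.h>

    /* True if the context name starts with (or equals) the query.
     * An empty query matches every available context. */
    static bool query_matches(const char *query, const char *name)
    {
        return strncmp(name, query, strlen(query)) == 0;
    }

With that, query_matches("BASE:", "BASE:allocation") and
query_matches("BASE:allocation", "BASE:allocation") both hold.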

+
+    The type may be one of:
+    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
+      selected by the query string is returned to the client without
+      changing any state (i.e., this does not add metadata contexts
+      for further usage).

Somewhere it should say this list is returned by sending
zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

+    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
+      selected by the query string is added to the list of existing
+      metadata contexts.
+    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
+      selected by the query string is removed from the list of used
+      metadata contexts. Servers SHOULD NOT reuse existing metadata
+      context IDs.
+
+    The syntax of the query string is not specified, except that
+    implementations MUST support adding and removing individual metadata
+    contexts by simply listing their names.

This seems slightly over complicated. Rather than have a list held
by the server of active metadata contexts, we could simply have
two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
NBD_OPT_SET_META_CONTEXTS (which simply sets a list). Then no
need for 'type', and _ADD_ and _DEL_.

#### Option reply types

These values are used in the "reply type" field, sent by the server
@@ -882,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
    information is available, or when sending data related to the option
    (in the case of `NBD_OPT_LIST`) has finished. No data.

...

+- `NBD_REP_META_CONTEXT` (4)
+
+    A description of a metadata context. Data:
+
+    - 32 bits, NBD metadata context ID.
+    - String, name of the metadata context. This is not required to be
+      a human-readable string, but it MUST be valid UTF-8 data.

I would suggest putting in a length of the string before the string,
which will allow us to expand this later to add fields if necessary.
This seems to be what we are doing elsewhere.
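
On the wire, the suggested layout might then look like this (a
hypothetical C rendering, keeping the usual NBD convention that all
integers are big-endian):

    #include <stdint.h>

    /* Hypothetical NBD_REP_META_CONTEXT payload with a length-prefixed
     * name. The name is UTF-8 and not NUL-terminated; later revisions
     * could append further fields after it. */
    struct nbd_rep_meta_context {
        uint32_t context_id; /* NBD metadata context ID        */
        uint32_t name_len;   /* bytes of name that follow      */
        /* followed by name_len bytes of metadata context name */
    };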

+
+    This specification declares one metadata context. It is called
+    "BASE:allocation" and contains the basic "exists at all" context.
+
There are a number of error reply types, all of which are denoted by
having bit 31 set. All error replies MAY have some data set, in which
case that data is an error message string suitable for display to the user.
@@ -938,15 +991,48 @@ case that data is an error message string suitable for display to the user.

...

+##### Metadata contexts
+

We're missing some explanation as to what these 'metadata contexts'
things actually are. I suggest:

A metadata context represents metadata concerning the selected
export in the form of a list of extents, each with a status. The
meaning of the metadata and the status is dependent upon the
context. Each metadata context has a name which is colon separated,
in the form '<namespace>:<name>'. Namespaces that start with "X-"
are vendor dependent extensions. The 'BASE' namespace represents
metadata contexts defined within this document. Other namespaces
may be registered by reference within this document.

+The "BASE:allocation" 

Backticks: `BASE:allocation`

+ metadata context is the basic "exists at all" metadata context.

Disagree. You're saying that if a server supports metadata contexts
at all, it must support this one. Why? It might just want to do the
backup thing. There's no reason to make this compulsory.

+If an extent is marked with `NBD_STATE_HOLE` at that
+context, this means that the given extent is not allocated in the
+backend storage, and that writing to the extent MAY result in the ENOSPC
+error. This supports sparse file semantics on the server side. If a
+server has only one metadata context (the default), then writing to an
+extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.

I don't understand the purpose of the last sentence here. If a server
does not support the 'BASE:allocation' metadata context then writing
to any extent can fail with ENOSPC. What's the significance of having
exactly one metadata context?

I think the last sentence is probably meant to read something like:

If a server supports the "BASE:allocation" metadata context, then
writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
with ENOSPC.

+For all other cases, this specification requires no specific semantics
+of metadata contexts. Implementations could support metadata
+contexts with semantics like the following:
+
+- Incremental snapshots; if a block is allocated in one metadata
+  context, that implies that it is also allocated in the next level up.
+- Various bits of data about the backend of the storage; e.g., if the
+  storage is written on a RAID array, a metadata context could
+  return information about the redundancy level of a given extent
+- If the backend implements a write-through cache of some sort, or
+  synchronises with other servers, a metadata context could state
+  that an extent is "active" once it has reached permanent storage
+  and/or is synchronized with other servers.
+
+The only requirement of a metadata context is that it MUST be
+representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
+
+Likewise, the syntax of query strings is not specified by this document.
+
+Server implementations SHOULD document their syntax for query strings
+and semantics for resulting metadata contexts in a document like this
+one.

This will need slight tweaking with the namespace thing. Happy to
have a go if that works.

+
### Transmission phase

#### Flag fields
@@ -983,6 +1069,9 @@ valid may depend on negotiation during the handshake phase.
   content chunk in reply.  MUST NOT be set unless the transmission
   flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
   `EOVERFLOW` error chunk, if the request length is too large.
+- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
+  set, the client is interested in only one extent per metadata
+  context.

##### Structured reply flags

@@ -1051,6 +1140,10 @@ interpret the "length" bytes of payload.
  64 bits: offset (unsigned)  
  32 bits: hole size (unsigned, MUST be nonzero)  

+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+  Defined by the experimental extension `BLOCK_STATUS`; see below.
+
All error chunk types have bit 15 set, and begin with the same
*error*, *message length*, and optional *message* fields as
`NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
@@ -1085,7 +1178,7 @@ remaining structured fields at the end.
  were sent earlier in the structured reply, the server SHOULD NOT
  send multiple distinct offsets that lie within the bounds of a
  single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.

  The payload is structured as:

@@ -1259,6 +1352,11 @@ The following request types exist:

    Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).

+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    Defined by the experimental `BLOCK_STATUS` extension; see below.

This is wrong, because the way we document extensions is that in the branch,
it's as if the relevant extension is no longer experimental. So remove
'experimental'.

+
* Other requests

    Some third-party implementations may require additional protocol
@@ -1345,6 +1443,148 @@ written as branches which can be merged into master if and
when those extensions are promoted to the normative version
of the document in the master branch.

+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often needed to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.
+
+To provide such a class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states.  This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.

'unless the client and server negotiate'

+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST not be greater than the
+    length of the original request.

I'm a bit unhappy with this. The way structured replies work, the
length of the replies is meant to add up to the length requested. I
think implementations might write a reply processor and want to assume
this. I know that the idea here is that the server can effectively
'give up' sending stuff - though with structured replies to be honest
that's less of an issue. If we really need this ability, why not
allow the server to send a final chunk that's in "don't know" state?

+ This chunk type MUST appear at most
+    once per metadata ID in a structured reply. Valid as a reply to
+    `NBD_CMD_BLOCK_STATUS`.
+
+    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
+    metadata context ID, except if the semantics of particular
+    metadata contexts mean that the information for one active metadata
+    context is implied by the information for another; e.g., if a
+    particular metadata context can only have meaning for extents where
+    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
+    context, servers MAY omit the relevant chunks for that context if
+    they already sent an extent with the `NBD_STATE_HOLE` flag set in
+    reply to the same `NBD_CMD_BLOCK_STATUS` command.
+
+    The payload starts with:
+
+        * 32 bits, metadata context ID
+
+    and is followed by a list of one or more descriptors, each with this
+    layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
+    then every reply chunk MUST NOT contain more than one descriptor.

Given you've defined length as '4+(a positive multiple of 8)'
this suggests that you mean 'exactly one' here. One of those
is wrong.
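
Putting the payload layout above into code, a client-side parse might
look like this (a sketch only; the parse_block_status helper and its
error handling are illustrative):

    #include <arpa/inet.h>  /* ntohl */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Parse one NBD_REPLY_TYPE_BLOCK_STATUS payload: a 32-bit metadata
     * context ID followed by (length, status) pairs, big-endian. */
    static int parse_block_status(const uint8_t *buf, uint32_t len)
    {
        uint32_t context_id, i;

        if (len < 12 || (len - 4) % 8 != 0) {
            return -1;  /* MUST be 4 + a positive multiple of 8 */
        }
        memcpy(&context_id, buf, 4);
        context_id = ntohl(context_id);

        for (i = 4; i < len; i += 8) {
            uint32_t extent_len, flags;

            memcpy(&extent_len, buf + i, 4);
            memcpy(&flags, buf + i + 4, 4);
            printf("context %u: %u bytes, status 0x%x\n",
                   context_id, ntohl(extent_len), ntohl(flags));
        }
        return 0;
    }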

+    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
+    its request, the server MAY return less descriptors in the reply

s/less/fewer/

+    than would be required to fully specify the whole range of requested
+    information to the client, if the number of descriptors would be
+    over 16 otherwise and looking up the information would be too
+    resource-intensive for the server.

Seems a bit odd we permit this sort of rate limiting but don't
e.g. rate-limit read.

+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range of
+    interest. Clients MUST NOT use this request unless metadata
+    contexts have been negotiated, which in turn requires the client to
+    first negotiate structured replies. For a successful return, the
+    server MUST use a structured reply, containing at least one chunk of
+    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
+    of the file starting from specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST not be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than required. However the server MUST return at
+    least one status descriptor.  The server SHOULD use different
+    *status* values between consecutive descriptors,

Why? This seems like a needless restriction.

+     and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception).

Why 512 bytes as opposed to 'minimum block size' (or is it because
that is also an experimental extension)?

+  The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk.

We should probably point out that you can return an error half way
through - that being the point of structured replies.

+  However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.

What's the point of this? This appears to say that a server can lie
and return everything as not a hole, and not zero! Surely we're
already covered from the DoS angle?

+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
+    return the status of the device,

status of the metadata context

+ where the status field of each
+    descriptor is determined by the following bits (all combinations of
+    these bits are possible):

In my mind these status bits are defined entirely by the metadata
context, and the definitions below apply only to `BASE:allocation`

+
+      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report metadata even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.

So the above two are `BASE:allocation` only, but ...

+      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status. The server MUST NOT set this bit for
+        the "BASE:allocation" context, where it has no meaning.
+      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
+        portion of the file that is "active" in the given metadata
+        context. The server MUST NOT set this bit for the
+        "BASE:allocation" context, where it has no meaning.
+
+    The exact semantics of what it means for a block to be "clean" or
+    "active" at a given metadata context is not defined by this
+    specification, except that the default in both cases should be to
+    clear the bit. That is, when the metadata context does not have
+    knowledge of the relevant status for the given extent, or when the
+    metadata context does not assign any meaning to it, the bits
+    should be cleared.

... all the above should go as these describe the QEMU incremental
backup thing, which we agreed, I think, not to describe here.

+
+    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
+    set and `NBD_STATE_ZERO` clear.

That makes no sense, as normal data has both these bits clear! This
also implies that to comply with this SHOULD, a client needs to
request block status before any read, which is ridiculous. This
should be dropped.

+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunk (such as lengths in the
+`NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).

I agree with the above, but this goes counter to the text above allowing
the server to return lengths that do not sum to the requested length.
Also the expression we use elsewhere is 'terminates transmission'
for a close during the transmission phase.

+The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
+request including one or more sectors beyond the size of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request information for only
+  one extent per metadata context.
+

Already handled above - this is a duplication

## About this file

This file tries to document the NBD protocol as it is currently


-- 
Alex Bligh


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
@ 2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
  2016-12-08 14:13           ` Alex Bligh
  2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  1 sibling, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-08  6:58 UTC (permalink / raw)
  To: Alex Bligh, John Snow
  Cc: nbd-general, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini

08.12.2016 06:39, Alex Bligh wrote:

[...]

>> are vendor dependent extensions. The 'BASE' namespace represents
>> metadata contexts defined within this document. Other namespaces
>> may be registered by reference within this document.
>>
>> +The "BASE:allocation"
>>
>> Backticks: `BASE:allocation`

An idea: let's not use uppercase. Why shout the namespace? 'base' and
'x-' would be better, I think. BASE and X- will provoke all user-defined
namespaces to be uppercase too, and a lot of uppercase will creep into
the code =(


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
  2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
@ 2016-12-08  9:44         ` Wouter Verhelst
  2016-12-08 14:40           ` Alex Bligh
  1 sibling, 1 reply; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-08  9:44 UTC (permalink / raw)
  To: Alex Bligh
  Cc: John Snow, nbd-general, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-devel, Paolo Bonzini, Pavel Borzenkov,
	Stefan Hajnoczi, Markus Pargmann, Denis V. Lunev

On Thu, Dec 08, 2016 at 03:39:19AM +0000, Alex Bligh wrote:
> 
> > On 2 Dec 2016, at 18:45, Alex Bligh <alex@alex.org.uk> wrote:
> > 
> > Thanks. That makes sense - or enough sense for me to carry on commenting!
> 
> 
> I finally had some time to go through this extension in detail. Rather
> than comment on all the individual patches, I squashed them into a single
> commit, did a 'git format-patch' on it, and commented on that.
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index c443494..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -871,6 +869,50 @@ of the newstyle negotiation.
> 
>     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 
> +- `NBD_OPT_META_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients
> 
> "*the* server" / "*the* cient"
> 
> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
> client querying the server.

I don't think that's necessarily a good idea. I think if the server
replies with NBD_REP_ERR_INVALID, that means it understands the option
but the user sent something invalid and now we haven't selected any
contexts. That means the server won't be able to provide any metadata
during transmission, either.

Perhaps it should be made clearer that once contexts have been selected
(even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.

> +    MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.
> 
> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
> the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
> transmission phase."

That would be counter to the above.

> +    If the query string is syntactically invalid, the server SHOULD send
> +    `NBD_REP_ERR_INVALID`.
> 
> 'MUST send' else it implies sending nothing is permissible.

Yes, I had already fixed that locally (but didn't push yet, because that
patch is... big, and needs some rework. I'll look at it again later,
incorporating your comments)

> +    If the query string is syntactically valid
> +    but finds no metadata contexts, the server MUST send a single
> +    reply of type `NBD_REP_ACK`.
> +
> +    This option MUST NOT be requested unless structured replies have
> 
> Active voice better:
> 
> "The client MUST NOT send this option unless" ...

Right.

> +    been negotiated first. If a client attempts to do so, a server
> +    SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +    Data:
> +    - 32 bits, type
> +    - String, query to select a subset of the available metadata
> +      contexts. If this is not specified (i.e., length is 4 and no
> +      command is sent), then the server MUST send all the metadata
> +      contexts it knows about. If specified, this query string MUST
> +      start with a name that uniquely identifies a server
> +      implementation; e.g., the reference implementation that
> +      accompanies this document would support query strings starting
> +      with 'nbd-server:'
> 
> Why not just define the format of a metadata-context string to be
> of the form '<namespace>:<name>' (perhaps this definition goes
> elsewhere), and here just say the returned list is a left-match
> of all the available metadata-contexts, i.e. all those metadata
> contexts whose names either start with or consist entirely of the
> specified string. If an empty string is specified, all metadata
> contexts are returned.

I also want to make it possible for an implementation to define its own
syntax. Say, a "last-snapshot" thing, or something that says
"diff:snapshot1-snapshot2", or whatever.

> +    The type may be one of:
> +    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
> +      selected by the query string is returned to the client without
> +      changing any state (i.e., this does not add metadata contexts
> +      for further usage).
> 
> Somewhere it should say this list is returned by sending
> zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

We do that above already.

> +    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> +      selected by the query string is added to the list of existing
> +      metadata contexts.
> +    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
> +      selected by the query string is removed from the list of used
> +      metadata contexts. Servers SHOULD NOT reuse existing metadata
> +      context IDs.
> +
> +    The syntax of the query string is not specified, except that
> +    implementations MUST support adding and removing individual metadata
> +    contexts by simply listing their names.
> 
> This seems slightly over complicated. Rather than have a list held
> by the server of active metadata contexts, we could simply have
> two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
> NBD_OPT_SET_META_CONTEXTS (which simply sets a list). Then no
> need for 'type', and _ADD_ and _DEL_.

Hrm. Probably better, yes.

> #### Option reply types
> 
> These values are used in the "reply type" field, sent by the server
> @@ -882,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
>     information is available, or when sending data related to the option
>     (in the case of `NBD_OPT_LIST`) has finished. No data.
> 
> ....
> 
> +- `NBD_REP_META_CONTEXT` (4)
> +
> +    A description of a metadata context. Data:
> +
> +    - 32 bits, NBD metadata context ID.
> +    - String, name of the metadata context. This is not required to be
> +      a human-readable string, but it MUST be valid UTF-8 data.
> 
> I would suggest putting in a length of the string before the string,
> which will allow us to expand this later to add fields if necessary.
> This seems to be what we are doing elsewhere.

True enough.

> +    This specification declares one metadata context. It is called
> +    "BASE:allocation" and contains the basic "exists at all" context.
> +
> There are a number of error reply types, all of which are denoted by
> having bit 31 set. All error replies MAY have some data set, in which
> case that data is an error message string suitable for display to the user.
> @@ -938,15 +991,48 @@ case that data is an error message string suitable for display to the user.
> 
> ....
> 
> +##### Metadata contexts
> +
> 
> We're missing some explanation as to what these 'metadata contexts'
> things actually are. I suggest:
> 
> A metadata context represents metadata concerning the selected
> export in the form of a list of extents, each with a status. The
> meaning of the metadata and the status is dependent upon the
> context. Each metadata context has a name which is colon separated,
> in the form '<namespace>:<name>'. Namespaces that start with "X-"
> are vendor dependent extensions.

No, I wouldn't do that, since by definition, every namespace is
vendor-dependent.

Maybe we could ask that people who want to implement their own metadata
context type (rather than be compatible with an existing one) should
register their namespace somewhere, though.

> The 'BASE' namespace represents metadata contexts defined within this
> document. Other namespaces may be registered by reference within this
> document.
> 
> +The "BASE:allocation" 
> 
> Backticks: `BASE:allocation`

Right.

> + metadata context is the basic "exists at all" metadata context.
> 
> Disagree. You're saying that if a server supports metadata contexts
> at all, it must support this one.

No, I'm trying to say that this metadata context exposes whether the
*block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
probably clarify that wording then, if you misunderstood it in that way.

No server is REQUIRED to implement it (although we might want to make it a SHOULD)

> Why? It might just want to do the backup thing. There's no reason to
> make this compulsory.
> 
> +If an extent is marked with `NBD_STATE_HOLE` at that
> +context, this means that the given extent is not allocated in the
> +backend storage, and that writing to the extent MAY result in the ENOSPC
> +error. This supports sparse file semantics on the server side. If a
> +server has only one metadata context (the default), then writing to an
> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> I don't understand the purpose of the last sentence here. If a server
> does not support the 'BASE:allocation' metadata context then writing
> to any extent can fail with ENOSPC. What's the significance of having
> exactly one metadata context?

Yes, Vladimir also made that comment, and I've been modifying the text a
bit to that effect.

> I think the last sentence is probably meant to read something like:
> 
> If a server supports the "BASE:allocation" metadata context, then
> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
> with ENOSPC.

No, it can't.

Other metadata contexts may change the semantics to the extent that even
if the block is allocated at the BASE:allocation context, it would still
need space on another plane. In that case, writing to an area which
is not STATE_HOLE might still fail.

> +For all other cases, this specification requires no specific semantics
> +of metadata contexts. Implementations could support metadata
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one metadata
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, a metadata context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, a metadata context could state
> +  that an extent is "active" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of a metadata context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting metadata contexts in a document like this
> +one.
> 
> This will need slight tweaking with the namespace thing. Happy to
> have a go if that works.

Sure.

> ### Transmission phase
> 
> #### Flag fields
> @@ -983,6 +1069,9 @@ valid may depend on negotiation during the handshake phase.
>    content chunk in reply.  MUST NOT be set unless the transmission
>    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>    `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per metadata
> +  context.
> 
> ##### Structured reply flags
> 
> @@ -1051,6 +1140,10 @@ interpret the "length" bytes of payload.
>   64 bits: offset (unsigned)  
>   32 bits: hole size (unsigned, MUST be nonzero)  
> 
> +- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> +
> +  Defined by the experimental extension `BLOCK_STATUS`; see below.
> +
> All error chunk types have bit 15 set, and begin with the same
> *error*, *message length*, and optional *message* fields as
> `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
> @@ -1085,7 +1178,7 @@ remaining structured fields at the end.
>   were sent earlier in the structured reply, the server SHOULD NOT
>   send multiple distinct offsets that lie within the bounds of a
>   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> -  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
> +  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
> 
>   The payload is structured as:
> 
> @@ -1259,6 +1352,11 @@ The following request types exist:
> 
>     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).
> 
> +* `NBD_CMD_BLOCK_STATUS` (7)
> +
> +    Defined by the experimental `BLOCK_STATUS` extension; see below.
> 
> This is wrong, because the way we document extensions is that in the branch,
> it's as if the relevant extension is no longer experimental. So remove
> 'experimental'.
> 
> +
> * Other requests
> 
>     Some third-party implementations may require additional protocol
> @@ -1345,6 +1443,148 @@ written as branches which can be merged into master if and
> when those extensions are promoted to the normative version
> of the document in the master branch.
> 
> +### `BLOCK_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to
> +query the status of a particular range and read only those blocks of
> +data that are actually present on the block device.
> +
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> +
> +To provide such a class of information, the `BLOCK_STATUS` extension
> +adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
> +ranges with their respective states.  This extension is not available
> +unless the client also negotiates the `STRUCTURED_REPLY` extension.
> 
> 'unless the client and server negotiate'
> 
> +
> +* `NBD_REPLY_TYPE_BLOCK_STATUS`
> +
> +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
> +    represents a series of consecutive block descriptors where the sum
> +    of the lengths of the descriptors MUST not be greater than the
> +    length of the original request.
> 
> I'm a bit unhappy with this. The way structured replies work, the
> length of the replies is meant to add up to the length requested. I
> think implementations might write a reply processor and want to assume
> this. I know that the idea here is that the server can effectively
> 'give up' sending stuff - though with structured replies to be honest
> that's less of an issue. If we really need this ability, why not
> allow the server to send a final chunk that's in "don't know" state?

Not a bad idea.

> + This chunk type MUST appear at most
> +    once per metadata ID in a structured reply. Valid as a reply to
> +    `NBD_CMD_BLOCK_STATUS`.
> +
> +    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> +    metadata context ID, except if the semantics of particular
> +    metadata contexts mean that the information for one active metadata
> +    context is implied by the information for another; e.g., if a
> +    particular metadata context can only have meaning for extents where
> +    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
> +    context, servers MAY omit the relevant chunks for that context if
> +    they already sent an extent with the `NBD_STATE_HOLE` flag set in
> +    reply to the same `NBD_CMD_BLOCK_STATUS` command.
> +
> +    The payload starts with:
> +
> +        * 32 bits, metadata context ID
> +
> +    and is followed by a list of one or more descriptors, each with this
> +    layout:
> +
> +        * 32 bits, length (unsigned, MUST NOT be zero)
> +        * 32 bits, status flags
> +
> +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
> +    then every reply chunk MUST NOT contain more than one descriptor.
> 
> Given you've defined length as '4+(a positive multiple of 8)'
> this suggests that you mean 'exactly one' here. One of those
> is wrong.

That would make it clearer, indeed (although "1" is "not more than one",
and "8" is also "a positive multiple of 8", so it's not *wrong*, per se,
it's just confusing)

> +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> +    its request, the server MAY return less descriptors in the reply
> 
> s/less/fewer/
> 
> +    than would be required to fully specify the whole range of requested
> +    information to the client, if the number of descriptors would be
> +    over 16 otherwise and looking up the information would be too
> +    resource-intensive for the server.
> 
> Seems a bit odd we permit this sort of rate limiting but don't
> e.g. rate-limit read.

Yes, I suppose you're right.

> +* `NBD_CMD_BLOCK_STATUS`
> +
> +    A block status query request. Length and offset define the range of
> +    interest. Clients MUST NOT use this request unless metadata
> +    contexts have been negotiated, which in turn requires the client to
> +    first negotiate structured replies. For a successful return, the
> +    server MUST use a structured reply, containing at least one chunk of
> +    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file starting from specified *offset*, and the sum of the
> +    *length* fields of each descriptor MUST not be greater than the
> +    overall *length* of the request. This means that the server MAY
> +    return less data than required. However the server MUST return at
> +    least one status descriptor.  The server SHOULD use different
> +    *status* values between consecutive descriptors,
> 
> Why? This seems like a needless restriction.
> 
> +     and SHOULD use
> +    descriptor lengths that are an integer multiple of 512 bytes where
> +    possible (the first and last descriptor of an unaligned query being
> +    the most obvious places for an exception).
> 
> Why 512 bytes as opposed to 'minimum block size' (or is it because
> that is also an experimental extension)?

Yes, and this extension does not depend on that one; hence it is
also a SHOULD and not a MUST.

> +  The status flags are
> +    intentionally defined so that a server MAY always safely report a
> +    status of 0 for any block, although the server SHOULD return
> +    additional status values when they can be easily detected.
> +
> +    If an error occurs, the server SHOULD set the appropriate error
> +    code in the error field of either a simple reply or an error
> +    chunk.
> 
> We should probably point out that you can return an error half way
> through - that being the point of structured replies.

Right.

(also, I think it MUST NOT send a simple reply, but I'll fix that up
separately)

> +  However, if the error does not involve invalid usage (such
> +    as a request beyond the bounds of the file), a server MAY reply
> +    with a single block status descriptor with *length* matching the
> +    requested length, and *status* of 0 rather than reporting the
> +    error.
> 
> What's the point of this? This appears to say that a server can lie
> and return everything as not a hole, and not zero! Surely we're
> already covered from the DoS angle?

I'm not sure; I believe that wording came from the original patch on
which I based my work.

> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
> +    return the status of the device,
> 
> status of the metadata context

No, status of the device. A metadata context *describes* status; it
*isn't* one.

Perhaps "status of the device as per the given metadata context", but
hey.

> + where the status field of each
> +    descriptor is determined by the following bits (all combinations of
> +    these bits are possible):
> 
> In my mind these status bits are defined entirely by the metadata
> context, and the definitions below apply only to `BASE:allocation`

Yes, Vladimir made a similar observation, and my WIP patch has that too.

> +
> +      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
> +        (and future writes to that area may cause fragmentation or
> +        encounter an `ENOSPC` error); if clear, the block is allocated
> +        or the server could not otherwise determine its status.  Note
> +        that the use of `NBD_CMD_TRIM` is related to this status, but
> +        that the server MAY report a hole even where trim has not been
> +        requested, and also that a server MAY report metadata even
> +        where a trim has been requested.
> +      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
> +        all zeroes; if clear, the block contents are not known.  Note
> +        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
> +        status, but that the server MAY report zeroes even where write
> +        zeroes has not been requested, and also that a server MAY
> +        report unknown content even where write zeroes has been
> +        requested.
> 
> So the above two are `BASE:allocation` only, but ...
> 
> +      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
> +        portion of the file that is still clean because it has not
> +        been written; if clear, the block represents a portion of the
> +        file that is dirty, or where the server could not otherwise
> +        determine its status. The server MUST NOT set this bit for
> +        the "BASE:allocation" context, where it has no meaning.
> +      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
> +        portion of the file that is "active" in the given metadata
> +        context. The server MUST NOT set this bit for the
> +        "BASE:allocation" context, where it has no meaning.
> +
> +    The exact semantics of what it means for a block to be "clean" or
> +    "active" at a given metadata context is not defined by this
> +    specification, except that the default in both cases should be to
> +    clear the bit. That is, when the metadata context does not have
> +    knowledge of the relevant status for the given extent, or when the
> +    metadata context does not assign any meaning to it, the bits
> +    should be cleared.
> 
> .... all the above should go as these describe the QEMU incremental
> backup thing, which we agreed, I think, not to describe here.

Right.

> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
> +    set and `NBD_STATE_ZERO` clear.
> 
> That makes no sense, as normal data has both these bits clear! This
> also implies that to comply with this SHOULD, a client needs to
> request block status before any read, which is ridiculous. This
> should be dropped.

No, it should not, although it may need rewording. It clarifies that
having STATE_HOLE set (i.e., there's no valid data in the given range)
and STATE_ZERO clear (i.e., we don't assert that it would read as
all-zeroes) is not an invalid thing for a server to set. The spec here
clarifies what a client should do with that information if it gets it
(i.e., "don't read it, it doesn't contain anything interesting").

> +A client MAY close the connection if it detects that the server has
> +sent an invalid chunk (such as lengths in the
> +`NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
> 
> I agree with the above, but this goes counter to the text above allowing
> the server to return lengths that do not sum to the requested length.
> Also the expression we use elsewhere is 'terminates transmission'
> for a close during the transmission phase.

Yes, indeed. I think I've been convinced that allowing the server to
send less data than was requested is indeed a bad idea (except when the
REQ_ONE bit is set), so I'll drop that, too.

> +The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
> +request including one or more sectors beyond the size of the device.
> +
> +The extension adds the following new command flag:
> +
> +- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request information for only
> +  one extent per metadata context.
> +
> 
> Already handled above - this is a duplication

Yes.
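
Server side, the two quoted requirements reduce to something like the
sketch below; the handler shape and the flag's bit position are
illustrative only, not taken from any implementation:

    #include <errno.h>
    #include <limits.h>
    #include <stdint.h>

    #define NBD_CMD_FLAG_REQ_ONE (1 << 3)   /* illustrative bit position */

    /* Sketch: fail requests that run past the export with EINVAL, and
     * honour REQ_ONE by emitting at most one extent per context. */
    static int block_status_precheck(uint64_t offset, uint32_t length,
                                     uint16_t flags, uint64_t export_size,
                                     unsigned *max_extents)
    {
        if (offset + length < offset || offset + length > export_size)
            return EINVAL;
        *max_extents = (flags & NBD_CMD_FLAG_REQ_ONE) ? 1 : UINT_MAX;
        return 0;
    }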

My WIP patch moves this out from the (older) "BLOCK_STATUS extension"
section and into the main body of the spec. It also makes a few changes
in wording as per what Vladimir suggested, and I was working on an
NBD_OPT_LIST_META_CONTEXT rather than an NBD_OPT_META_CONTEXT
negotiation option, with the idea that I'd add an OPT_ADD_META_CONTEXT
and an OPT_DEL_META_CONTEXT later. Your idea of using a SET has merit
though, so I'll update it to that effect.

It already removes the two bits that BASE:allocation doesn't use, and
makes a few other changes as well. I haven't had the time to finish it
and send it out for review though, but I'll definitely include your
comments now.

Regards,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
@ 2016-12-08 14:13           ` Alex Bligh
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-08 14:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, John Snow, nbd-general, Kevin Wolf,
	Stefan Hajnoczi, qemu-devel, Markus Pargmann,
	Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst, Paolo Bonzini


> On 8 Dec 2016, at 06:58, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> An idea: let's not use uppercase. Why shout the namespace? 'base' and 'x-' would be better, I think. BASE and X- will provoke all user-defined namespaces to be uppercase too, and a lot of uppercase will creep into the code =(

I agree

-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-12-08 14:40           ` Alex Bligh
  2016-12-08 15:59             ` Eric Blake
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Bligh @ 2016-12-08 14:40 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Alex Bligh, John Snow, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, qemu-devel, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Markus Pargmann,
	Denis V. Lunev

Wouter,

>> +- `NBD_OPT_META_CONTEXT` (10)
>> +
>> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
>> +    with no error message, clients
>> 
>> "*the* server" / "*the* cient"
>> 
>> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
>> client querying the server.
> 
> I don't think that's necessarily a good idea. I think if the server
> replies with NBD_REP_ERR_INVALID, that means it understands the option
> but the user sent something invalid and now we haven't selected any
> contexts. That means the server won't be able to provide any metadata
> during transmission, either.
> 
> Perhaps it should be made clearer that once contexts have been selected
> (even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.
> 
>> +    MAY send NBD_CMD_BLOCK_STATUS
>> +    commands during the transmission phase.
>> 
>> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
>> the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
>> transmission phase."
> 
> That would run counter to the above.

My main point is that the current text does not prohibit sending
NBD_CMD_BLOCK_STATUS if it hasn't been established that the server
supports it.
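
In code terms, the gating being argued for is roughly this sketch
(names are hypothetical; the assumption is one NBD_REP_META_CONTEXT
reply per selected context, followed by NBD_REP_ACK):

    #include <stdbool.h>

    /* Sketch: a client treats NBD_CMD_BLOCK_STATUS as usable only once
     * negotiation has actually selected at least one context. */
    struct nbd_client_state {
        unsigned n_meta_contexts;    /* NBD_REP_META_CONTEXT replies seen */
        bool block_status_allowed;
    };

    static void meta_context_done(struct nbd_client_state *c, bool acked)
    {
        /* NBD_REP_ERR_UNSUP (or zero contexts selected) means the client
         * must not send NBD_CMD_BLOCK_STATUS during transmission. */
        c->block_status_allowed = acked && c->n_meta_contexts > 0;
    }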

>> Why not just define the format of a metadata-context string to be
>> of the form '<namespace>:<name>' (perhaps this definition goes
>> elsewhere), and here just say the returned list is a left-match
>> of all the available metadata-contexts, i.e. all those metadata
>> contexts whose names either start or consist entirely of the
>> specified string. If an empty string is specified, all metadata
>> contexts are returned.
> 
> I also want to make it possible for an implementation to define its own
> syntax. Say, a "last-snapshot" thing, or something that says
> "diff:snapshot1-snapshot2", or whatever.

That's a good idea, but doesn't preclude using a colon as a namespace
separator.
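
The left-match rule is a one-liner; a sketch, assuming the
'<namespace>:<name>' form:

    #include <stdbool.h>
    #include <string.h>

    /* Sketch: does context name 'ctx' (e.g. "base:allocation") match
     * client query 'query'?  An empty query matches every context;
     * otherwise the query must be a prefix of, or equal to, the name. */
    static bool context_matches(const char *ctx, const char *query)
    {
        return strncmp(ctx, query, strlen(query)) == 0;
    }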

>> +    The type may be one of:
>> +    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
>> +      selected by the query string is returned to the client without
>> +      changing any state (i.e., this does not add metadata contexts
>> +      for further usage).
>> 
>> Somewhere it should say this list is returned by sending
>> zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.
> 
> We do that above already.

Must have missed it.

>> A metadata context represents metadata concerning the selected
>> export in the form of a list of extents, each with a status. The
>> meaning of the metadata and the status is dependent upon the
>> context. Each metadata context has a name which is colon separated,
>> in the form '<namespace>:<name>'. Namespaces that start with "X-"
>> are vendor dependent extensions.
> 
> No, I wouldn't do that, since by definition, every namespace is
> vendor-dependent.
> 
> Maybe we could ask that people who want to implement their own metadata
> context type (rather than be compatible with an existing one) should
> register their namespace somewhere, though.

What I'm trying to say is that there should be three types of namespace:

* "BASE" (or better "base" as Vladimir points out) which is defined within
  the document, i.e. it will say "base:allocated" does X, "base:foo" does
  Y.
* Registered, e.g. "qemu", where the document would say: "The following
  is a list of registered namespaces: ... qemu: to qemu.org" or whatever.
* Unregistered, e.g. "X-Alex-Experiment" where the document merely mentions
  that namespaces beginning with "X-" can be used by anyone, like X- headers
  in SMTP and HTTP.

The purpose of distinguishing between registered and unregistered is
that otherwise we might get two vendors with a product called "fastnbd"
(or whatever) who both pick the same namespace.

I suppose an alternative would be to go the Java-ish way and suggest
people use a domain name (so 'qemu' would be 'qemu.org:whatever').

>> + metadata context is the basic "exists at all" metadata context.
>> 
>> Disagree. You're saying that if a server supports metadata contexts
>> at all, it must support this one.
> 
> No, I'm trying to say that this metadata context exposes whether the
> *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
> probably clarify that wording then, if you misunderstood it in that way.

Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage
space'. Or 'is not a hole'?

> No server MUST implement it (although we might want to make it a SHOULD)

I don't see why it should even be a 'SHOULD', to be honest. An NBD
server cooked up for a specific purpose, or with a backend that
can't provide that information (or where there is never a hole),
shouldn't be criticised.

>> I think the last sentence is probably meant to read something like:
>> 
>> If a server supports the "BASE:allocation" metadata context, then
>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>> with ENOSPC.
> 
> No, it can't.
> 
> Other metadata contexts may change the semantics to the extent that if
> the block is allocated at the BASE:allocation context, it would still
> need space on another plane. In that case, writing to an area which
> is not STATE_HOLE might still fail.

I understand the point (though I'm slightly dubious about it given
the discussion we had with the WRITE_ZEROES extension, where I
thought we agreed that the purpose of actually writing zeroes to
disk was to avoid ENOSPC errors). However, if that is the case, I'm
not sure what you're trying to say in the original text.

>> Why 512 bytes as opposed to 'minimum block size' (or is it because
>> that is also an experimental extension)?
> 
> Yes, and this extension does not depend on that one. Hence why it is
> also a SHOULD and not a MUST.

OK. As a separate discussion I think we should talk about promoting
WRITE_ZEROES and INFO. The former has a reference implementation,
I think Eric did a qemu implementation, and I did a gonbdserver
implementation. The latter I believe lacks a reference implementation.

>> +  The status flags are
>> +    intentionally defined so that a server MAY always safely report a
>> +    status of 0 for any block, although the server SHOULD return
>> +    additional status values when they can be easily detected.
>> +
>> +    If an error occurs, the server SHOULD set the appropriate error
>> +    code in the error field of either a simple reply or an error
>> +    chunk.
>> 
>> We should probably point out that you can return an error half way
>> through - that being the point of structured replies.
> 
> Right.
> 
> (also, I think it MUST NOT send a simple reply, but I'll fix that up
> separately)

A simple error reply would normally be permitted.
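
A server-side sketch of the mid-reply error case; the chunk-emitting
helpers and the status lookup are hypothetical, declared here only so
the sketch is self-contained:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct nbd_conn;   /* opaque connection handle (hypothetical) */
    int  lookup_status(uint64_t off, uint32_t max,
                       uint32_t *len, uint32_t *flags);
    void send_status_chunk(struct nbd_conn *c, uint64_t handle,
                           uint32_t len, uint32_t flags, bool final);
    void send_error_chunk(struct nbd_conn *c, uint64_t handle,
                          int err, bool final);

    /* Sketch: with structured replies, a server that fails part-way
     * through can send the chunks it did compute, then finish the
     * reply with an error chunk instead of the remaining data. */
    static void reply_block_status(struct nbd_conn *conn, uint64_t handle,
                                   uint64_t offset, uint32_t length)
    {
        uint32_t done = 0;
        while (done < length) {
            uint32_t ext_len, ext_flags;
            if (lookup_status(offset + done, length - done,
                              &ext_len, &ext_flags) < 0 || ext_len == 0) {
                send_error_chunk(conn, handle, EIO, true /* final */);
                return;
            }
            done += ext_len;
            send_status_chunk(conn, handle, ext_len, ext_flags,
                              done == length /* final */);
        }
    }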

>> +  However, if the error does not involve invalid usage (such
>> +    as a request beyond the bounds of the file), a server MAY reply
>> +    with a single block status descriptor with *length* matching the
>> +    requested length, and *status* of 0 rather than reporting the
>> +    error.
>> 
>> What's the point of this? This appears to say that a server can lie
>> and return everything as not a hole, and not zero! Surely we're
>> already covered from the DoS angle?
> 
> I'm not sure, I believe that wording came from the original patch on
> which I based my work.

Sure, I think it's left over detritus.

>> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
>> +    return the status of the device,
>> 
>> status of the metadata context
> 
> No, status of the device. A metadata context *describes* status, it
> *isn't* one.
> 
> Perhaps "status of the device as per the given metadata context", but
> hey.

It's actually 'device' I'm arguing with. Server side we refer to
them as 'exports'. Often they aren't files at all; in gonbdserver,
for instance, it might be a Ceph connection.

'return the status of the relevant portion of the export (as per the
given metadata context)'?

>> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>> +    set and `NBD_STATE_ZERO` clear.
>> 
>> That makes no sense, as normal data has both these bits clear! This
>> also implies that to comply with this SHOULD, a client needs to
>> request block status before any read, which is ridiculous. This
>> should be dropped.
> 
> No, it should not, although it may need rewording. It clarifies that
> having STATE_HOLE set (i.e., there's no valid data in the given range)
> and STATE_ZERO clear (i.e., we don't assert that it would read as
> all-zeroes) is not an invalid thing for a server to set. The spec here
> clarifies what a client should do with that information if it gets it
> (i.e., "don't read it, it doesn't contain anything interesting").

That's fair enough until the last bit in brackets. Rather than saying
a client SHOULD NOT read it, it should simply say that a read on
such areas will succeed but the data read is undefined (and may
not be stable).

> My WIP patch moves this out from the (older) "BLOCK_STATUS extension"
> section and into the main body of the spec. It also makes a few changes
> in wording as per what Vladimir suggested, and I was working on an
> NBD_OPT_LIST_META_CONTEXT rather than an NBD_OPT_META_CONTEXT
> negotiation option, with the idea that I'd add an OPT_ADD_META_CONTEXT
> and an OPT_DEL_META_CONTEXT later. Your idea of using a SET has merit
> though, so I'll update it to that effect.
> 
> It already removes the two bits that BASE:allocation doesn't use, and
> makes a few other changes as well. I haven't had the time to finish it
> and send it out for review though, but I'll definitely include your
> comments now.

Thanks.

-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08 14:40           ` Alex Bligh
@ 2016-12-08 15:59             ` Eric Blake
  2016-12-08 16:03               ` Alex Bligh
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Blake @ 2016-12-08 15:59 UTC (permalink / raw)
  To: Alex Bligh, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	Stefan Hajnoczi, qemu-devel, Pavel Borzenkov,
	Denis V. Lunev, Markus Pargmann, Paolo Bonzini, John Snow


On 12/08/2016 08:40 AM, Alex Bligh wrote:

>>> + metadata context is the basic "exists at all" metadata context.
>>>
>>> Disagree. You're saying that if a server supports metadata contexts
>>> at all, it must support this one.
>>
>> No, I'm trying to say that this metadata context exposes whether the
>> *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
>> probably clarify that wording then, if you misunderstood it in that way.
> 
> Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage
> space'. Or 'is not a hole'?
> 
>> No server MUST implement it (although we might want to make it a SHOULD)
> 
> Don't see why it should even be a 'SHOULD' to be honest. An nbd
> server cooked up for a specific purpose, or with a backend that
> can't provide that (or where there is never a hole) shouldn't
> be criticised.
> 
>>> I think the last sentence is probably meant to read something like:
>>>
>>> If a server supports the "BASE:allocation" metadata context, then
>>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>>> with ENOSPC.
>>
>> No, it can't.
>>
>> Other metadata contexts may change the semantics to the extent that if
>> the block is allocated at the BASE:allocation context, it would still
>> need space on another plane. In that case, writing to an area which
>> is not STATE_HOLE might still fail.

Not just that, but it is ALWAYS permissible to report NBD_STATE_HOLE as
clear (just not always optimal) - to allow servers that can't determine
sparseness information, but DO know how to communicate extents to the
client.  Yes, it is boring to communicate a single extent saying that
the entire file is data, and clients can't optimize their usage in that
case, but it should always be considered semantically correct to do so,
since the presence and knowledge of holes is merely an optimization
opportunity, not a data correctness issue.
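
Put as code: a server with no sparseness information can always fall
back to one conservative extent (a sketch, reusing the hypothetical
send_status_chunk helper from the sketch earlier in the thread):

    /* Sketch: report the whole requested range as a single extent with
     * status 0 (no HOLE, no ZERO) -- always correct, never optimal. */
    static void reply_all_data(struct nbd_conn *conn, uint64_t handle,
                               uint32_t length)
    {
        send_status_chunk(conn, handle, length, 0, true /* final */);
    }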


>>> Why 512 bytes as opposed to 'minimum block size' (or is it because
>>> that is also an experimental extension)?
>>
>> Yes, and this extension does not depend on that one. Hence why it is
>> also a SHOULD and not a MUST.
> 
> OK. As a separate discussion I think we should talk about promoting
> WRITE_ZEROES and INFO. The former has a reference implementation,
> I think Eric did a qemu implementation, and I did a gonbdserver
> implementation. The latter I believe lacks a reference implementation.

Yes, I still have reference qemu patches for INFO; they did not make it
into qemu 2.8 (while WRITE_ZEROES did), but should make it into 2.9.

I also hope to get structured reads into qemu 2.9, but that's a bigger
task, as I don't have reference patches complete yet.  On the other
hand, since BLOCK_STATUS depends on structured reply, I have all the
more reason to complete it soon.

>>> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>>> +    set and `NBD_STATE_ZERO` clear.
>>>
>>> That makes no sense, as normal data has both these bits clear! This
>>> also implies that to comply with this SHOULD, a client needs to
>>> request block status before any read, which is ridiculous. This
>>> should be dropped.
>>
>> No, it should not, although it may need rewording. It clarifies that
>> having STATE_HOLE set (i.e., there's no valid data in the given range)
>> and STATE_ZERO clear (i.e., we don't assert that it would read as
>> all-zeroes) is not an invalid thing for a server to set. The spec here
>> clarifies what a client should do with that information if it gets it
>> (i.e., "don't read it, it doesn't contain anything interesting").
> 
> That's fair enough until the last bit in brackets. Rather than saying
> a client SHOULD NOT read it, it should simply say that a read on
> such areas will succeed but the data read is undefined (and may
> not be stable).

We should use similar wording to whatever we already say about what a
client would see when reading data cleared by NBD_CMD_TRIM.  After all,
the status of STATE_HOLE set and STATE_ZERO clear is what you logically
get when TRIM cannot guarantee reads-as-zero.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08 15:59             ` Eric Blake
@ 2016-12-08 16:03               ` Alex Bligh
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-08 16:03 UTC (permalink / raw)
  To: Eric Blake
  Cc: Alex Bligh, Wouter Verhelst, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	qemu-devel, Pavel Borzenkov, Denis V. Lunev, Markus Pargmann,
	Paolo Bonzini, John Snow



> On 8 Dec 2016, at 15:59, Eric Blake <eblake@redhat.com> wrote:
> 
> We should use similar wording to whatever we already say about what a
> client would see when reading data cleared by NBD_CMD_TRIM.  After all,
> the status of STATE_HOLE set and STATE_ZERO clear is what you logically
> get when TRIM cannot guarantee reads-as-zero.

Yes. It was actually exactly that discussion I was trying to remember.

--
Alex Bligh



end of thread

Thread overview: 37+ messages
2016-11-25 11:28 [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension Vladimir Sementsov-Ogievskiy
2016-11-25 14:02 ` Stefan Hajnoczi
2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-11-28 11:19   ` Stefan Hajnoczi
2016-11-28 17:33     ` Wouter Verhelst
2016-11-29  9:17       ` Stefan Hajnoczi
2016-11-29 10:50       ` Wouter Verhelst
2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
2016-11-29 13:08           ` Wouter Verhelst
2016-11-29 13:07         ` Alex Bligh
2016-12-01 10:14         ` Wouter Verhelst
2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
2016-12-02  9:25             ` Wouter Verhelst
2016-11-28 23:15   ` John Snow
2016-11-29 10:18   ` Kevin Wolf
2016-11-29 11:34     ` Vladimir Sementsov-Ogievskiy
2016-11-30 10:41   ` Sergey Talantov
2016-11-29 12:57 ` [Qemu-devel] " Alex Bligh
2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
2016-11-29 14:52     ` Alex Bligh
2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
2016-11-29 15:17         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-12-01 23:42   ` [Qemu-devel] " John Snow
2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
2016-12-02 18:45     ` Alex Bligh
2016-12-02 20:39       ` John Snow
2016-12-03 11:08         ` Alex Bligh
2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-12-06 16:39           ` John Snow
2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
2016-12-08 14:13           ` Alex Bligh
2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-12-08 14:40           ` Alex Bligh
2016-12-08 15:59             ` Eric Blake
2016-12-08 16:03               ` Alex Bligh
