* [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Vladimir Sementsov-Ogievskiy @ 2016-11-25 11:28 UTC (permalink / raw)
  To: nbd-general, qemu-devel
  Cc: kwolf, pbonzini, pborzenkov, stefanha, den, w, eblake, alex,
	vsementsov, mpa

With the availability of sparse storage formats, it is often needed
to query the status of a particular range and read only those blocks
of data that are actually present on the block device.

To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.

There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors).  Future NBD extensions may add commands
to control dirtiness through NBD.

Since NBD protocol has no notion of block size, and to mimic SCSI
"GET LBA STATUS" command more closely, it has been chosen to return
a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
instead of a bitmap.

CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

v3:

Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
removing the restriction that the sum of status descriptor lengths must be
equal to the requested length; i.e., the server may now reply with less
data than requested if it wants.

Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as bit 8 is
now NBD_FLAG_CAN_MULTI_CONN in the master branch.

And, finally, I've rebased this onto the current state of the
extension-structured-reply branch (which itself should be rebased on
master, IMHO).

With this resend I just want to continue the discussion started about half
a year ago. Here is a summary of some questions and ideas from the v2
discussion:

1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
   A: This all is for read-only disks, so the data is static and unchangeable.

2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
   A: 1: the server replies with status descriptors of any size; the
         granularity is hidden from the client
      2: dirty/allocated requests are separate and unrelated to each
         other, so their granularities do not intersect

3. Q: Selecting the dirty bitmap to export
   A: several variants:
      1: the bitmap id is in the flags field of the request
          pros: - simple
          cons: - it's a hack; the flags field is meant for other uses
                - we would have to map bitmap names to these "ids"
      2: introduce extended NBD requests with variable length and exploit this
         feature for the BLOCK_STATUS command, specifying a bitmap identifier.
         pros: - looks like the proper way
         cons: - we have to create an additional extension
               - possibly we have to create a map,
                 {<QEMU bitmap name> <=> <NBD bitmap id>}
      3: an external tool selects which bitmap to export. So, in the case of
         QEMU it would be something like a QMP command block-export-dirty-bitmap.
         pros: - simple
               - we can extend it to behave like (2) later
         cons: - an additional QMP command to implement (possibly the lesser evil)
         note: Hmm, the external tool can choose between allocated/dirty data
               too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag altogether.

4. Q: Shouldn't get_{allocated,dirty} be separate commands?
   cons: two commands with almost the same semantics and similar mechanics.
   pros: there is a good point in separating the clearly defined,
         block-device-native GET_BLOCK_STATUS from user-driven and
         effectively undefined data called 'dirtiness'.

5. The number of status descriptors sent by the server should be restricted.
   variants:
   1: just allow the server to restrict this as it wants (which was done in v3)
   2: (not excluding 1) the client somehow specifies the maximum number
      of descriptors.
      2.1: add a command flag that requests only one descriptor
           (otherwise, no restriction from the client)
      2.2: again, introduce extended NBD requests, and add a field to
           specify this maximum

6. Q: What to do with unspecified flags (in request/reply)?
   I think the normal variant is to make them reserved. (The server should
   return EINVAL if it finds unknown bits; the client should consider a
   reply with unknown bits an error.)
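
   A minimal sketch of the server side of that rule (the flag name is from
   this proposal; its bit position and the helper are my assumptions):

       #include <errno.h>
       #include <stdint.h>

       #define NBD_FLAG_STATUS_DIRTY (1 << 2)   /* assumed bit position */

       /* Treat all undefined request flag bits as reserved and reject a
        * request that carries any of them with EINVAL. */
       static int check_block_status_flags(uint16_t flags)
       {
           const uint16_t known = NBD_FLAG_STATUS_DIRTY;

           return (flags & ~known) ? -EINVAL : 0;
       }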

======

Also, an idea on 2-4:

    Since we say that dirtiness is unknown to NBD, and that an external
    tool should specify, manage and understand which data is actually
    transmitted, why not just call it user_data and leave the status field
    of the reply chunk unspecified in this case?

    So, I propose one flag for NBD_CMD_BLOCK_STATUS:
    NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined by
    Eric's 'Block provisioning status' paragraph.  If it is set, we just
    leave the status field to some external... protocol? Who knows what
    this user data is.

    Note: I'm not sure that I like this (my own) proposal. It's just an
    idea; maybe someone will like it.  And, I think, it represents what we
    are trying to do more honestly.

    Note 2: the next step of generalization would be NBD_CMD_USER, with
    variable request size, structured reply and no definition :)


Another idea, about backups themselves:

    Why do we need allocated/zero status for backup? IMHO we don't.

    Full backup: just do a structured read: it will show us which chunks
    may be treated as zeroes.

    Incremental backup: get the dirty bitmap (somehow, for example through
    the user-defined part of the proposed command), then, for dirty blocks,
    read them through structured read, so the information about
    zero/unallocated areas comes along with it.
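
    A rough client-side sketch of that incremental flow (the nbd_* helpers
    are assumed wrappers around NBD_CMD_BLOCK_STATUS in dirty mode and
    structured-reply NBD_CMD_READ; they are not defined by this proposal):

        /* Walk the export, copying only extents whose NBD_STATE_CLEAN
         * bit is clear (i.e. dirty); the structured reads then deliver
         * the zero/hole information for those extents for free. */
        for (uint64_t off = 0; off < export_size; ) {
            uint32_t len, status;

            nbd_block_status_one(off, export_size - off, &len, &status);
            if (!(status & NBD_STATE_CLEAN))
                nbd_structured_read(off, len);
            off += len;
        }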

For me all the variants above are OK. Let's finally choose something.

v2:
v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html

Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather large rebase; I've also renamed the command
to 'NBD_CMD_BLOCK_STATUS', changed the request modes to be
determined by boolean flags (rather than by fixed values of the
16-bit flags field), changed the reply status fields to be
bitwise-or values (with a default of 0 always being sane), and
changed the descriptor layout to drop an offset but to include
a 32-bit status so that the descriptor is nicely 8-byte aligned
without padding.
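
(For reference, a minimal C sketch of that descriptor layout; the struct
and field names are mine, not part of the proposal:)

    #include <stdint.h>

    /* One block status descriptor as carried in an
     * NBD_REPLY_TYPE_BLOCK_STATUS chunk: two 32-bit fields sent in
     * network byte order, so each descriptor is 8 bytes and a chunk
     * payload is simply an array of these. */
    struct nbd_block_descriptor {
        uint32_t length;  /* bytes covered by this extent; never zero */
        uint32_t status;  /* bitwise-or of NBD_STATE_* flags, or 0 */
    };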

 doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index 1c2fa5b..253b6f1 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -755,6 +755,8 @@ The field has the following format:
   MUST leave this flag clear if structured replies have not been
   negotiated. Clients MUST NOT set the `NBD_CMD_FLAG_DF` request
   flag unless this transmission flag is set.
+- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
+  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -1036,6 +1038,10 @@ interpret the "length" bytes of payload.
   64 bits: offset (unsigned)  
   32 bits: hole size (unsigned, MUST be nonzero)  
 
+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+  Defined by the experimental extension `BLOCK_STATUS`; see below.
+
 All error chunk types have bit 15 set, and begin with the same
 *error*, *message length*, and optional *message* fields as
 `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
@@ -1070,7 +1076,7 @@ remaining structured fields at the end.
   were sent earlier in the structured reply, the server SHOULD NOT
   send multiple distinct offsets that lie within the bounds of a
   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
 
   The payload is structured as:
 
@@ -1247,6 +1253,11 @@ The following request types exist:
 
     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md).
 
+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    Defined by the experimental `BLOCK_STATUS` extension; see below.
+
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -1331,6 +1342,148 @@ written as branches which can be merged into master if and
 when those extensions are promoted to the normative version
 of the document in the master branch.
 
+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often needed to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup, or any other operation
+with a concept of data dirtiness, they all share a need for a list
+of the ranges that the particular operation treats as dirty.
+
+To provide such class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states.  This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.
+
+* `NBD_FLAG_SEND_BLOCK_STATUS`
+
+    The server SHOULD set this transmission flag to 1 if structured
+    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
+    request is supported.
+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be a positive integer multiple of 8.  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST NOT be greater than the
+    length of the original request.  This chunk type MUST appear at most
+    once in a structured reply. Valid as a reply to
+    `NBD_CMD_BLOCK_STATUS`.
+
+    The payload is structured as a list of one or more descriptors,
+    each with this layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    The definition of the status flags is determined based on the
+    flags present in the original request.
+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range
+    of interest. Clients SHOULD NOT use this request unless the server
+    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
+    in turn requires the client to first negotiate structured replies.
+    For a successful return, the server MUST use a structured reply,
+    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represents consecutive portions
+    of the file starting from the specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST NOT be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than requested. However the server MUST return at
+    least one status descriptor.  The server SHOULD use different
+    *status* values between consecutive descriptors, and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception). The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk.  However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.
+
+    The type of information requested by the client is determined by
+    the request flags, as follows:
+
+    1. Block provisioning status
+
+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
+    provisioning status of the device, where the status field of each
+    descriptor is determined by the following bits (all four
+    combinations of these two bits are possible):
+
+      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report allocation even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1); if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.
+
+    The client SHOULD NOT read from an area that has both
+    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
+
+    2. Block dirtiness status
+
+    This command is meant to operate in tandem with other (non-NBD)
+    channels to the server.  Generally, a "dirty" block is a block
+    that has been written to by someone, but the exact meaning of "has
+    been written" is left to the implementation.  For example, a
+    virtual machine monitor could provide a (non-NBD) command to start
+    tracking blocks written by the virtual machine.  A backup client
+    can then connect to an NBD server provided by the virtual machine
+    monitor and use `NBD_CMD_BLOCK_STATUS` with the
+    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
+    blocks that the virtual machine has changed.
+
+    An implementation that doesn't track the "dirtiness" state of
+    blocks MUST either fail this command with `EINVAL`, or mark all
+    blocks as dirty in the descriptor that it returns.  Upon receiving
+    an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
+    status of the device, where the status field of each descriptor is
+    determined by the following bit:
+
+      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status.
+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunk (such as descriptor lengths in the
+`NBD_REPLY_TYPE_BLOCK_STATUS` chunk summing to more than the requested
+length).
+The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
+request including one or more sectors beyond the size of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request dirtiness status
+  rather than provisioning status.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently
-- 
1.8.3.1


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Stefan Hajnoczi @ 2016-11-25 14:02 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, qemu-devel, kwolf, pbonzini, pborzenkov, den, w,
	eblake, alex, mpa


On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> [...]

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>



* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Wouter Verhelst @ 2016-11-27 19:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, qemu-devel, kwolf, den, pborzenkov, stefanha, mpa, pbonzini

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right.
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> [...]
>
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of status descriptor lengths must be
> equal to the requested length; i.e., the server may now reply with less
> data than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as bit 8 is
> now NBD_FLAG_CAN_MULTI_CONN in the master branch.

Right.

> And, finally, I've rebased this onto the current state of the
> extension-structured-reply branch (which itself should be rebased on
> master, IMHO).

Probably a good idea, given the above.

> With this resend I just want to continue the discussion started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>    A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: the server replies with status descriptors of any size; the
>          granularity is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities do not intersect

Not entirely sure anymore what this is about?

> 3. Q: Selecting the dirty bitmap to export
>    A: several variants:
>       1: the bitmap id is in the flags field of the request
>           pros: - simple
>           cons: - it's a hack; the flags field is meant for other uses
>                 - we would have to map bitmap names to these "ids"
>       2: introduce extended NBD requests with variable length and exploit this
>          feature for the BLOCK_STATUS command, specifying a bitmap identifier.
>          pros: - looks like the proper way
>          cons: - we have to create an additional extension
>                - possibly we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: an external tool selects which bitmap to export. So, in the case of
>          QEMU it would be something like a QMP command block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - an additional QMP command to implement (possibly the lesser evil)
>          note: Hmm, the external tool can choose between allocated/dirty data
>                too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag altogether.

Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on
the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means
*by default*, but then we can add a note that clients and servers can
possibly define other meanings out of band if they want to.

> 4. Q: Shouldn't get_{allocated,dirty} be separate commands?
>    cons: two commands with almost the same semantics and similar mechanics.
>    pros: there is a good point in separating the clearly defined,
>          block-device-native GET_BLOCK_STATUS from user-driven and
>          effectively undefined data called 'dirtiness'.

Yeah, having them separate commands might be a bad idea indeed.

> 5. The number of status descriptors sent by the server should be restricted.
>    variants:
>    1: just allow the server to restrict this as it wants (which was done in v3)
>    2: (not excluding 1) the client somehow specifies the maximum number
>       of descriptors.
>       2.1: add a command flag that requests only one descriptor
>            (otherwise, no restriction from the client)
>       2.2: again, introduce extended NBD requests, and add a field to
>            specify this maximum

I think having a flag which requests just one descriptor can be useful,
but I'm hesitant to add it unless it's actually going to be used; so in
other words, I'll leave the decision on that bit to you.

> 6. Q: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (The server should
>    return EINVAL if it finds unknown bits; the client should consider a
>    reply with unknown bits an error.)

Right, probably best to do that, yes.

> ======
> 
> Also, an idea on 2-4:
> 
>     Since we say that dirtiness is unknown to NBD, and that an external
>     tool should specify, manage and understand which data is actually
>     transmitted, why not just call it user_data and leave the status field
>     of the reply chunk unspecified in this case?
> 
>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>     NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined by
>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>     leave the status field to some external... protocol? Who knows what
>     this user data is.

Yes, this sounds like a reasonable approach.

>     Note: I'm not sure that I like this (my own) proposal. It's just an
>     idea; maybe someone will like it.  And, I think, it represents what we
>     are trying to do more honestly.

Indeed.

>     Note 2: the next step of generalization would be NBD_CMD_USER, with
>     variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
> 
>     Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what
it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the
"dirty" thing seems to want to be. Whether something is allocated is
just a special case of that.

Actually, come to think of that. What is the exact use case for this
thing? I understand you're trying to create incremental backups of
things, which would imply you don't write from the client that is
getting the block status thingies, right? If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the
definition of what a snapshot is back into the NBD protocol, without
having to say "this could be anything", and without requiring
connectivity over two ports for this to be useful (e.g., you could store
the machine-readable form of the snapshot description into your backup
program and match what they mean with what you're interested in at
restore time, etc).

This wouldn't work if you're interested in new snapshots that get
created once we've already moved into transmission, but hey.

Thoughts?
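
(To make the shape of that idea concrete, a purely speculative C sketch;
neither the option nor this layout exists in the protocol today:)

    /* Hypothetical per-snapshot record that an NBD_OPT_GET_SNAPSHOTS
     * reply could carry during negotiation. */
    struct nbd_snapshot_info {
        uint32_t nbd_id;       /* NBD-local id, usable in later commands */
        uint32_t foreign_len;  /* length of the foreign identifier below */
        /* ...followed by foreign_len bytes of implementation-defined
         * ("foreign") identifier, e.g. a qemu JSON description. */
    };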

>     Full backup: just do a structured read: it will show us which chunks
>     may be treated as zeroes.

Right.

[...]

I'll commit this in a minute into a separate branch called
"extension-blockstatus", under the understanding that changes are still
required, as per above (i.e., don't assume that just because there's a
branch I'm happy with the current result ;-)

Regards

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Stefan Hajnoczi @ 2016-11-28 11:19 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Vladimir Sementsov-Ogievskiy, nbd-general, qemu-devel, kwolf,
	den, pborzenkov, mpa, pbonzini


On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right.
>
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
> 
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.

The out-of-band operations you are referring to are for dirty bitmap
management.  (The goal is to read out blocks that changed since the last
backup.)

The client does not access the live disk, instead it accesses a
read-only snapshot and the dirty information (so that it can copy out
only blocks that were written).  The client is allowed to read blocks
that are not dirty too.

If you want to implement the whole incremental backup workflow in NBD
then the client would first have to connect to the live disk, set up
dirty tracking, create a snapshot export, and then connect to that
snapshot.

That sounds like a big feature set and I'd argue it's for the control
plane (storage API) and not the data plane (NBD).  There were
discussions about transferring the dirty information via the control
plane, but it seems more appropriate to do it in the data plane since it is
block-level information.

I'm arguing that the NBD protocol doesn't need to support the
incremental backup workflow since it's a complex control plane concept.

Being able to read dirty information via NBD is useful for other block
backup applications, not just QEMU.  It could be used for syncing LVM
volumes across machines, for example, if someone implements an NBD+LVM
server.
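
(As a sketch, the wire request such a backup client would send; the
command value is from the proposal, the header packing is standard NBD,
and the dirty flag's bit position is an assumption:)

    #include <stdint.h>

    /* Standard 28-byte NBD request header (all fields big-endian on the
     * wire), here asking for the dirtiness status of
     * [offset, offset + length). */
    struct nbd_request {
        uint32_t magic;   /* NBD_REQUEST_MAGIC, 0x25609513 */
        uint16_t flags;   /* NBD_FLAG_STATUS_DIRTY (assumed bit) */
        uint16_t type;    /* NBD_CMD_BLOCK_STATUS (7) */
        uint64_t handle;  /* echoed back in the reply chunks */
        uint64_t offset;
        uint32_t length;  /* range of interest */
    } __attribute__((packed));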

Another issue with adding control plane operations is that you need to
begin considering privilege separation.  Should all NBD clients be able
to initiate snapshots, dirty tracking, etc or is some kind of access
control required to limit certain commands?  Not all clients require the
same privileges and so they shouldn't have access to the same set of
operations.

Stefan



* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: Wouter Verhelst @ 2016-11-28 17:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, pbonzini, mpa, den

Hi Stefan,

On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote:
> On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > Quickly: the reason I haven't merged this yet is twofold:
> > - I wasn't thrilled with the proposal at the time. It felt a bit
> >   hackish, and bolted onto NBD so you could use it, but without defining
> >   everything in the NBD protocol. "We're reading some data, but it's not
> >   about you". That didn't feel right.
> >
> > - There were a number of questions still unanswered (you're answering a
> >   few below, so that's good).
> > 
> > For clarity, I have no objection whatsoever to adding more commands if
> > they're useful, but I would prefer that they're also useful with NBD on
> > its own, i.e., without requiring an initiation or correlation of some
> > state through another protocol or network connection or whatever. If
> > that's needed, that feels like I didn't do my job properly, if you get
> > my point.
> 
> The out-of-band operations you are referring to are for dirty bitmap
> management.  (The goal is to read out blocks that changed since the last
> backup.)
> 
> The client does not access the live disk, instead it accesses a
> read-only snapshot and the dirty information (so that it can copy out
> only blocks that were written).  The client is allowed to read blocks
> that are not dirty too.

I understood as much, yes.

> If you want to implement the whole incremental backup workflow in NBD
> then the client would first have to connect to the live disk, set up
> dirty tracking, create a snapshot export, and then connect to that
> snapshot.
> 
> That sounds like a big feature set and I'd argue it's for the control
> plane (storage API) and not the data plane (NBD).  There were
> discussions about transferring the dirty information via the control
> plane but it seems more appropriate to it in the data plane since it is
> block-level information.

I agree that creating and managing snapshots is out of scope for NBD. The
protocol is not set up for that.

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

My proposal also makes it impossible to get updates of newly created
snapshots without disconnecting and reconnecting (due to the fact that
you can't go from transmission back to negotiation), but I'm not sure
that's a problem.

Doing so has two advantages:
- If a client is accidentally (due to misconfiguration or implementation
  bugs or whatnot) connecting to the wrong server after having created a
  snapshot through a management protocol, we have an opportunity to
  detect this error, due to the fact that the "foreign" identifiers
  passed to the client during negotiation will not match with what the
  client was expecting.
- A future version of the protocol could possibly include an extended
  version of the read command, allowing a client to read information
  from multiple storage snapshots without requiring a reconnect, and
  allowing current clients information about allocation status across
  various snapshots (although a first implementation could very well
  limit itself to only having one snapshot).

[...]
> I'm arguing that the NBD protocol doesn't need to support the
> incremental backup workflow since it's a complex control plane concept.
> 
> Being able to read dirty information via NBD is useful for other block
> backup applications, not just QEMU.  It could be used for syncing LVM
> volumes across machines, for example, if someone implements an NBD+LVM
> server.

Indeed, and I was considering adding a basic implementation to go with
the copy-on-write support in stock nbd-server, too.

> Another issue with adding control plane operations is that you need to
> begin considering privilege separation.  Should all NBD clients be able
> to initiate snapshots, dirty tracking, etc or is some kind of access
> control required to limit certain commands?  Not all clients require the
> same privileges and so they shouldn't have access to the same set of
> operations.

Sure, which is why I wasn't suggesting anything of the sorts :-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
From: John Snow @ 2016-11-28 23:15 UTC (permalink / raw)
  To: Wouter Verhelst, Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, stefanha, pbonzini, mpa, den

Hi Wouter,

Some of this mess may be partially my fault, but I have not been 
following the NBD extension proposals up until this point.

Are you familiar with the genesis behind this idea and what we are 
trying to accomplish in general?

We had the thought to propose an extension roughly similar to SCSI's 
'get lba status', but the existing status bits there did not correlate 
semantically to what we are hoping to convey. There was some debate over 
whether it would be an abuse of the protocol to attempt to use them as such.

For a quick recap, get lba status appears to offer three canonical statuses:

0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"

My interpretation of mapped was simply that it was physically allocated, 
and 'deallocated' was simply unallocated.

(I uh, am not actually clear on what anchored means exactly.)

Either way, we felt at the time that it would be wrong to propose an 
analogue command for NBD and then immediately abuse the existing 
semantics, hence a new command like -- but not identical to -- the SCSI one.
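
(The same three values as a C enum, simply restating the recap above;
this is SCSI, not NBD, content:)

    /* Provisioning states reported by SCSI GET LBA STATUS. */
    enum scsi_lba_status {
        SCSI_LBA_MAPPED      = 0x0,  /* mapped, or unknown state */
        SCSI_LBA_DEALLOCATED = 0x1,
        SCSI_LBA_ANCHORED    = 0x2,
    };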


On 11/27/2016 02:17 PM, Wouter Verhelst wrote:
> Hi Vladimir,
>
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right.
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
>
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.
>
> On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
>>    A: This all is for read-only disks, so the data is static and unchangeable.
>
> I think we should declare that it's up to the client to ensure no other
> writes happen without its knowledge. This may be because the client and
> server communicate out of band about state changes, or because the
> client somehow knows that it's the only writer, or whatever.
>
> We can easily do that by declaring that the result of that command only
> talks about *current* state, and that concurrent writes by different
> clients may invalidate the state. This is true for NBD in general (i.e.,
> concurrent read or write commands from other clients may confuse file
> systems on top of NBD), so it doesn't change expectations in any way.
>

Agree with you here. "This was correct when I sent it, but it's up to 
you to ensure that nothing would have invalidated that in the meantime" 
is fine semantically to me.

Of course in our implementation, we intend to only export essentially 
read-only snapshots of data, so we don't personally expect to run into 
any of these kinds of semantic problems.

>> 2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: the server replies with status descriptors of any size; the
>>          granularity is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities do not intersect
>
> Not entirely sure anymore what this is about?
>

We have a concept of granularity for dirty tracking; consider it a block 
size.

I don't think it's necessarily relevant to NBD, except for the case of 
false positives. Consider the case of a 512 byte block that is assumed 
dirty simply because it's adjacent to actually dirty data.

That may have some meaning for how we write the NBD spec; i.e. the 
meaning of the status bit changes from "This is definitely modified" to 
"This was possibly modified," due to limitations in the management of 
this information by the server.

If we expose the granularity information via NBD, it at least makes it 
clear how fuzzy the results presented may be. Otherwise, it's not really 
required.

[Again, as seen below, a user-defined bit would be plenty sufficient...!]
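
(A small sketch of where those false positives come from; the granularity
value and the helper are illustrative only:)

    #include <stdint.h>

    #define GRANULARITY 65536  /* illustrative tracking block size */

    /* A write of [off, off + len), len > 0, marks whole tracking blocks
     * dirty, so up to GRANULARITY - 1 untouched bytes on either side may
     * later be reported as dirty. */
    static void mark_dirty(uint64_t off, uint64_t len, uint8_t *bitmap)
    {
        uint64_t first = off / GRANULARITY;
        uint64_t last  = (off + len - 1) / GRANULARITY;

        for (uint64_t b = first; b <= last; b++)
            bitmap[b / 8] |= 1u << (b % 8);
    }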

>> 3. Q: Selecting the dirty bitmap to export
>>    A: several variants:
>>       1: the bitmap id is in the flags field of the request
>>           pros: - simple
>>           cons: - it's a hack; the flags field is meant for other uses
>>                 - we would have to map bitmap names to these "ids"
>>       2: introduce extended NBD requests with variable length and exploit this
>>          feature for the BLOCK_STATUS command, specifying a bitmap identifier.
>>          pros: - looks like the proper way
>>          cons: - we have to create an additional extension
>>                - possibly we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: an external tool selects which bitmap to export. So, in the case of
>>          QEMU it would be something like a QMP command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - an additional QMP command to implement (possibly the lesser evil)
>>          note: Hmm, the external tool can choose between allocated/dirty data
>>                too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag altogether.
>
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).
>
> To avoid this, we should have a clear definition of what the reply means
> *by default*, but then we can add a note that clients and servers can
> possibly define other meanings out of band if they want to.
>

I had personally only ever considered #3, where a command would be 
issued to QEMU to begin offering NBD data for some point in time as 
associated with a particular reference/snapshot/backup/etc. This leaves 
it up to the out-of-band client to order up the right data.

That does appear to make choosing a meaningful name for the status bits 
a bit more difficult...

[...but reading ahead, a 'user defined' bit would fit the bill just fine.]

When Vladimir authored a "persistence" feature for bitmaps to be stored 
alongside QCOW2 files, we had difficulty describing exactly what a dirty 
bit meant for the data -- ultimately it is reliant on external 
information. The meaning is simply but ambiguously, "The data associated 
with this bit has been changed since the last time the bit was reset."

We don't record or stipulate the conditions for a reset.

(for our purposes, a reset would occur during a full or incremental backup.)

>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined and native
>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>          undefined data, called 'dirtyness'.
>
> Yeah, having them separate commands might be a bad idea indeed.
>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
>
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.
>
>> 6. Q: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (The server should
>>    return EINVAL if it finds unknown bits; the client should consider a reply
>>    with unknown bits an error.)
>
> Right, probably best to do that, yes.
>
>> ======
>>
>> Also, an idea on 2-4:
>>
>>     As we say that dirtiness is unknown to NBD, and an external tool
>>     should specify, manage and understand which data is actually
>>     transmitted, why not just call it user_data and leave the status field
>>     of the reply chunk unspecified in this case?
>>
>>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>>     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>>     leave the status field to some external... protocol? Who knows what
>>     this user data is.
>
> Yes, this sounds like a reasonable approach.
>

I'd be pretty happy (personally) with some user-defined bits. It keeps a 
lot of the ambiguity about exactly what we're trying to convey out of 
the NBD spec, which is nice.

>>     Note: I'm not sure that I like this (my) proposal. It's just an
>>     idea; maybe someone will like it.  And, I think, it represents what we
>>     are trying to do more honestly.
>
> Indeed.
>
>>     Note2: the next step of generalization will be NBD_CMD_USER, with
>>     variable request size, structured reply and no definition :)
>
> Well, er, no please, if we can avoid it :-)
>
>> Another idea, about backups themselves:
>>
>>     Why do we need allocated/zero status for backup? IMHO we don't.
>
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)
>
> I can understand a "has this changed since time X" request, which the
> "dirty" thing seems to want to be. Whether something is allocated is
> just a special case of that.
>
> Actually, come to think of that. What is the exact use case for this
> thing? I understand you're trying to create incremental backups of
> things, which would imply you don't write from the client that is
> getting the block status thingies, right? If so, how about:
>

Essentially you can create a bitmap object in-memory in QEMU and then 
associate it with a drive. It records changes to the drive, and QEMU can 
be instructed to write out any changes since the last backup to disk, 
while clearing the bits of the bitmap.

It doesn't record a specific point in time; the point is just implicitly 
the last time you cleared the bitmap -- usually the last incremental or 
full backup you've made.

So it does describe "blocks changed since time X"; it's just that we 
don't really know exactly when time X was.
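
As a very rough illustration of those semantics (not QEMU's actual code; 
granularity, sizes and names here are invented):

    #include <stdint.h>

    #define GRANULARITY 65536ULL      /* bytes tracked per bit (assumed) */
    #define NBITS       4096ULL       /* disk size / GRANULARITY */

    static uint8_t dirty[NBITS / 8];  /* the bitmap itself */

    /* A guest write dirties every granule it touches (length > 0). */
    static void mark_dirty(uint64_t offset, uint64_t length)
    {
        for (uint64_t b = offset / GRANULARITY;
             b <= (offset + length - 1) / GRANULARITY; b++) {
            dirty[b / 8] |= 1u << (b % 8);
        }
    }

    /* An incremental backup copies only dirty granules, clearing bits
     * as it goes; "time X" is implicitly the previous such pass. */
    static void incremental_backup(void (*copy_out)(uint64_t off,
                                                    uint64_t len))
    {
        for (uint64_t b = 0; b < NBITS; b++) {
            if (dirty[b / 8] & (1u << (b % 8))) {
                copy_out(b * GRANULARITY, GRANULARITY);
                dirty[b / 8] &= ~(1u << (b % 8));
            }
        }
    }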

This is all well and dandy, except there is some desire from third 
parties to be able to ask QEMU about this dirty block information -- to 
be able to see this bitmap, more or less. We already use NBD for 
exporting data, and instead of inventing a new side-band, we decided 
that we wanted to use NBD to let external users get this information.

What exactly they do with that info is beyond QEMU's scope.

> - NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
>   snapshots. Not required, optional, includes a machine-readable form,
>   not defined by NBD, which explains what the snapshot is about (e.g., a
>   qemu json file). The "base" version of that is just "allocation
>   status", and is implied (i.e., you don't need to run
>   NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
>   allocation status).

We don't necessarily have /snapshots/ per se, but we do have some block 
device and one or more bitmaps describing deltas to one or more backups 
that we do not necessarily have access to.

e.g. a given bitmap may describe a delta to an off-site backup. We know 
the delta, but do not maintain any meaningful handle to the given 
off-site backup, including name, URI, or date.

QEMU leaves this association up to upper management, and provides only 
IDs to facilitate any additional correlation.

We could offer up descriptions of these bitmaps in response to such a 
command, but they're not ... quite snapshots. It is more the case that 
we use them to create snapshots.

> - NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
>   which tell you what the status of a block of data is for each of the
>   relevant snapshots that we know about.
>
> Perhaps this is somewhat overengineered, but it does bring most of the
> definition of what a snapshot is back into the NBD protocol, without
> having to say "this could be anything", and without requiring
> connectivity over two ports for this to be useful (e.g., you could store
> the machine-readable form of the snapshot description into your backup
> program and match what they mean with what you're interested in at
> restore time, etc).
>
> This wouldn't work if you're interested in new snapshots that get
> created once we've already moved into transmission, but hey.
>
> Thoughts?
>
>>     Full backup: just do a structured read - it will show us which chunks
>>     may be treated as zeroes.
>
> Right.
>
>>     Incremental backup: get the dirty bitmap (somehow, for example through
>>     the user-defined part of the proposed command), then, for dirty blocks,
>>     read them through structured read, so information about zero/unallocated
>>     areas is included.
>>
>> For me all the variants above are OK. Let's finally choose something.
>>
>> v2:
>> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
>>
>> Since then, we've added the STRUCTURED_REPLY extension, which
>> necessitates a rather larger rebase; I've also changed things
>> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
>> modes to be determined by boolean flags (rather than by fixed
>> values of the 16-bit flags field), changed the reply status fields
>> to be bitwise-or values (with a default of 0 always being sane),
>> and changed the descriptor layout to drop an offset but to include
>> a 32-bit status so that the descriptor is nicely 8-byte aligned
>> without padding.
>>
>>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 154 insertions(+), 1 deletion(-)
>
> [...]
>
> I'll commit this in a minute into a separate branch called
> "extension-blockstatus", under the understanding that changes are still
> required, as per above (i.e., don't assume that just because there's a
> branch I'm happy with the current result ;-)
>
> Regards
>

Err, I hope I haven't confused everything to heck and back now....

Thanks,
--js

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-28 17:33     ` Wouter Verhelst
@ 2016-11-29  9:17       ` Stefan Hajnoczi
  2016-11-29 10:50       ` Wouter Verhelst
  1 sibling, 0 replies; 37+ messages in thread
From: Stefan Hajnoczi @ 2016-11-29  9:17 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, pbonzini, mpa, den

[-- Attachment #1: Type: text/plain, Size: 4281 bytes --]

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> Hi Stefan,
> 
> On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote:
> > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > > Quickly: the reason I haven't merged this yet is twofold:
> > > - I wasn't thrilled with the proposal at the time. It felt a bit
> > >   hackish, and bolted onto NBD so you could use it, but without defining
> > >   everything in the NBD protocol. "We're reading some data, but it's not
> > >   about you". That didn't feel right
> > >
> > > - There were a number of questions still unanswered (you're answering a
> > >   few below, so that's good).
> > > 
> > > For clarity, I have no objection whatsoever to adding more commands if
> > > they're useful, but I would prefer that they're also useful with NBD on
> > > its own, i.e., without requiring an initiation or correlation of some
> > > state through another protocol or network connection or whatever. If
> > > that's needed, that feels like I didn't do my job properly, if you get
> > > my point.
> > 
> > The out-of-band operations you are referring to are for dirty bitmap
> > management.  (The goal is to read out blocks that changed since the last
> > backup.)
> > 
> > The client does not access the live disk, instead it accesses a
> > read-only snapshot and the dirty information (so that it can copy out
> > only blocks that were written).  The client is allowed to read blocks
> > that are not dirty too.
> 
> I understood as much, yes.
> 
> > If you want to implement the whole incremental backup workflow in NBD
> > then the client would first have to connect to the live disk, set up
> > dirty tracking, create a snapshot export, and then connect to that
> > snapshot.
> > 
> > That sounds like a big feature set and I'd argue it's for the control
> > plane (storage API) and not the data plane (NBD).  There were
> > discussions about transferring the dirty information via the control
> > plane, but it seems more appropriate to do it in the data plane since it is
> > block-level information.
> 
> I agree that creating and managing snapshots is out of scope for NBD. The
> protocol is not set up for that.
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> My proposal also makes it impossible to get updates of newly created
> snapshots without disconnecting and reconnecting (due to the fact that
> you can't go from transmission back to negotiation), but I'm not sure
> that's a problem.
> 
> Doing so has two advantages:
> - If a client is accidentally (due to misconfiguration or implementation
>   bugs or whatnot) connecting to the wrong server after having created a
>   snapshot through a management protocol, we have an opportunity to
>   detect this error, due to the fact that the "foreign" identifiers
>   passed to the client during negotiation will not match with what the
>   client was expecting.
> - A future version of the protocol could possibly include an extended
>   version of the read command, allowing a client to read information
>   from multiple storage snapshots without requiring a reconnect, and
>   giving current clients information about allocation status across
>   various snapshots (although a first implementation could very well
>   limit itself to only having one snapshot).

Sorry, I misunderstood you.

Snapshots are not very different from NBD exports, especially if the
storage system supports writable snapshots (aka cloning).  Should we
just use named exports?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-11-28 11:19   ` Stefan Hajnoczi
  2016-11-28 23:15   ` John Snow
@ 2016-11-29 10:18   ` Kevin Wolf
  2016-11-29 11:34     ` Vladimir Sementsov-Ogievskiy
  2016-11-30 10:41   ` Sergey Talantov
  3 siblings, 1 reply; 37+ messages in thread
From: Kevin Wolf @ 2016-11-29 10:18 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Vladimir Sementsov-Ogievskiy, nbd-general, qemu-devel, den,
	pborzenkov, stefanha, mpa, pbonzini

On 27.11.2016 at 20:17, Wouter Verhelst wrote:
> > 3. Q: selecting of dirty bitmap to export
> >    A: several variants:
> >       1: id of bitmap is in flags field of request
> >           pros: - simple
> >           cons: - it's a hack. flags field is for other uses.
> >                 - we'll have to map bitmap names to these "ids"
> >       2: introduce extended nbd requests with variable length and exploit this
> >          feature for BLOCK_STATUS command, specifying bitmap identifier.
> >          pros: - looks like a true way
> >          cons: - we have to create additional extension
> >                - possibly we have to create a map,
> >                  {<QEMU bitmap name> <=> <NBD bitmap id>}
> >       3: external tool should select which bitmap to export. So, in the case of Qemu
> >          it should be something like qmp command block-export-dirty-bitmap.
> >          pros: - simple
> >                - we can extend it to behave like (2) later
> >          cons: - additional qmp command to implement (possibly, the lesser evil)
> >          note: Hmm, the external tool can choose between allocated/dirty data too,
> >                so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
> 
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).

Another point to consider is that option 3 doesn't allow you to access
two (or more) different bitmaps from the client without using the side
channel all the time to switch back and forth between them and having to
drain the request queue each time to avoid races.

In general, if we have something that "looks like the true way", I'd
advocate choosing this option. Experience tells us that we'd regret
anything simpler in a year or two, and then we'll have to do the real
thing anyway, but still need to support the quick hack for
compatibility.

> > 5. The number of status descriptors sent by the server should be restricted
> >    variants:
> >    1: just allow server to restrict this as it wants (which was done in v3)
> >    2: (not excluding 1). Client specifies somehow the maximum for number
> >       of descriptors.
> >       2.1: add command flag, which will request only one descriptor
> >            (otherwise, no restrictions from the client)
> >       2.2: again, introduce extended nbd requests, and add field to
> >            specify this maximum
> 
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.

The native qemu block layer interface returns the status of only one
contiguous chunk, so the easiest way to implement the NBD block driver
in qemu would be to always use that bit.

On the other hand, it would be possible for the NBD block driver in qemu
to cache the rest of the response internally and answer the next request
from the cache instead of sending a new request over the network. Maybe
that's what it should be doing anyway for good performance.
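
Something like this on the client side, perhaps (untested sketch; the
32-bit length plus 32-bit status descriptor layout is the one from this
proposal, everything else is invented):

    #include <stdbool.h>
    #include <stdint.h>

    struct extent { uint32_t length; uint32_t status; };

    /* Descriptors left over from the last NBD_CMD_BLOCK_STATUS reply;
     * filled in whenever a reply with multiple descriptors arrives. */
    static struct extent cache[16];
    static unsigned cache_n;       /* valid entries in cache[] */
    static uint64_t cache_start;   /* offset described by cache[0] */

    /* qemu's block layer asks about one contiguous chunk at a time;
     * try to answer from the tail of the previous reply first. */
    static bool cached_block_status(uint64_t offset, struct extent *out)
    {
        uint64_t pos = cache_start;

        for (unsigned i = 0; i < cache_n; i++) {
            if (offset >= pos && offset < pos + cache[i].length) {
                out->length = (uint32_t)(pos + cache[i].length - offset);
                out->status = cache[i].status;
                return true;       /* hit: no network round trip */
            }
            pos += cache[i].length;
        }
        return false;              /* miss: send a new request */
    }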

> > Also, an idea on 2-4:
> > 
> >     As we say that dirtiness is unknown to NBD, and an external tool
> >     should specify, manage and understand which data is actually
> >     transmitted, why not just call it user_data and leave the status field
> >     of the reply chunk unspecified in this case?
> > 
> >     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> >     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> >     Eric's 'Block provisioning status' paragraph.  If it is set, we just
> >     leave the status field to some external... protocol? Who knows what
> >     this user data is.
> 
> Yes, this sounds like a reasonable approach.

Makes sense to me, too.

However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
have use for a second one. If we go with one of the options where the
bitmap is selected per command, we're fine because you can simply move
the second bit to a different bitmap and do two requests. If we have
only a single active bitmap at a time, though, this feels like an actual
problem.

> > Another idea, about backups themselves:
> > 
> >     Why do we need allocated/zero status for backup? IMHO we don't.
> 
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)

I think we do. A block can be dirtied by discarding/write_zero-ing it.
Then we want the dirty bit so that we know to include this block in
the incremental backup, but we also want to know that we don't actually
have to transfer the data in it.
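
In other words, something like this (sketch only; bit values as in the
current extension text, with a clear NBD_STATE_CLEAN meaning "dirty"):

    #include <stdbool.h>
    #include <stdint.h>

    #define NBD_STATE_ZERO  (1u << 1)   /* bit 1 */
    #define NBD_STATE_CLEAN (1u << 2)   /* bit 2 */

    /* Decide what an incremental backup does with one extent. */
    static bool extent_needs_data_transfer(uint32_t status)
    {
        bool dirty = !(status & NBD_STATE_CLEAN);   /* clear == dirty */
        bool zero  = status & NBD_STATE_ZERO;

        /* Dirty extents must appear in the backup, but an extent known
         * to read as all zeroes can be recorded without its bytes. */
        return dirty && !zero;
    }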

Kevin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-28 17:33     ` Wouter Verhelst
  2016-11-29  9:17       ` Stefan Hajnoczi
@ 2016-11-29 10:50       ` Wouter Verhelst
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
                           ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-11-29 10:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, den, mpa, pbonzini

Hi,

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.

To make this a bit more concrete, I've changed the proposal like so:

From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
From: Wouter Verhelst <w@uter.be>
Date: Tue, 29 Nov 2016 11:46:04 +0100
Subject: [PATCH] Update with allocation contexts etc

Signed-off-by: Wouter Verhelst <w@uter.be>
---
 doc/proto.md | 210 +++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 145 insertions(+), 65 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index dfdd06d..fe7ae53 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -768,8 +768,6 @@ The field has the following format:
   to that command to the client. In the absense of this flag, clients
   SHOULD NOT multiplex their commands over more than one connection to
   the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
-  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -871,6 +869,46 @@ of the newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_OPT_ALLOC_CONTEXT` (10)
+
+    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+    followed by an `NBD_REP_ACK`. If a server replies to such a request
+    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+    commands during the transmission phase.
+
+    If the query string is syntactically invalid, the server SHOULD send
+    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+    but finds no allocation contexts, the server MUST send a single
+    reply of type `NBD_REP_ACK`.
+
+    This option MUST NOT be requested unless structured replies have
+    been negotiated first. If a client attempts to do so, a server
+    SHOULD send `NBD_REP_ERR_INVALID`.
+
+    Data:
+    - 32 bits, type
+    - String, query to select a subset of the available allocation
+      contexts. If this is not specified (i.e., length is 4 and no
+      command is sent), then the server MUST send all the allocation
+      contexts it knows about. If specified, this query string MUST
+      start with a name that uniquely identifies a server
+      implementation; e.g., the reference implementation that
+      accompanies this document would support query strings starting
+      with 'nbd-server:'
+
+    The type may be one of:
+    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+      selected by the query string is returned to the client without
+      changing any state (i.e., this does not add allocation contexts
+      for further usage).
+    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+      selected by the query string is added to the list of existing
+      allocation contexts.
+    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+      selected by the query string is removed from the list of used
+      allocation contexts. Servers SHOULD NOT reuse existing allocation
+      context IDs.
+
 #### Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_REP_ALLOC_CONTEXT` (4)
+
+    A description of an allocation context. Data:
+
+    - 32 bits, NBD allocation context ID. If the request was NOT of type
+      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+    - String, name of the allocation context. This is not required to be
+      a human-readable string, but it MUST be valid UTF-8 data.
+
+    Allocation context ID 0 is implied, and always exists. It cannot be
+    removed.
+
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
 case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for display to the user.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
+* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
 
     The server is unwilling to continue negotiation as it is in the
     process of being shut down.
 
-* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
+* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+##### Allocation contexts
+
+Allocation context 0 is the basic "exists at all" allocation context. If
+an extent is not allocated at allocation context 0, it MUST NOT be
+listed as allocated at another allocation context. This supports sparse
+file semantics on the server side. If a server has only one allocation
+context (the default), then writing to an extent which is allocated in
+that allocation context 0 MUST NOT fail with ENOSPC.
+
+For all other cases, this specification requires no specific semantics
+of allocation contexts. Implementations could support allocation
+contexts with semantics like the following:
+
+- Incremental snapshots; if a block is allocated in one allocation
+  context, that implies that it is also allocated in the next level up.
+- Various bits of data about the backend of the storage; e.g., if the
+  storage is written on a RAID array, an allocation context could
+  return information about the redundancy level of a given extent
+- If the backend implements a write-through cache of some sort, or
+  synchronises with other servers, an allocation context could state
+  that an extent is "allocated" once it has reached permanent storage
+  and/or is synchronized with other servers.
+
+The only requirement of an allocation context is that it MUST be
+representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
+
+Likewise, the syntax of query strings is not specified by this document.
+
+Server implementations SHOULD document their syntax for query strings
+and semantics for resulting allocation contexts in a document like this
+one.
+
 ### Transmission phase
 
 #### Flag fields
@@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake phase.
    content chunk in reply.  MUST NOT be set unless the transmission
    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
    `EOVERFLOW` error chunk, if the request length is too large.
+- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
+  set, the client is interested in only one extent per allocation
+  context.
 
 ##### Structured reply flags
 
@@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
 ranges with their respective states.  This extension is not available
 unless the client also negotiates the `STRUCTURED_REPLY` extension.
 
-* `NBD_FLAG_SEND_BLOCK_STATUS`
-
-    The server SHOULD set this transmission flag to 1 if structured
-    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
-    request is supported.
-
 * `NBD_REPLY_TYPE_BLOCK_STATUS`
 
-    *length* MUST be a positive integer multiple of 8.  This reply
+    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
     represents a series of consecutive block descriptors where the sum
     of the lengths of the descriptors MUST not be greater than the
-    length of the original request.  This chunk type MUST appear at most
-    once in a structured reply. Valid as a reply to
+    length of the original request. This chunk type MUST appear at most
+    once per allocation ID in a structured reply. Valid as a reply to
     `NBD_CMD_BLOCK_STATUS`.
 
-    The payload is structured as a list of one or more descriptors,
-    each with this layout:
+    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
+    allocation context ID, except if the semantics of particular
+    allocation contexts mean that the information for one allocation
+    context is implied by the information for another.
+
+    The payload starts with:
+
+        * 32 bits, allocation context ID
+
+    and is followed by a list of one or more descriptors, each with this
+    layout:
 
         * 32 bits, length (unsigned, MUST NOT be zero)
         * 32 bits, status flags
 
-    The definition of the status flags is determined based on the
-    flags present in the original request.
+    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
+    then every reply chunk MUST NOT contain more than one descriptor.
+
+    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
+    its request, the server MAY return fewer descriptors in the reply
+    than would be required to fully specify the whole range of requested
+    information to the client, if the number of descriptors would be
+    over 16 otherwise and looking up the information would be too
+    resource-intensive for the server.
 
 * `NBD_CMD_BLOCK_STATUS`
 
-    A block status query request. Length and offset define the range
-    of interest. Clients SHOULD NOT use this request unless the server
-    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
-    in turn requires the client to first negotiate structured replies.
-    For a successful return, the server MUST use a structured reply,
-    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+    A block status query request. Length and offset define the range of
+    interest. Clients SHOULD NOT use this request unless allocation
+    contexts have been negotiated, which in turn requires the client to
+    first negotiate structured replies. For a successful return, the
+    server MUST use a structured reply, containing at least one chunk of
+    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
 
     The list of block status descriptors within the
     `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
@@ -1427,18 +1522,12 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
     requested length, and *status* of 0 rather than reporting the
     error.
 
-    The type of information requested by the client is determined by
-    the request flags, as follows:
-
-    1. Block provisioning status
+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
+    return the status of the device, where the status field of each
+    descriptor is determined by the following bits (all combinations of
+    these bits are possible):
 
-    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
-    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
-    provisioning status of the device, where the status field of each
-    descriptor is determined by the following bits (all four
-    combinations of these two bits are possible):
-
-      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
         (and future writes to that area may cause fragmentation or
         encounter an `ENOSPC` error); if clear, the block is allocated
         or the server could not otherwise determine its status.  Note
@@ -1446,43 +1535,34 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
         that the server MAY report a hole even where trim has not been
         requested, and also that a server MAY report allocation even
         where a trim has been requested.
-      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
+      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
         all zeroes; if clear, the block contents are not known.  Note
         that the use of `NBD_CMD_WRITE_ZEROES` is related to this
         status, but that the server MAY report zeroes even where write
         zeroes has not been requested, and also that a server MAY
         report unknown content even where write zeroes has been
         requested.
-
-    The client SHOULD NOT read from an area that has both
-    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
-
-    2. Block dirtiness status
-
-    This command is meant to operate in tandem with other (non-NBD)
-    channels to the server.  Generally, a "dirty" block is a block
-    that has been written to by someone, but the exact meaning of "has
-    been written" is left to the implementation.  For example, a
-    virtual machine monitor could provide a (non-NBD) command to start
-    tracking blocks written by the virtual machine.  A backup client
-    can then connect to an NBD server provided by the virtual machine
-    monitor and use `NBD_CMD_BLOCK_STATUS` with the
-    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
-    blocks that the virtual machine has changed.
-
-    An implementation that doesn't track the "dirtiness" state of
-    blocks MUST either fail this command with `EINVAL`, or mark all
-    blocks as dirty in the descriptor that it returns.  Upon receiving
-    an `NBD_CMD_BLOCK_STATUS` command with the flag
-    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
-    status of the device, where the status field of each descriptor is
-    determined by the following bit:
-
-      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
         portion of the file that is still clean because it has not
         been written; if clear, the block represents a portion of the
         file that is dirty, or where the server could not otherwise
-        determine its status.
+        determine its status. The server MUST NOT set this bit for
+        allocation context 0, where it has no meaning.
+      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
+	portion of the file that is "active" in the given allocation
+	context. The server MUST NOT set this bit for allocation context
+	0, where it has no meaning.
+
+    The exact semantics of what it means for a block to be "clean" or
+    "active" at a given allocation context is not defined by this
+    specification, except that the default in both cases should be to
+    clear the bit. That is, when the allocation context does not have
+    knowledge of the relevant status for the given extent, or when the
+    allocation context does not assign any meaning to it, the bits
+    should be cleared.
+
+    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
+    set and `NBD_STATE_ZERO` clear.
 
 A client MAY close the connection if it detects that the server has
 sent an invalid chunks (such as lengths in the
@@ -1492,9 +1572,9 @@ request including one or more sectors beyond the size of the device.
 
 The extension adds the following new command flag:
 
-- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
-  SHOULD be set to 1 if the client wants to request dirtiness status
-  rather than provisioning status.
+- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request information for only
+  one extent per allocation context.
 
 ## About this file

I've pushed this to the extension-blockstatus branch, too.

Thoughts?
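
For clarity, a client would walk the proposed chunk payload roughly like
this (illustrative C only; big-endian wire format, error handling
elided):

    #include <arpa/inet.h>   /* ntohl() */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* One NBD_REPLY_TYPE_BLOCK_STATUS payload: a 32-bit allocation
     * context ID followed by (length, status) descriptor pairs. */
    static int parse_block_status(const uint8_t *payload, size_t len)
    {
        uint32_t ctx_id, extent_len, status;

        if (len < 4 + 8 || (len - 4) % 8 != 0) {
            return -1;                     /* malformed chunk */
        }
        memcpy(&ctx_id, payload, 4);
        ctx_id = ntohl(ctx_id);

        for (size_t off = 4; off < len; off += 8) {
            memcpy(&extent_len, payload + off, 4);
            memcpy(&status, payload + off + 4, 4);
            printf("context %u: %u bytes, status 0x%x\n",
                   ctx_id, ntohl(extent_len), ntohl(status));
        }
        return 0;
    }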
 
-- 
2.10.2


-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:18   ` Kevin Wolf
@ 2016-11-29 11:34     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 11:34 UTC (permalink / raw)
  To: Kevin Wolf, Wouter Verhelst
  Cc: nbd-general, qemu-devel, den, pborzenkov, stefanha, mpa, pbonzini

29.11.2016 13:18, Kevin Wolf wrote:
> Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
>>> 3. Q: selecting of dirty bitmap to export
>>>     A: several variants:
>>>        1: id of bitmap is in flags field of request
>>>            pros: - simple
>>>            cons: - it's a hack. flags field is for other uses.
>>>                  - we'll have to map bitmap names to these "ids"
>>>        2: introduce extended nbd requests with variable length and exploit this
>>>           feature for BLOCK_STATUS command, specifying bitmap identifier.
>>>           pros: - looks like a true way
>>>           cons: - we have to create additional extension
>>>                 - possibly we have to create a map,
>>>                   {<QEMU bitmap name> <=> <NBD bitmap id>}
>>>        3: external tool should select which bitmap to export. So, in the case of Qemu
>>>           it should be something like qmp command block-export-dirty-bitmap.
>>>           pros: - simple
>>>                 - we can extend it to behave like (2) later
>>>           cons: - additional qmp command to implement (possibly, the lesser evil)
>>>           note: Hmm, the external tool can choose between allocated/dirty data too,
>>>                 so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
>> Downside of 3, though, is that it moves the definition of what the
>> different states mean outside of the NBD protocol (i.e., the protocol
>> messages are not entirely defined anymore, and their meaning depends on
>> the clients and servers in use).
> Another point to consider is that option 3 doesn't allow you to access
> two (or more) different bitmaps from the client without using the side
> channel all the time to switch back and forth between them and having to
> drain the request queue each time to avoid races.
>
> In general, if we have something that "looks like the true way", I'd
> advocate to choose this option. Experience tells that we'd regret
> anything simpler in a year or two, and then we'll have to do the real
> thing anyway, but still need to support the quick hack for
> compatibility.
>
>>> 5. The number of status descriptors sent by the server should be restricted
>>>     variants:
>>>     1: just allow server to restrict this as it wants (which was done in v3)
>>>     2: (not excluding 1). Client specifies somehow the maximum for number
>>>        of descriptors.
>>>        2.1: add command flag, which will request only one descriptor
>>>             (otherwise, no restrictions from the client)
>>>        2.2: again, introduce extended nbd requests, and add field to
>>>             specify this maximum
>> I think having a flag which requests just one descriptor can be useful,
>> but I'm hesitant to add it unless it's actually going to be used; so in
>> other words, I'll leave the decision on that bit to you.
> The native qemu block layer interface returns the status of only one
> contiguous chunk, so the easiest way to implement the NBD block driver
> in qemu would be to always use that bit.
>
> On the other hand, it would be possible for the NBD block driver in qemu
> to cache the rest of the response internally and answer the next request
> from the cache instead of sending a new request over the network. Maybe
> that's what it should be doing anyway for good performance.
>
>>> Also, an idea on 2-4:
>>>
> >>      As we say that dirtiness is unknown to NBD, and an external tool
> >>      should specify, manage and understand which data is actually
> >>      transmitted, why not just call it user_data and leave the status field
> >>      of the reply chunk unspecified in this case?
>>>
>>>      So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> >>      NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> >>      Eric's 'Block provisioning status' paragraph.  If it is set, we just
> >>      leave the status field to some external... protocol? Who knows what
> >>      this user data is.
>> Yes, this sounds like a reasonable approach.
> Makes sense to me, too.
>
> However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
> have use for a second one. If we go with one of the options where the
> bitmap is selected per command, we're fine because you can simply move
> the second bit to a different bitmap and do two requests. If we have
> only a single active bitmap at a time, though, this feels like an actual
> problem.
>
>>> Another idea, about backups themselves:
>>>
>>>      Why do we need allocated/zero status for backup? IMHO we don't.
>> Well, I've been thinking so all along, but then I don't really know what
>> it is, in detail, that you want to do :-)
> I think we do. A block can be dirtied by discarding/write_zero-ing it.
> Then we want the dirty bit so that we know to include this block in
> the incremental backup, but we also want to know that we don't actually
> have to transfer the data in it.

And we will know that automatically by using the structured read command, 
so a separate call to get_block_status is not needed.
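
I.e., roughly (sketch only; the chunk type values are the ones I believe
the experimental STRUCTURED_REPLY draft assigns, and the backup_*
helpers are hypothetical):

    #include <stdint.h>

    #define NBD_REPLY_TYPE_OFFSET_DATA 1   /* assumed draft values */
    #define NBD_REPLY_TYPE_OFFSET_HOLE 2

    /* Hypothetical backup sinks. */
    void backup_write(uint64_t offset, const void *data, uint32_t length);
    void backup_mark_zero(uint64_t offset, uint32_t length);

    /* While reading a dirty extent with a structured NBD_CMD_READ,
     * hole chunks already tell us which ranges read as zeroes, so no
     * separate block-status query is needed for that. */
    static void handle_read_chunk(uint16_t type, uint64_t offset,
                                  uint32_t length, const void *data)
    {
        switch (type) {
        case NBD_REPLY_TYPE_OFFSET_DATA:
            backup_write(offset, data, length);    /* payload bytes */
            break;
        case NBD_REPLY_TYPE_OFFSET_HOLE:
            backup_mark_zero(offset, length);      /* no payload */
            break;
        }
    }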

>
> Kevin


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:50       ` Wouter Verhelst
@ 2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
  2016-11-29 13:08           ` Wouter Verhelst
  2016-11-29 13:07         ` Alex Bligh
  2016-12-01 10:14         ` Wouter Verhelst
  2 siblings, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 12:41 UTC (permalink / raw)
  To: Wouter Verhelst, Stefan Hajnoczi
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, den, mpa, pbonzini

Hi,

29.11.2016 13:50, Wouter Verhelst wrote:
> Hi,
>
> On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
>> However, I'm arguing that if we're going to provide information about
>> snapshots, we should be able to properly refer to these snapshots from
>> within an NBD context. My previous mail suggested adding a negotiation
>> message that would essentially ask the server "tell me about the
>> snapshots you know about", giving them an NBD identifier in the process
>> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
>> identifier and that could be used to match the NBD identifier to
>> something implementation-defined). This would be read-only information;
>> the client cannot ask the server to create new snapshots. We can then
>> later in the protocol refer to these snapshots by way of that NBD
>> identifier.
> To make this a bit more concrete, I've changed the proposal like so:
>
>  From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
> From: Wouter Verhelst <w@uter.be>
> Date: Tue, 29 Nov 2016 11:46:04 +0100
> Subject: [PATCH] Update with allocation contexts etc
>
> Signed-off-by: Wouter Verhelst <w@uter.be>
> ---
>   doc/proto.md | 210 +++++++++++++++++++++++++++++++++++++++++------------------
>   1 file changed, 145 insertions(+), 65 deletions(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index dfdd06d..fe7ae53 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -768,8 +768,6 @@ The field has the following format:
>     to that command to the client. In the absense of this flag, clients
>     SHOULD NOT multiplex their commands over more than one connection to
>     the export.
> -- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
> -  `BLOCK_STATUS` extension; see below.
>   
>   Clients SHOULD ignore unknown flags.
>   
> @@ -871,6 +869,46 @@ of the newstyle negotiation.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.
> +
> +    If the query string is syntactically invalid, the server SHOULD send
> +    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> +    but finds no allocation contexts, the server MUST send a single
> +    reply of type `NBD_REP_ACK`.
> +
> +    This option MUST NOT be requested unless structured replies have
> +    been negotiated first. If a client attempts to do so, a server
> +    SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +    Data:
> +    - 32 bits, type
> +    - String, query to select a subset of the available allocation
> +      contexts. If this is not specified (i.e., length is 4 and no
> +      command is sent), then the server MUST send all the allocation
> +      contexts it knows about. If specified, this query string MUST
> +      start with a name that uniquely identifies a server
> +      implementation; e.g., the reference implementation that
> +      accompanies this document would support query strings starting
> +      with 'nbd-server:'
> +
> +    The type may be one of:
> +    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
> +      selected by the query string is returned to the client without
> +      changing any state (i.e., this does not add allocation contexts
> +      for further usage).
> +    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> +      selected by the query string is added to the list of existing
> +      allocation contexts.
> +    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
> +      selected by the query string is removed from the list of used
> +      allocation contexts. Servers SHOULD NOT reuse existing allocation
> +      context IDs.
> +
>   #### Option reply types
>   
>   These values are used in the "reply type" field, sent by the server
> @@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> +- `NBD_REP_ALLOC_CONTEXT` (4)
> +
> +    A description of an allocation context. Data:
> +
> +    - 32 bits, NBD allocation context ID. If the request was NOT of type
> +      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
> +    - String, name of the allocation context. This is not required to be
> +      a human-readable string, but it MUST be valid UTF-8 data.
> +
> +    Allocation context ID 0 is implied, and always exists. It cannot be
> +    removed.
> +
>   There are a number of error reply types, all of which are denoted by
>   having bit 31 set. All error replies MAY have some data set, in which
>   case that data is an error message string suitable for display to the user.
> @@ -938,15 +988,47 @@ case that data is an error message string suitable for display to the user.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
> +* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
>   
>       The server is unwilling to continue negotiation as it is in the
>       process of being shut down.
>   
> -* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
> +* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> +##### Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse

Is "allocated" here a range with the NBD_STATE_HOLE bit unset?


> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.

IMHO, this paragraph is redundant for this spec.

> +
>   ### Transmission phase
>   
>   #### Flag fields
> @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake phase.
>      content chunk in reply.  MUST NOT be set unless the transmission
>      flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>      `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per allocation
> +  context.
>   
>   ##### Structured reply flags
>   
> @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
>   ranges with their respective states.  This extension is not available
>   unless the client also negotiates the `STRUCTURED_REPLY` extension.
>   
> -* `NBD_FLAG_SEND_BLOCK_STATUS`
> -
> -    The server SHOULD set this transmission flag to 1 if structured
> -    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
> -    request is supported.
> -
>   * `NBD_REPLY_TYPE_BLOCK_STATUS`
>   
> -    *length* MUST be a positive integer multiple of 8.  This reply
> +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
>       represents a series of consecutive block descriptors where the sum
>       of the lengths of the descriptors MUST not be greater than the
> -    length of the original request.  This chunk type MUST appear at most
> -    once in a structured reply. Valid as a reply to
> +    length of the original request. This chunk type MUST appear at most
> +    once per allocation ID in a structured reply. Valid as a reply to
>       `NBD_CMD_BLOCK_STATUS`.
>   
> -    The payload is structured as a list of one or more descriptors,
> -    each with this layout:
> +    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> +    allocation context ID, except if the semantics of particular
> +    allocation contexts mean that the information for one allocation
> +    context is implied by the information for another.

So, actually, instead of selecting which bitmap we want to access with an 
nbd_cmd or with an external tool, we just reply with all bitmaps 
(negotiated at the beginning). Personally, I dislike this. With such an 
approach, we will always have to export the allocation bitmap, even if we 
need only dirtiness. Consider that requesting the allocation bitmap may be 
much more expensive in time than requesting dirtiness. Or is the 
allocation bitmap information implied by dirtiness?

Furthermore, as allocation context semantics are defined externally, 
'semantics mean that the information is implied' states nothing, and we 
actually return to option 3, where an external tool decides which bitmap 
to export.

> +
> +    The payload starts with:
> +
> +        * 32 bits, allocation context ID
> +
> +    and is followed by a list of one or more descriptors, each with this
> +    layout:
>   
>           * 32 bits, length (unsigned, MUST NOT be zero)
>           * 32 bits, status flags
>   
> -    The definition of the status flags is determined based on the
> -    flags present in the original request.
> +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
> +    then every reply chunk MUST NOT contain more than one descriptor.
> +
> +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> +    its request, the server MAY return fewer descriptors in the reply
> +    than would be required to fully specify the whole range of requested
> +    information to the client, if the number of descriptors would be
> +    over 16 otherwise and looking up the information would be too
> +    resource-intensive for the server.

So, if there are <= 16 extents, they all MUST be present in the reply. 
(Just a note.)

>   
>   * `NBD_CMD_BLOCK_STATUS`
>   
> -    A block status query request. Length and offset define the range
> -    of interest. Clients SHOULD NOT use this request unless the server
> -    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
> -    in turn requires the client to first negotiate structured replies.
> -    For a successful return, the server MUST use a structured reply,
> -    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +    A block status query request. Length and offset define the range of
> +    interest. Clients SHOULD NOT use this request unless allocation
> +    contexts have been negotiated, which in turn requires the client to
> +    first negotiate structured replies. For a successful return, the
> +    server MUST use a structured reply, containing at least one chunk of
> +    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
>   
>       The list of block status descriptors within the
>       `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> @@ -1427,18 +1522,12 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>       requested length, and *status* of 0 rather than reporting the
>       error.
>   
> -    The type of information requested by the client is determined by
> -    the request flags, as follows:
> -
> -    1. Block provisioning status
> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
> +    return the status of the device, where the status field of each
> +    descriptor is determined by the following bits (all combinations of
> +    these bits are possible):
>   
> -    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
> -    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
> -    provisioning status of the device, where the status field of each
> -    descriptor is determined by the following bits (all four
> -    combinations of these two bits are possible):
> -
> -      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
> +      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
>           (and future writes to that area may cause fragmentation or
>           encounter an `ENOSPC` error); if clear, the block is allocated
>           or the server could not otherwise determine its status.  Note
> @@ -1446,43 +1535,34 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>           that the server MAY report a hole even where trim has not been
>           requested, and also that a server MAY report allocation even
>           where a trim has been requested.
> -      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
> +      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
>           all zeroes; if clear, the block contents are not known.  Note
>           that the use of `NBD_CMD_WRITE_ZEROES` is related to this
>           status, but that the server MAY report zeroes even where write
>           zeroes has not been requested, and also that a server MAY
>           report unknown content even where write zeroes has been
>           requested.
> -
> -    The client SHOULD NOT read from an area that has both
> -    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> -
> -    2. Block dirtiness status
> -
> -    This command is meant to operate in tandem with other (non-NBD)
> -    channels to the server.  Generally, a "dirty" block is a block
> -    that has been written to by someone, but the exact meaning of "has
> -    been written" is left to the implementation.  For example, a
> -    virtual machine monitor could provide a (non-NBD) command to start
> -    tracking blocks written by the virtual machine.  A backup client
> -    can then connect to an NBD server provided by the virtual machine
> -    monitor and use `NBD_CMD_BLOCK_STATUS` with the
> -    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
> -    blocks that the virtual machine has changed.
> -
> -    An implementation that doesn't track the "dirtiness" state of
> -    blocks MUST either fail this command with `EINVAL`, or mark all
> -    blocks as dirty in the descriptor that it returns.  Upon receiving
> -    an `NBD_CMD_BLOCK_STATUS` command with the flag
> -    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
> -    status of the device, where the status field of each descriptor is
> -    determined by the following bit:
> -
> -      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
> +      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
>           portion of the file that is still clean because it has not
>           been written; if clear, the block represents a portion of the
>           file that is dirty, or where the server could not otherwise
> -        determine its status.
> +        determine its status. The server MUST NOT set this bit for
> +        allocation context 0, where it has no meaning.
> +      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
> +	portion of the file that is "active" in the given allocation
> +	context. The server MUST NOT set this bit for allocation context
> +	0, where it has no meaning.
> +
> +    The exact semantics of what it means for a block to be "clean" or
> +    "active" at a given allocation context is not defined by this
> +    specification, except that the default in both cases should be to
> +    clear the bit. That is, when the allocation context does not have
> +    knowledge of the relevant status for the given extent, or when the
> +    allocation context does not assign any meaning to it, the bits
> +    should be cleared.
> +
> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
> +    set and `NBD_STATE_ZERO` clear.
>   
>   A client MAY close the connection if it detects that the server has
>   sent an invalid chunk (such as lengths in the
> @@ -1492,9 +1572,9 @@ request including one or more sectors beyond the size of the device.
>   
>   The extension adds the following new command flag:
>   
> -- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
> -  SHOULD be set to 1 if the client wants to request dirtiness status
> -  rather than provisioning status.
> +- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request information for only
> +  one extent per allocation context.
>   
>   ## About this file
>
> I've pushed this to the extension-blockstatus branch, too.
>
> Thoughts?
>   

To me, treating dirtiness as just another allocation context sounds a
bit weird, as dirtiness is not allocation. But it is not so important.
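
To make the bit semantics above concrete for myself, a rough sketch of
decoding one descriptor's status field (Python; bit values as in the
proposal above, everything else illustrative):

NBD_STATE_HOLE   = 1 << 0  # unallocated; writes may fragment or ENOSPC
NBD_STATE_ZERO   = 1 << 1  # reads as all zeroes
NBD_STATE_CLEAN  = 1 << 2  # meaningless for allocation context 0
NBD_STATE_ACTIVE = 1 << 3  # meaningless for allocation context 0

def describe(flags, context_id):
    parts = ["hole" if flags & NBD_STATE_HOLE else "allocated-or-unknown",
             "zero" if flags & NBD_STATE_ZERO else "content-unknown"]
    if context_id != 0:
        parts.append("clean" if flags & NBD_STATE_CLEAN else "dirty-or-unknown")
        parts.append("active" if flags & NBD_STATE_ACTIVE else "inactive-or-unknown")
    # per the draft, HOLE set with ZERO clear means "don't read this area"
    return ", ".join(parts)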


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-25 11:28 [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension Vladimir Sementsov-Ogievskiy
  2016-11-25 14:02 ` Stefan Hajnoczi
  2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-11-29 12:57 ` Alex Bligh
  2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
  2 siblings, 2 replies; 37+ messages in thread
From: Alex Bligh @ 2016-11-29 12:57 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

Vladimir,

I went back to April to reread the previous train of conversation,
and found you had helpfully summarised some of it. Comments
below.

Rather than comment on many of the points individually: the root
of my confusion, and to some extent discomfort, about this
proposal is 'who owns the meaning of the bitmaps'.

Some of this is my own confusion (sorry) about the use to which
this is being put, which is I think at root a documentation issue.
To illustrate this, you write in the FAQ section that this is for
read-only disks, but the text talks about:

+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)

I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.

In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets the bitmap, which is outside
the scope of this protocol; the protocol merely transmits that state.

However, we have the uncomfortable (to me) situation where the protocol
describes a flag 'dirty', with implications as to what it does, but
no actual strict definition of how it's set. So any 'other' user has
no real idea how to use the information, or how to implement a server
that provides a 'dirty' bit, because the semantics of that aren't within
the protocol. This does not sit happily with me.

So I'm wondering whether we should simplify and generalise this spec. You
say that for the dirty flag, there's no specification of when it is
set and cleared - that's implementation defined. Would it not be better
then to say 'that whole thing is private to Qemu - even the name'?

Rather you could read the list of bitmaps a server has, with a textual
name, each having an index (and perhaps a granularity). You could then
ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
and some (e.g. ones beginning with 'X-') could be reserved for user
usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
need a protocol change to support it.
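
To sketch what I mean (purely illustrative: nothing below is in the
spec, and the wire layout is invented for the example), each advertised
bitmap could carry an index, a granularity and a name:

import struct

def parse_bitmap_entry(payload):
    # hypothetical entry: u32 index, u32 granularity, u32 name length, name
    index, granularity, name_len = struct.unpack_from(">III", payload, 0)
    name = payload[12:12 + name_len].decode("utf-8")
    return index, granularity, name

entry = struct.pack(">III", 1, 65536, 12) + b"X-QEMU-DIRTY"
print(parse_bitmap_entry(entry))   # (1, 65536, 'X-QEMU-DIRTY')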

IE rather than looking at 'a way of reading the dirty bit', we could
have this as a generic way of reading opaque bitmaps. Only one (allocation)
might be given meaning to start off with, and it wouldn't be necessary
for all servers to support that - i.e. you could support bitmap reading
without having an ALLOCATION bitmap available.

This spec would then be limited to the transmission of the bitmaps
(remove the word 'dirty' entirely, except perhaps as an illustrative
use case), and include only the definition of the allocation bitmap.

Some more nits:

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in master branch.
> 
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Each documentation branch should normally be branched off master unless
it depends on another extension (in which case it will be branched from that).
I haven't been rebasing them frequently as it can disrupt those working
on the branches. There's only really an issue around rebasing where you
depend on another branch.


> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>   A: 1: server replies with status descriptors of any size, granularity
>         is hidden from the client
>      2: dirty/allocated requests are separate and unrelated to each
>         other, so their granularities are not intersecting

I'm OK with this, but note that you do actually mention a granularity
of sorts in the spec (512 bytes) - I think you should replace that
with the minimum block size.

> 3. Q: selecting of dirty bitmap to export
>   A: several variants:
>      1: id of bitmap is in flags field of request
>          pros: - simple
>          cons: - it's a hack. flags field is for other uses.
>                - we'll have to map bitmap names to these "ids"
>      2: introduce extended nbd requests with variable length and exploit this
>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>         pros: - looks like a true way
>         cons: - we have to create additional extension
>               - possible we have to create a map,
>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>      3: external tool should select which bitmap to export. So, in case of Qemu
>         it should be something like qmp command block-export-dirty-bitmap.
>         pros: - simple
>               - we can extend it to behave like (2) later
>         cons: - additional qmp command to implement (possibly, the lesser evil)
>        note: Hmm, external tool can choose between allocated/dirty data too,
>               so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Yes, this is all pretty horrible. I suspect we want to do something like (2),
and permit extra data to be sent across (in my proposal, it would only be one byte to select
the index). I suppose one could ask for a list of bitmaps.
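
Illustrative only: the fixed 28-byte request layout below is the
existing one; the trailing index byte and the command number are
hypothetical.

import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_BLOCK_STATUS = 7      # hypothetical value, for the example only

def block_status_request(handle, offset, length, bitmap_index):
    req = struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, 0,
                      NBD_CMD_BLOCK_STATUS, handle, offset, length)
    return req + struct.pack(">B", bitmap_index)   # the one extra byte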

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>   cons: Two commands with almost same semantic and similar means?
>   pros: However here is a good point of separating clearly defined and native
>         for block devices GET_BLOCK_STATUS from user-driven and actually
>         undefined data, called 'dirtyness'.

I'm suggesting one generic 'read bitmap' command like you.

> 5. Number of status descriptors, sent by server, should be restricted
>   variants:
>   1: just allow server to restrict this as it wants (which was done in v3)
>   2: (not excluding 1). Client specifies somehow the maximum for number
>      of descriptors.
>      2.1: add command flag, which will request only one descriptor
>           (otherwise, no restrictions from the client)
>      2.2: again, introduce extended nbd requests, and add field to
>           specify this maximum

I think some form of extended request is the way to go, but out of
interest, what's the issue with as many descriptors being sent as it
takes to encode the reply? The client can just consume the remainder
(without buffering) and reissue the request at a later point for
the areas it discarded.
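
Roughly what I have in mind on the client side; a sketch, with
block_status() standing in for a hypothetical helper that yields
(length, flags) pairs from the reply:

def read_some_extents(conn, offset, length, keep_at_most):
    # consume every descriptor, but buffer only the first keep_at_most
    kept, pos = [], offset
    for ext_len, flags in conn.block_status(offset, length):
        pos += ext_len
        if len(kept) < keep_at_most:
            kept.append((pos - ext_len, ext_len, flags))
    resume = kept[-1][0] + kept[-1][1] if kept else offset
    return kept, resume   # re-issue from 'resume' for the discarded area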

> 
> 6. A: What to do with unspecified flags (in request/reply)?
>   I think the normal variant is to make them reserved. (Server should
>   return EINVAL if it finds unknown bits, client should consider a reply
>   with unknown bits as an error)

Yeah.

> 
> +
> +* `NBD_CMD_BLOCK_STATUS`
> +
> +    A block status query request. Length and offset define the range
> +    of interest. Clients SHOULD NOT use this request unless the server

MUST NOT is what we say elsewhere I believe.

> +    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
> +    in turn requires the client to first negotiate structured replies.
> +    For a successful return, the server MUST use a structured reply,
> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.

Nit: are you saying that non-structured error replies are permissible?
You're always/often going to get a non-structured (simple) error reply
if the server doesn't support the command, but I think it would be fair to say the
server MUST use a structured reply to NBD_CMD_BLOCK_STATUS if
it supports the command. This is effectively what we say re NBD_CMD_READ.

> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file starting from specified *offset*, and the sum of the
> +    *length* fields of each descriptor MUST NOT be greater than the
> +    overall *length* of the request. This means that the server MAY
> +    return less data than requested. However the server MUST return at
> +    least one status descriptor

I'm not sure I understand why that's useful. What should the client
infer from the server refusing to provide information? We don't
permit short reads etc.

> .  The server SHOULD use different
> +    *status* values between consecutive descriptors, and SHOULD use
> +    descriptor lengths that are an integer multiple of 512 bytes where
> +    possible (the first and last descriptor of an unaligned query being
> +    the most obvious places for an exception).

Surely better would be an integer multiple of the minimum block
size. Being able to offer bitmap support at finer granularity than
the absolute minimum block size helps no one, and if it were possible
to support a 256 byte block size (I think some floppy disks had that)
I see no reason not to support that as a granularity.

-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:50       ` Wouter Verhelst
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 13:07         ` Alex Bligh
  2016-12-01 10:14         ` Wouter Verhelst
  2 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-11-29 13:07 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Alex Bligh, Stefan Hajnoczi, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, qemu-devel, Pavel Borzenkov,
	Paolo Bonzini, Markus Pargmann, den


> On 29 Nov 2016, at 10:50, Wouter Verhelst <w@uter.be> wrote:
> 
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.

I haven't read this in detail but this seems to me to be similar to
the idea I just posted (sorry - kept getting interrupted whilst writing
the email) re multiple bitmaps?

But the name 'ALLOC_CONTEXT' is a bit weird. Why not call it 'metadata
bitmaps' or 'metadata extents' or something? Metadata seems right as
it's data about data.

> +##### Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> +

But this seems strange. Whether something is 'allocated' seems orthogonal
to me to whether it needs backing up. Even the fact that it's been zeroed
(now a hole) might need backing up.

So don't we need multiple independent lists of extents? Of course a server
might *implement* them under the hood with separate bitmaps or one big
bitmap, or no bitmap at all (for instance using file extents on POSIX).
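
For instance, a server backed by a sparse file could synthesise the
allocation extent list with nothing but lseek. A rough sketch, assuming
the platform supports SEEK_DATA/SEEK_HOLE:

import errno, os

def allocation_extents(fd, offset, length):
    end, extents = offset + length, []
    while offset < end:
        try:
            data = os.lseek(fd, offset, os.SEEK_DATA)  # next data at/after offset
        except OSError as e:
            if e.errno != errno.ENXIO:
                raise
            data = end                                 # only a hole remains
        if data > offset:
            extents.append((offset, min(data, end) - offset, "hole"))
        if data >= end:
            break
        hole = os.lseek(fd, data, os.SEEK_HOLE)        # end of this data run
        extents.append((data, min(hole, end) - data, "data"))
        offset = hole
    return extents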



-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 13:08           ` Wouter Verhelst
  0 siblings, 0 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-11-29 13:08 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Stefan Hajnoczi, nbd-general, kwolf, qemu-devel, pborzenkov,
	pbonzini, mpa, den

Hi Vladimir,

On Tue, Nov 29, 2016 at 03:41:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi,
> 
> 29.11.2016 13:50, Wouter Verhelst wrote:
> 
>     Hi,
> 
>     On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> 
>         However, I'm arguing that if we're going to provide information about
>         snapshots, we should be able to properly refer to these snapshots from
>         within an NBD context. My previous mail suggested adding a negotiation
>         message that would essentially ask the server "tell me about the
>         snapshots you know about", giving them an NBD identifier in the process
>         (accompanied by a "foreign" identifier that is decidedly *not* an NBD
>         identifier and that could be used to match the NBD identifier to
>         something implementation-defined). This would be read-only information;
>         the client cannot ask the server to create new snapshots. We can then
>         later in the protocol refer to these snapshots by way of that NBD
>         identifier.
> 
>     To make this a bit more concrete, I've changed the proposal like so:
> 
[...]
>     +##### Allocation contexts
>     +
>     +Allocation context 0 is the basic "exists at all" allocation context. If
>     +an extent is not allocated at allocation context 0, it MUST NOT be
>     +listed as allocated at another allocation context. This supports sparse
> 
> 
> 'allocated' here means a range with the NBD_STATE_HOLE bit unset?

Eh, yes. I should clarify that a bit further (no time right now though,
but patches are certainly welcome).

>     +file semantics on the server side. If a server has only one allocation
>     +context (the default), then writing to an extent which is allocated in
>     +that allocation context 0 MUST NOT fail with ENOSPC.
>     +
>     +For all other cases, this specification requires no specific semantics
>     +of allocation contexts. Implementations could support allocation
>     +contexts with semantics like the following:
>     +
>     +- Incremental snapshots; if a block is allocated in one allocation
>     +  context, that implies that it is also allocated in the next level up.
>     +- Various bits of data about the backend of the storage; e.g., if the
>     +  storage is written on a RAID array, an allocation context could
>     +  return information about the redundancy level of a given extent
>     +- If the backend implements a write-through cache of some sort, or
>     +  synchronises with other servers, an allocation context could state
>     +  that an extent is "allocated" once it has reached permanent storage
>     +  and/or is synchronized with other servers.
>     +
>     +The only requirement of an allocation context is that it MUST be
>     +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
>     +
>     +Likewise, the syntax of query strings is not specified by this document.
>     +
>     +Server implementations SHOULD document their syntax for query strings
>     +and semantics for resulting allocation contexts in a document like this
>     +one.
> 
> 
> IMHO, redundant paragraph for this spec.

The "SHOULD document" one? I want that there at any rate, just to make
clear that it's probably a good thing to have it (but no, it's not part
of the formal protocol spec).

>     +
>      ### Transmission phase
> 
>      #### Flag fields
>     @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake phase.
>         content chunk in reply.  MUST NOT be set unless the transmission
>         flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>         `EOVERFLOW` error chunk, if the request length is too large.
>     +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
>     +  set, the client is interested in only one extent per allocation
>     +  context.
> 
>      ##### Structured reply flags
> 
>     @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
>      ranges with their respective states.  This extension is not available
>      unless the client also negotiates the `STRUCTURED_REPLY` extension.
> 
>     -* `NBD_FLAG_SEND_BLOCK_STATUS`
>     -
>     -    The server SHOULD set this transmission flag to 1 if structured
>     -    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
>     -    request is supported.
>     -
>      * `NBD_REPLY_TYPE_BLOCK_STATUS`
> 
>     -    *length* MUST be a positive integer multiple of 8.  This reply
>     +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
>          represents a series of consecutive block descriptors where the sum
>          of the lengths of the descriptors MUST NOT be greater than the
>     -    length of the original request.  This chunk type MUST appear at most
>     -    once in a structured reply. Valid as a reply to
>     +    length of the original request. This chunk type MUST appear at most
>     +    once per allocation ID in a structured reply. Valid as a reply to
>          `NBD_CMD_BLOCK_STATUS`.
> 
>     -    The payload is structured as a list of one or more descriptors,
>     -    each with this layout:
>     +    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
>     +    allocation context ID, except if the semantics of particular
>     +    allocation contexts mean that the information for one allocation
>     +    context is implied by the information for another.
> 
> 
> So, actually, instead of selecting which bitmap we want to access with an
> nbd_cmd or with an external tool, we just reply with all bitmaps (negotiated
> at the beginning). Personally I dislike this. With such an approach, we will
> always have to export the allocation bitmap, even if we need only dirtiness.
> Consider that requesting the allocation bitmap may be much more expensive in
> time than requesting dirtiness.

Ah, yes, didn't consider that. I suppose more updates will be required,
then.

What would you suggest instead, then?

> Or is allocation bitmap information implied by dirtiness?
> 
> Furthermore, as allocation context semantics are defined externally, 'semantics
> mean that the information is implied' states nothing, and we actually return to
> approach #3, where an external tool decides which bitmap to export.
> 
> 
>     +
>     +    The payload starts with:
>     +
>     +        * 32 bits, allocation context ID
>     +
>     +    and is followed by a list of one or more descriptors, each with this
>     +    layout:
> 
>              * 32 bits, length (unsigned, MUST NOT be zero)
>              * 32 bits, status flags
> 
>     -    The definition of the status flags is determined based on the
>     -    flags present in the original request.
>     +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
>     +    then every reply chunk MUST NOT contain more than one descriptor.
>     +
>     +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
>     +    its request, the server MAY return fewer descriptors in the reply
>     +    than would be required to fully specify the whole range of requested
>     +    information to the client, if the number of descriptors would be
>     +    over 16 otherwise and looking up the information would be too
>     +    resource-intensive for the server.
> 
> 
> So, if there are <=16 extents, they all MUST be present in the reply (just a note)

That's the proposal, yes. I think it makes sense to have a minimum that
MUST be present (unless the client asked for REQ_ONE), although the
exact count can be different from 16, if need be.
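
For concreteness, parsing one such chunk would look roughly like this
(a sketch; big-endian, as everywhere in NBD):

import struct

def parse_block_status_chunk(payload):
    # u32 allocation context ID, then 8-byte (u32 length, u32 flags) descriptors
    if len(payload) < 12 or (len(payload) - 4) % 8:
        raise ValueError("length must be 4 + a positive multiple of 8")
    (context_id,) = struct.unpack_from(">I", payload, 0)
    descriptors = [struct.unpack_from(">II", payload, off)
                   for off in range(4, len(payload), 8)]
    return context_id, descriptors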

[...]
>     Thoughts?
> 
> For me considering dirtyness like one of allocation contexts sounds a bit
> weird, as dirtyness is not allocation.. But it is not so important.

I would certainly be willing to change the name, if that helps. The idea
is that you have various types of information that you can query the
server about. I called these "allocation contexts", but I'm certainly
not of the opinion that it *has* to be called that. Allocation is one of
the information types, but there can be more; my proposed spec
explicitly does not go into detail about the others.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 12:57 ` [Qemu-devel] " Alex Bligh
@ 2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
  2016-11-29 14:52     ` Alex Bligh
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
  1 sibling, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 14:36 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

29.11.2016 15:57, Alex Bligh wrote:
> Vladimir,
>
> I went back to April to reread the previous train of conversation,
> and found you had helpfully summarised some of it. Comments
> below.
>
> Rather than comment on many of the points individually: the root
> of my confusion, and to some extent discomfort, about this
> proposal is 'who owns the meaning of the bitmaps'.
>
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read-only disks, but the text talks about:
>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>
> I now think what you are talking about is backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.

Yes, something like this.

Note: in Qemu it may not be a snapshot (actually, I didn't see a way in
Qemu to export snapshots without switching to them, except opening an
external snapshot as a separate block dev), but just any read-only
drive, or temporary drive, used in the image fleecing scheme:

driveA is online normal drive
driveB is empty nbd export

- start backup driveA->driveB with sync=none, which means that the only
copying done is of old data from A to B before every write to A
- and set driveA as backing for driveB (on a read from B, if the data is
not allocated there, it will be read from A)

after that, driveB is something like a snapshot for backup through NBD.
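
Roughly, over QMP the setup looks like this (a sketch from memory;
command arguments vary between Qemu versions, and creating driveB with
driveA as its backing file is elided):

import json, socket

def qmp(sock, cmd, **args):
    sock.sendall(json.dumps({"execute": cmd, "arguments": args}).encode())
    return json.loads(sock.recv(65536))

s = socket.socket(socket.AF_UNIX)
s.connect("/var/run/qmp.sock")           # hypothetical socket path
s.recv(65536)                            # QMP greeting
qmp(s, "qmp_capabilities")
qmp(s, "blockdev-backup", device="driveA", target="driveB", sync="none")
qmp(s, "nbd-server-start",
    addr={"type": "inet", "data": {"host": "0.0.0.0", "port": "10809"}})
qmp(s, "nbd-server-add", device="driveB")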

It's all just to say: calling the backed-up point-in-time state just a
'snapshot' confuses me and I hope not to see this word in the spec (in
this context).

>
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap, which is outside
> the scope of this protocol; the protocol merely transmits that state.
>
> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'?
>
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
>
> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.

Good point. For Qcow2 we finally come to just bitmaps, not "dirty 
bitmaps", to make it more general. There is a problem with allocation if 
we want to make it a subcase of bitmap: allocation natively has two
bits per block: zero and allocated. We can of course separate this into
two bitmaps, but this will not be similar to the classic get_block_status.

> Some more nits:
>
>> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
>
>
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: server replies with status descriptors of any size, granularity
>>          is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities are not intersecting
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
>
>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>           pros: - simple
>>           cons: - it's a hack. flags field is for other uses.
>>                 - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and exploit this
>>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possible we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: external tool should select which bitmap to export. So, in case of Qemu
>>          it should be something like qmp command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the lesser evil)
>>          note: Hmm, external tool can choose between allocated/dirty data too,
>>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data to be sent across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
>
>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined and native
>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>          undefined data, called 'dirtyness'.
> I'm suggesting one generic 'read bitmap' command like you.

To support get_block_status in this general read_bitmap, we will need to 
define something like 'multibitmap', which allows several bits per 
chunk, as allocation data has two: zero and allocated.

>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.

the issue is: too many descriptors are possible. So, (1) solves it. (2) is
optional, just to simplify/optimize the client side.

>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if it finds unknown bits, client should consider a reply
>>    with unknown bits as an error)
> Yeah.
>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> MUST NOT is what we say elsewhere I believe.
>
>> +    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.

I agree.

>
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST NOT be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than requested. However the server MUST return at
>> +    least one status descriptor
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.

if the bitmap is 010101010101 we will have too many descriptors. For
example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of descriptor payload.
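
(The arithmetic, for the record: alternating bits give one 8-byte
descriptor per 64 KiB block.)

extents = (16 * 2**40) // (64 * 2**10)   # 268435456 descriptors
payload = extents * 8                     # 2147483648 bytes = 2 GiB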

>
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> Surely better would be an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
>

Agreed. Anyway, it is just a strong recommendation, not a requirement.


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 14:52     ` Alex Bligh
  2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Bligh @ 2016-11-29 14:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>   cons: Two commands with almost same semantic and similar means?
>>>   pros: However here is a good point of separating clearly defined and native
>>>         for block devices GET_BLOCK_STATUS from user-driven and actually
>>>         undefined data, called 'dirtyness'.
>> I'm suggesting one generic 'read bitmap' command like you.
> 
> To support get_block_status in this general read_bitmap, we will need to define something like 'multibitmap', which allows several bits per chunk, as allocation data has two: zero and allocated.

I think you are saying that for an arbitrary 'bitmap' there might be more than one state. For instance, one might (in an allocation 'bitmap') have a hole, a non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. Each extent has a status. For the bitmap to be useful, at least two statuses need to be possible, but the above would have three. This could be internally implemented by the server as (a) a bitmap (with two bits per entry), (b) two bitmaps (possibly with different granularity), (c) something else (e.g. reading file extents, then if the data is allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents (as you suggested) is a good idea, and there need not be an actual bitmap.
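
(I.e. collapsing the two allocation bits into the three statuses above,
a trivial sketch:)

def extent_status(hole, zero):
    if hole:
        return "hole"
    return "non-hole-zero" if zero else "non-hole-non-zero"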

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>   variants:
>>>   1: just allow server to restrict this as it wants (which was done in v3)
>>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>>      of descriptors.
>>>      2.1: add command flag, which will request only one descriptor
>>>           (otherwise, no restrictions from the client)
>>>      2.2: again, introduce extended nbd requests, and add field to
>>>           specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
> 
> the issue is: too many descriptors possible. So, (1) solves it. (2) is optional, just to simplify/optimize client side.

I think I'd prefer the server to return what it was asked for, and the client to deal with it. So either the client should be able to specify a maximum number of extents (and if we are extending the command structure, that's possible) or we deal with the client consuming and retrying unwanted extents. The reason for this is that it's unlikely the server can know a priori the number of extents which is the appropriate maximum for the client.

>>> +    The list of block status descriptors within the
>>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> +    of the file starting from specified *offset*, and the sum of the
>>> +    *length* fields of each descriptor MUST NOT be greater than the
>>> +    overall *length* of the request. This means that the server MAY
>>> +    return less data than requested. However the server MUST return at
>>> +    least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
> 
> if the bitmap is 010101010101 we will have too many descriptors. For example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of descriptor payload.

Yep. And the cost of consuming and retrying is quite high. One option would be for the client to realise this is a possibility, and not request the entire extent map for a 16TB disk, as it might be very large! Even if the client worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per reply), this isn't going to noticeably increase the round trip timing. One issue here is that to determine a 'reasonable' size, the client needs to know the minimum length of any extent.
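
A sketch of that windowed approach, with block_status() again standing
in for a hypothetical helper yielding (length, flags) pairs; at a
64 KiB minimum extent size, a 64 MiB window caps the reply at 1024
descriptors:

WINDOW = 64 * 2**20

def scan(conn, disk_size):
    offset = 0
    while offset < disk_size:
        length = min(WINDOW, disk_size - offset)
        for ext_len, flags in conn.block_status(offset, length):
            yield offset, ext_len, flags
            offset += ext_len
        # a short reply simply makes the next window start earlier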

I think the answer is probably a 'maximum number of extents' in the request packet.

Of course with statuses in extent, the final extent could be represented as 'I don't know, break this bit into a separate request' status.

-- 
Alex Bligh


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 14:52     ` Alex Bligh
@ 2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
  2016-11-29 15:17         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  0 siblings, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-11-29 15:07 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, qemu-devel, Kevin Wolf, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Wouter Verhelst, Eric Blake, mpa

29.11.2016 17:52, Alex Bligh wrote:
> Vladimir,
>
>>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>>    cons: Two commands with almost same semantic and similar means?
>>>>    pros: However here is a good point of separating clearly defined and native
>>>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>>>          undefined data, called 'dirtyness'.
>>> I'm suggesting one generic 'read bitmap' command like you.
>> To support get_block_status in this general read_bitmap, we will need to define something like 'multibitmap', which allows several bits per chunk, as allocation data has two: zero and allocated.
> I think you are saying that for an arbitrary 'bitmap' there might be more than one state. For instance, one might (in an allocation 'bitmap') have a hole, a non-hole-zero, or a non-hole-non-zero.
>
> In the spec I'd suggest, for one 'bitmap', we represent the output as extents. Each extent has a status. For the bitmap to be useful, at least two statuses need to be possible, but the above would have three. This could be internally implemented by the server as (a) a bitmap (with two bits per entry), (b) two bitmaps (possibly with different granularity), (c) something else (e.g. reading file extents, then if the data is allocated manually comparing it against zero).
>
> I should have put 'bitmap' in quotes in what I wrote because returning extents (as you suggested) is a good idea, and there need not be an actual bitmap.
>
>>>> 5. Number of status descriptors, sent by server, should be restricted
>>>>    variants:
>>>>    1: just allow server to restrict this as it wants (which was done in v3)
>>>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>>>       of descriptors.
>>>>       2.1: add command flag, which will request only one descriptor
>>>>            (otherwise, no restrictions from the client)
>>>>       2.2: again, introduce extended nbd requests, and add field to
>>>>            specify this maximum
>>> I think some form of extended request is the way to go, but out of
>>> interest, what's the issue with as many descriptors being sent as it
>>> takes to encode the reply? The client can just consume the remainder
>>> (without buffering) and reissue the request at a later point for
>>> the areas it discarded.
>> the issue is: too many descriptors are possible. So, (1) solves it. (2) is optional, just to simplify/optimize the client side.
> I think I'd prefer the server to return what it was asked for, and the client to deal with it. So either the client should be able to specify a maximum number of extents (and if we are extending the command structure, that's possible) or we deal with the client consuming and retrying unwanted extents. The reason for this is that it's unlikely the server can know a priori the number of extents which is the appropriate maximum for the client.
>
>>>> +    The list of block status descriptors within the
>>>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>>> +    of the file starting from specified *offset*, and the sum of the
>>>> +    *length* fields of each descriptor MUST NOT be greater than the
>>>> +    overall *length* of the request. This means that the server MAY
>>>> +    return less data than requested. However the server MUST return at
>>>> +    least one status descriptor
>>> I'm not sure I understand why that's useful. What should the client
>>> infer from the server refusing to provide information? We don't
>>> permit short reads etc.
>> if the bitmap is 010101010101 we will have too many descriptors. For example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of descriptor payload.
> Yep. And the cost of consuming and retrying is quite high. One option would be for the client to realise this is a possibility, and not request the entire extent map for a 16TB disk, as it might be very large! Even if the client worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per reply), this isn't going to noticeably increase the round trip timing. One issue here is that to determine a 'reasonable' size, the client needs to know the minimum length of any extent.

and with this approach we will in turn have the overhead of too many
requests for 00000000 or 11111111 bitmaps.

>
> I think the answer is probably a 'maximum number of extents' in the request packet.
>
> Of course with statuses in extent, the final extent could be represented as 'I don't know, break this bit into a separate request' status.
>

With such a predefined status, we can postpone creating extended requests,
have the number of extents restricted by the server, and have the sum of
extent lengths be equal to the request length.


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
@ 2016-11-29 15:17         ` Wouter Verhelst
  0 siblings, 0 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-11-29 15:17 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, nbd-general, Kevin Wolf, qemu-devel, mpa,
	Pavel Borzenkov, Stefan Hajnoczi, Denis V. Lunev,
	Paolo Bonzini

On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 29.11.2016 17:52, Alex Bligh wrote:
> > Vladimir,
> >> if the bitmap is 010101010101 we will have too many descriptors.
> >> For example, a 16 TiB disk at 64 KiB granularity -> 2 GiB of
> >> descriptor payload.
> > Yep. And the cost of consuming and retrying is quite high. One
> > option would be for the client to realise this is a possibility, and
> > not request the entire extent map for a 16TB disk, as it might be
> > very large! Even if the client worked at e.g. a 64MB level (where
> > they'd get a maximum of 1024 extents per reply), this isn't going to
> > noticeably increase the round trip timing. One issue here is that to
> > determine a 'reasonable' size, the client needs to know the minimum
> > length of any extent.
> 
> and with this approach we will in turn have overhead of too many 
> requests for 00000000 or 11111111 bitmaps.

This is why my proposal suggests the server may stop sending extents
after X extents (sixteen, but that number can certainly change) have
been sent. After all, the server will have a better view on what's going
to be costly in terms of "number of extents".

> > I think the answer is probably a 'maximum number of extents' in the request packet.
> >
> > Of course with statuses in extent, the final extent could be
> > represented as 'I don't know, break this bit into a separate
> > request' status.
> >
> 
> With such predefined status, we can postpone creating extended requests, 
> have number of extents restricted by server and have sum of extents 
> lengths be equal to request length.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
                     ` (2 preceding siblings ...)
  2016-11-29 10:18   ` Kevin Wolf
@ 2016-11-30 10:41   ` Sergey Talantov
  3 siblings, 0 replies; 37+ messages in thread
From: Sergey Talantov @ 2016-11-30 10:41 UTC (permalink / raw)
  To: Wouter Verhelst, Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, stefanha, pbonzini, mpa, den

Hi, Wouter!

> Actually, come to think of that. What is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the
> block status thingies, right?

Overall, the most desired use case for this NBD extension is to allow third-party software to make incremental backups.
Acronis (a vendor of backup solutions) would support qemu backup if block status is provided.


-----Original Message-----
From: Wouter Verhelst [mailto:w@uter.be] 
Sent: Sunday, November 27, 2016 22:17
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: nbd-general@lists.sourceforge.net; kwolf@redhat.com; qemu-devel@nongnu.org; pborzenkov@virtuozzo.com; stefanha@redhat.com; pbonzini@redhat.com; mpa@pengutronix.de; den@openvz.org
Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if they're useful, but I would prefer that they're also useful with NBD on its own, i.e., without requiring an initiation or correlation of some state through another protocol or network connection or whatever. If that's needed, that feels like I didn't do my job properly, if you get my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed to 
> query status of a particular range and read only those blocks of data 
> that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS extension 
> with one new NBD_CMD_BLOCK_STATUS command, a new structured reply 
> chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required during, 
> for example, incremental block device backup. To express this concept 
> via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS 
> to request dirtiness information rather than provisioning information; 
> however, with the current proposal, data dirtiness is only useful with 
> additional coordination outside of the NBD protocol (such as a way to 
> start and stop the server from tracking dirty sectors).  Future NBD 
> extensions may add commands to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET 
> LBA STATUS" command more closely, it has been chosen to return a list 
> of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of 
> a bitmap.
> 
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake); the only change
> is removing the restriction that the sum of status descriptor lengths
> must be equal to the requested length. I.e., let's permit the server to
> reply with less data than required if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.

> And, finally, I've rebased this onto current state of 
> extension-structured-reply branch (which itself should be rebased on 
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the discussion, started about
> half a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>    A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state. This is true for NBD in general (i.e., concurrent read or write commands from other clients may confuse file systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: server replies with status descriptors of any size, granularity
>          is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities are not intersecting

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>    A: several variants:
>       1: id of bitmap is in flags field of request
>           pros: - simple
>           cons: - it's a hack. flags field is for other uses.
>                 - we'll have to map bitmap names to these "ids"
>       2: introduce extended nbd requests with variable length and exploit this
>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>          pros: - looks like a true way
>          cons: - we have to create additional extension
>                - possible we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: external tool should select which bitmap to export. So, in case of Qemu
>          it should be something like qmp command block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - additional qmp command to implement (possibly, the lesser evil)
>          note: Hmm, external tool can choose between allocated/dirty data too,
>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Downside of 3, though, is that it moves the definition of what the different states mean outside of the NBD protocol (i.e., the protocol messages are not entirely defined anymore, and their meaning depends on the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means *by default*, but then we can add a note that clients and servers can possibly define other meanings out of band if they want to.

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>    cons: Two commands with almost same semantic and similar means?
>    pros: However here is a good point of separating clearly defined and native
>          for block devices GET_BLOCK_STATUS from user-driven and actually
>          undefined data, called 'dirtyness'.

Yeah, having them separate commands might be a bad idea indeed.

> 5. The number of status descriptors sent by the server should be restricted
>    variants:
>    1: just allow the server to restrict this as it wants (which was done in v3)
>    2: (not excluding 1). The client somehow specifies the maximum number
>       of descriptors.
>       2.1: add a command flag which will request only one descriptor
>            (otherwise, no restrictions from the client)
>       2.2: again, introduce extended nbd requests, and add a field to
>            specify this maximum

I think having a flag which requests just one descriptor can be useful, but I'm hesitant to add it unless it's actually going to be used; so in other words, I'll leave the decision on that bit to you.
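
To make the "server may reply with fewer descriptors" behaviour concrete, here is a minimal client-side sketch, in C, of consuming one NBD_REPLY_TYPE_BLOCK_STATUS chunk under the draft layout (a 32-bit context ID followed by {32-bit length, 32-bit status} descriptors). read_full() and parse_block_status_chunk() are invented helpers, not part of any existing library:

#include <arpa/inet.h>   /* ntohl() */
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Read exactly 'len' bytes from fd, or fail. */
static bool read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return false;
        p += n;
        len -= (size_t)n;
    }
    return true;
}

/*
 * Consume one NBD_REPLY_TYPE_BLOCK_STATUS chunk of 'payload_len' bytes.
 * Sets *covered to the number of bytes the descriptors account for; as
 * the v3 draft lets the server cover less than the requested length,
 * the caller reissues the query at (offset + *covered) if it needs the
 * rest.
 */
static bool parse_block_status_chunk(int fd, uint32_t payload_len,
                                     uint64_t requested_len,
                                     uint64_t *covered)
{
    uint32_t context_id;

    /* context ID plus at least one 8-byte descriptor */
    if (payload_len < 12 || (payload_len - 4) % 8 != 0)
        return false;
    if (!read_full(fd, &context_id, sizeof(context_id)))
        return false;
    context_id = ntohl(context_id);
    (void)context_id;            /* per-context dispatch would go here */

    *covered = 0;
    for (uint32_t left = payload_len - 4; left != 0; left -= 8) {
        uint32_t desc[2];        /* { length, status }, big-endian */
        if (!read_full(fd, desc, sizeof(desc)))
            return false;
        *covered += ntohl(desc[0]);
        /* hand (ntohl(desc[0]), ntohl(desc[1])) to the consumer here */
    }
    return *covered <= requested_len;
}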

> 6. Q: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (The server should
>    return EINVAL if it finds unknown bits; the client should consider a reply
>    with unknown bits an error.)

Right, probably best to do that, yes.
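
For illustration, the server side of that rule is tiny; a sketch, assuming the single flag this draft defines and an errno-style return (the names are mine, not from any implementation):

#include <errno.h>
#include <stdint.h>

#define NBD_CMD_FLAG_REQ_ONE  (1 << 3)   /* the only flag defined here */

/* Returns 0, or -EINVAL when the client set a reserved bit. */
static int validate_block_status_flags(uint16_t flags)
{
    if (flags & ~(uint16_t)NBD_CMD_FLAG_REQ_ONE)
        return -EINVAL;
    return 0;
}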

> ======
> 
> Also, an idea on 2-4:
> 
>     As we say, that dirtiness is unknown for NBD, and external tool
>     should specify, manage and understand, which data is actually
>     transmitted, why not just call it user_data and leave status field
>     of reply chunk unspecified in this case?
> 
>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>     leave the status field to some external... protocol? Who knows what
>     this user data is.

Yes, this sounds like a reasonable approach.

>     Note: I'm not sure that I like this (my own) proposal. It's just an
>     idea; maybe someone likes it. And I think it represents more honestly
>     what we are trying to do.

Indeed.

>     Note2: the next step of generalization will be NBD_CMD_USER, with
>     variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
> 
>     Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the "dirty" thing seems to want to be. Whether something is allocated is just a special case of that.

Actually, come to think of it: what is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the block status thingies, right? If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the definition of what a snapshot is back into the NBD protocol, without having to say "this could be anything", and without requiring connectivity over two ports for this to be useful (e.g., you could store the machine-readable form of the snapshot description into your backup program and match what they mean with what you're interested in at restore time, etc).

This wouldn't work if you're interested in new snapshots that get created once we've already moved into transmission, but hey.
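
Purely to make the shape concrete, here is one possible layout for a single snapshot entry in the reply; it is entirely invented, since the description format is deliberately left open above:

#include <stdint.h>

/*
 * Hypothetical reply entry for NBD_OPT_GET_SNAPSHOTS; nothing here is
 * in any draft, it only pins down the idea sketched above.
 */
struct nbd_rep_snapshot {
    uint32_t snapshot_id;   /* later referenced by NBD_CMD_BLOCK_STATUS */
    uint32_t desc_len;      /* length of the description that follows */
    /* followed by desc_len bytes of machine-readable description,
     * not defined by NBD (e.g., a qemu json blob) */
};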

Thoughts?

>     Full backup: just do a structured read - it will show us which chunks
>     may be treated as zeroes.

Right.

>     Incremental backup: get the dirty bitmap (somehow, for example through
>     the user-defined part of the proposed command), then, for dirty blocks,
>     read them through structured read, so information about zero/unallocated
>     areas is included.
> 
> For me all the variants above are OK. Let's finally choose something.
> 
> v2:
> v1 was: 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
> 
> Since then, we've added the STRUCTURED_REPLY extension, which 
> necessitates a rather larger rebase; I've also changed things to 
> rename the command 'NBD_CMD_BLOCK_STATUS', changed the request modes 
> to be determined by boolean flags (rather than by fixed values of the 
> 16-bit flags field), changed the reply status fields to be bitwise-or 
> values (with a default of 0 always being sane), and changed the 
> descriptor layout to drop an offset but to include a 32-bit status so 
> that the descriptor is nicely 8-byte aligned without padding.
> 
>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 154 insertions(+), 1 deletion(-)

[...]

I'll commit this in a minute into a separate branch called "extension-blockstatus", on the understanding that changes are still required, as per the above (i.e., don't assume that just because there's a branch, I'm happy with the current result ;-)

Regards

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 10:50       ` Wouter Verhelst
  2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
  2016-11-29 13:07         ` Alex Bligh
@ 2016-12-01 10:14         ` Wouter Verhelst
  2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-01 10:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: nbd-general, kwolf, Vladimir Sementsov-Ogievskiy, qemu-devel,
	pborzenkov, pbonzini, mpa, den

Here's another update.

Changes since previous version:
- Rename "allocation context" to "metadata context"
- Stop making metadata context 0 be special; instead, name it
  "BASE:allocation" and allow it to be selected like all other contexts.
- Clarify in a bit more detail when a server MAY omit metadata
  information on a particular metadata context (i.e., only if another
  metadata context that we actually got information for implies it can't
  have meaning). This was always meant that way, but the spec could have
  been a bit more explicit about it.
- Change one SHOULD to a MUST, where it should not have been a SHOULD
  in the first place.

(This applies on top of my previous patch)

Applied and visible at the usual place.

diff --git a/doc/proto.md b/doc/proto.md
index fe7ae53..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -869,16 +869,16 @@ of the newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_OPT_ALLOC_CONTEXT` (10)
+- `NBD_OPT_META_CONTEXT` (10)
 
-    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
     followed by an `NBD_REP_ACK`. If a server replies to such a request
     with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
     commands during the transmission phase.
 
     If the query string is syntactically invalid, the server SHOULD send
     `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
-    but finds no allocation contexts, the server MUST send a single
+    but finds no metadata contexts, the server MUST send a single
     reply of type `NBD_REP_ACK`.
 
     This option MUST NOT be requested unless structured replies have
@@ -887,9 +887,9 @@ of the newstyle negotiation.
 
     Data:
     - 32 bits, type
-    - String, query to select a subset of the available allocation
+    - String, query to select a subset of the available metadata
       contexts. If this is not specified (i.e., length is 4 and no
-      command is sent), then the server MUST send all the allocation
+      command is sent), then the server MUST send all the metadata
       contexts it knows about. If specified, this query string MUST
       start with a name that uniquely identifies a server
       implementation; e.g., the reference implementation that
@@ -897,18 +897,22 @@ of the newstyle negotiation.
       with 'nbd-server:'
 
     The type may be one of:
-    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
       selected by the query string is returned to the client without
-      changing any state (i.e., this does not add allocation contexts
+      changing any state (i.e., this does not add metadata contexts
       for further usage).
-    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
       selected by the query string is added to the list of existing
-      allocation contexts.
-    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+      metadata contexts.
+    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
       selected by the query string is removed from the list of used
-      allocation contexts. Servers SHOULD NOT reuse existing allocation
+      metadata contexts. Servers SHOULD NOT reuse existing metadata
       context IDs.
 
+    The syntax of the query string is not specified, except that
+    implementations MUST support adding and removing individual metadata
+    contexts by simply listing their names.
+
 #### Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
     information is available, or when sending data related to the option
     (in the case of `NBD_OPT_LIST`) has finished. No data.
 
-* `NBD_REP_SERVER` (2)
+- `NBD_REP_SERVER` (2)
 
     A description of an export. Data:
 
@@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.
       particular client request, this field is defined to be a string
       suitable for direct display to a human being.
 
-* `NBD_REP_INFO` (3)
+- `NBD_REP_INFO` (3)
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_REP_ALLOC_CONTEXT` (4)
+- `NBD_REP_META_CONTEXT` (4)
 
-    A description of an allocation context. Data:
+    A description of a metadata context. Data:
 
-    - 32 bits, NBD allocation context ID. If the request was NOT of type
-      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
-    - String, name of the allocation context. This is not required to be
+    - 32 bits, NBD metadata context ID.
+    - String, name of the metadata context. This is not required to be
       a human-readable string, but it MUST be valid UTF-8 data.
 
-    Allocation context ID 0 is implied, and always exists. It cannot be
-    removed.
+    This specification declares one metadata context. It is called
+    "BASE:allocation" and contains the basic "exists at all" context.
 
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
@@ -997,36 +1000,37 @@ case that data is an error message string suitable for display to the user.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-##### Allocation contexts
+##### Metadata contexts
 
-Allocation context 0 is the basic "exists at all" allocation context. If
-an extent is not allocated at allocation context 0, it MUST NOT be
-listed as allocated at another allocation context. This supports sparse
-file semantics on the server side. If a server has only one allocation
-context (the default), then writing to an extent which is allocated in
-that allocation context 0 MUST NOT fail with ENOSPC.
+The "BASE:allocation" metadata context is the basic "exists at all"
+metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
+context, this means that the given extent is not allocated in the
+backend storage, and that writing to the extent MAY result in the ENOSPC
+error. This supports sparse file semantics on the server side. If a
+server has only one metadata context (the default), then writing to an
+extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
 
 For all other cases, this specification requires no specific semantics
-of allocation contexts. Implementations could support allocation
+of metadata contexts. Implementations could support metadata
 contexts with semantics like the following:
 
-- Incremental snapshots; if a block is allocated in one allocation
+- Incremental snapshots; if a block is allocated in one metadata
   context, that implies that it is also allocated in the next level up.
 - Various bits of data about the backend of the storage; e.g., if the
-  storage is written on a RAID array, an allocation context could
+  storage is written on a RAID array, a metadata context could
   return information about the redundancy level of a given extent
 - If the backend implements a write-through cache of some sort, or
-  synchronises with other servers, an allocation context could state
-  that an extent is "allocated" once it has reached permanent storage
+  synchronises with other servers, a metadata context could state
+  that an extent is "active" once it has reached permanent storage
   and/or is synchronized with other servers.
 
-The only requirement of an allocation context is that it MUST be
+The only requirement of a metadata context is that it MUST be
 representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
 
 Likewise, the syntax of query strings is not specified by this document.
 
 Server implementations SHOULD document their syntax for query strings
-and semantics for resulting allocation contexts in a document like this
+and semantics for resulting metadata contexts in a document like this
 one.
 
 ### Transmission phase
@@ -1066,7 +1070,7 @@ valid may depend on negotiation during the handshake phase.
    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
    `EOVERFLOW` error chunk, if the request length is too large.
 - bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
-  set, the client is interested in only one extent per allocation
+  set, the client is interested in only one extent per metadata
   context.
 
 ##### Structured reply flags
@@ -1462,17 +1466,22 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
     represents a series of consecutive block descriptors where the sum
     of the lengths of the descriptors MUST not be greater than the
     length of the original request. This chunk type MUST appear at most
-    once per allocation ID in a structured reply. Valid as a reply to
+    once per metadata ID in a structured reply. Valid as a reply to
     `NBD_CMD_BLOCK_STATUS`.
 
     Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
-    allocation context ID, except if the semantics of particular
-    allocation contexts mean that the information for one allocation
-    context is implied by the information for another.
+    metadata context ID, except if the semantics of particular
+    metadata contexts mean that the information for one active metadata
+    context is implied by the information for another; e.g., if a
+    particular metadata context can only have meaning for extents where
+    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
+    context, servers MAY omit the relevant chunks for that context if
+    they already sent an extent with the `NBD_STATE_HOLE` flag set in
+    reply to the same `NBD_CMD_BLOCK_STATUS` command.
 
     The payload starts with:
 
-        * 32 bits, allocation context ID
+        * 32 bits, metadata context ID
 
     and is followed by a list of one or more descriptors, each with this
     layout:
@@ -1493,7 +1502,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
 * `NBD_CMD_BLOCK_STATUS`
 
     A block status query request. Length and offset define the range of
-    interest. Clients SHOULD NOT use this request unless allocation
+    interest. Clients MUST NOT use this request unless metadata
     contexts have been negotiated, which in turn requires the client to
     first negotiate structured replies. For a successful return, the
     server MUST use a structured reply, containing at least one chunk of
@@ -1533,7 +1542,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
         or the server could not otherwise determine its status.  Note
         that the use of `NBD_CMD_TRIM` is related to this status, but
         that the server MAY report a hole even where trim has not been
-        requested, and also that a server MAY report allocation even
+        requested, and also that a server MAY report metadata even
         where a trim has been requested.
       - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
         all zeroes; if clear, the block contents are not known.  Note
@@ -1547,25 +1556,25 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
         been written; if clear, the block represents a portion of the
         file that is dirty, or where the server could not otherwise
         determine its status. The server MUST NOT set this bit for
-        allocation context 0, where it has no meaning.
+        the "BASE:allocation" context, where it has no meaning.
       - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
-	portion of the file that is "active" in the given allocation
-	context. The server MUST NOT set this bit for allocation context
-	0, where it has no meaning.
+        portion of the file that is "active" in the given metadata
+        context. The server MUST NOT set this bit for the
+        "BASE:allocation" context, where it has no meaning.
 
     The exact semantics of what it means for a block to be "clean" or
-    "active" at a given allocation context is not defined by this
+    "active" at a given metadata context is not defined by this
     specification, except that the default in both cases should be to
-    clear the bit. That is, when the allocation context does not have
+    clear the bit. That is, when the metadata context does not have
     knowledge of the relevant status for the given extent, or when the
-    allocation context does not assign any meaning to it, the bits
+    metadata context does not assign any meaning to it, the bits
     should be cleared.
 
     A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
     set and `NBD_STATE_ZERO` clear.
 
 A client MAY close the connection if it detects that the server has
-sent an invalid chunks (such as lengths in the
+sent an invalid chunk (such as lengths in the
 `NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
 The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
 request including one or more sectors beyond the size of the device.
@@ -1574,7 +1583,7 @@ The extension adds the following new command flag:
 
 - `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
   SHOULD be set to 1 if the client wants to request information for only
-  one extent per allocation context.
+  one extent per metadata context.
 
 ## About this file
 

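As an aside, the NBD_OPT_META_CONTEXT payload above is simple enough to pin down in a few lines. A hedged sketch: the constants mirror this draft, and build_meta_context_payload() is a local helper, not an existing API.

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define NBD_META_LIST_CONTEXT 1
#define NBD_META_ADD_CONTEXT  2
#define NBD_META_DEL_CONTEXT  3

/*
 * Fill 'buf' (at least 4 + strlen(query) bytes) with an
 * NBD_OPT_META_CONTEXT payload: a 32-bit type, then the query string.
 * Passing query == NULL gives the length-4 payload that asks the
 * server for every metadata context it knows about.
 */
static size_t build_meta_context_payload(uint8_t *buf, uint32_t type,
                                         const char *query)
{
    uint32_t type_be = htonl(type);
    size_t qlen = query ? strlen(query) : 0;

    memcpy(buf, &type_be, sizeof(type_be));
    if (qlen)
        memcpy(buf + 4, query, qlen);   /* no NUL terminator on the wire */
    return 4 + qlen;
}

Selecting the base context is then just build_meta_context_payload(buf, NBD_META_ADD_CONTEXT, "BASE:allocation"), relying on the rule above that listing a context by its plain name must work.
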
-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 10:14         ` Wouter Verhelst
@ 2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
  2016-12-02  9:25             ` Wouter Verhelst
  0 siblings, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-01 11:26 UTC (permalink / raw)
  To: Wouter Verhelst, Stefan Hajnoczi
  Cc: nbd-general, kwolf, qemu-devel, pborzenkov, pbonzini, mpa, den

01.12.2016 13:14, Wouter Verhelst wrote:
> Here's another update.
>
> Changes since previous version:
> - Rename "allocation context" to "metadata context"
> - Stop making metadata context 0 be special; instead, name it
>    "BASE:allocation" and allow it to be selected like all other contexts.
> - Clarify in a bit more detail when a server MAY omit metadata
>    information on a particular metadata context (i.e., only if another
>    metadata context that we actually got information for implies it can't
>    have meaning). This was always meant that way, but the spec could have
>    been a bit more explicit about it.
> - Change one SHOULD to a MUST, where it should not have been a SHOULD
>    in the first place.
>
> (This applies on top of my previous patch)
>
> Applied and visible at the usual place.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index fe7ae53..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -869,16 +869,16 @@ of the newstyle negotiation.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -- `NBD_OPT_ALLOC_CONTEXT` (10)
> +- `NBD_OPT_META_CONTEXT` (10)
>   
> -    Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>       followed by an `NBD_REP_ACK`. If a server replies to such a request
>       with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
>       commands during the transmission phase.
>   
>       If the query string is syntactically invalid, the server SHOULD send
>       `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> -    but finds no allocation contexts, the server MUST send a single
> +    but finds no metadata contexts, the server MUST send a single
>       reply of type `NBD_REP_ACK`.
>   
>       This option MUST NOT be requested unless structured replies have
> @@ -887,9 +887,9 @@ of the newstyle negotiation.
>   
>       Data:
>       - 32 bits, type
> -    - String, query to select a subset of the available allocation
> +    - String, query to select a subset of the available metadata
>         contexts. If this is not specified (i.e., length is 4 and no
> -      command is sent), then the server MUST send all the allocation
> +      command is sent), then the server MUST send all the metadata
>         contexts it knows about. If specified, this query string MUST
>         start with a name that uniquely identifies a server
>         implementation; e.g., the reference implementation that
> @@ -897,18 +897,22 @@ of the newstyle negotiation.
>         with 'nbd-server:'
>   
>       The type may be one of:
> -    - `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
> +    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
>         selected by the query string is returned to the client without
> -      changing any state (i.e., this does not add allocation contexts
> +      changing any state (i.e., this does not add metadata contexts
>         for further usage).
> -    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> +    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
>         selected by the query string is added to the list of existing

If I understand correctly, it should not be 'existing' but 'exporting'.
There are several contexts the server knows about; they definitely exist.
Some of them may be selected (by the client) for export (to the client,
through get_block_status).

So, what about 'list of metadata contexts to export', or something like this?

> -      allocation contexts.
> -    - `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
> +      metadata contexts.
> +    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
>         selected by the query string is removed from the list of used
> -      allocation contexts. Servers SHOULD NOT reuse existing allocation
> +      metadata contexts. Servers SHOULD NOT reuse existing metadata
>         context IDs.
>   
> +    The syntax of the query string is not specified, except that
> +    implementations MUST support adding and removing individual metadata
> +    contexts by simply listing their names.
> +
>   #### Option reply types
>   
>   These values are used in the "reply type" field, sent by the server
> @@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
>       information is available, or when sending data related to the option
>       (in the case of `NBD_OPT_LIST`) has finished. No data.
>   
> -* `NBD_REP_SERVER` (2)
> +- `NBD_REP_SERVER` (2)
>   
>       A description of an export. Data:
>   
> @@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.
>         particular client request, this field is defined to be a string
>         suitable for direct display to a human being.
>   
> -* `NBD_REP_INFO` (3)
> +- `NBD_REP_INFO` (3)
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -- `NBD_REP_ALLOC_CONTEXT` (4)
> +- `NBD_REP_META_CONTEXT` (4)
>   
> -    A description of an allocation context. Data:
> +    A description of a metadata context. Data:
>   
> -    - 32 bits, NBD allocation context ID. If the request was NOT of type
> -      `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
> -    - String, name of the allocation context. This is not required to be
> +    - 32 bits, NBD metadata context ID.
> +    - String, name of the metadata context. This is not required to be
>         a human-readable string, but it MUST be valid UTF-8 data.
>   
> -    Allocation context ID 0 is implied, and always exists. It cannot be
> -    removed.
> +    This specification declares one metadata context. It is called
> +    "BASE:allocation" and contains the basic "exists at all" context.
>   
>   There are a number of error reply types, all of which are denoted by
>   having bit 31 set. All error replies MAY have some data set, in which
> @@ -997,36 +1000,37 @@ case that data is an error message string suitable for display to the user.
>   
>       Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -##### Allocation contexts
> +##### Metadata contexts
>   
> -Allocation context 0 is the basic "exists at all" allocation context. If
> -an extent is not allocated at allocation context 0, it MUST NOT be
> -listed as allocated at another allocation context. This supports sparse
> -file semantics on the server side. If a server has only one allocation
> -context (the default), then writing to an extent which is allocated in
> -that allocation context 0 MUST NOT fail with ENOSPC.
> +The "BASE:allocation" metadata context is the basic "exists at all"
> +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> +context, this means that the given extent is not allocated in the
> +backend storage, and that writing to the extent MAY result in the ENOSPC
> +error. This supports sparse file semantics on the server side. If a
> +server has only one metadata context (the default), then writing to an
> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.

this dependence looks strange. user defined metadata, why it affects 
allocation? At least, 'only one' is not descriptive, it would be better 
to mention 'BASE:allocation' name. (I hope, I can ask server to export 
only dirty_bitmap context, not exporting allocation? In that case this 
'only one' would be dirty_bitmap)

>   
>   For all other cases, this specification requires no specific semantics

What are the 'other cases'? All other metadata contexts, or all cases
where we have more than one context?

> -of allocation contexts. Implementations could support allocation
> +of metadata contexts. Implementations could support metadata
>   contexts with semantics like the following:
>   
> -- Incremental snapshots; if a block is allocated in one allocation
> +- Incremental snapshots; if a block is allocated in one metadata
>     context, that implies that it is also allocated in the next level up.
>   - Various bits of data about the backend of the storage; e.g., if the
> -  storage is written on a RAID array, an allocation context could
> +  storage is written on a RAID array, a metadata context could
>     return information about the redundancy level of a given extent
>   - If the backend implements a write-through cache of some sort, or
> -  synchronises with other servers, an allocation context could state
> -  that an extent is "allocated" once it has reached permanent storage
> +  synchronises with other servers, a metadata context could state
> +  that an extent is "active" once it has reached permanent storage
>     and/or is synchronized with other servers.

'Incremental snapshots' sounds strange to me. Snapshots are just
snapshots; a backup may be incremental, but that is not about snapshots. I
think this example can safely be deleted from the spec.

>   
> -The only requirement of an allocation context is that it MUST be
> +The only requirement of a metadata context is that it MUST be
>   representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
>   
>   Likewise, the syntax of query strings is not specified by this document.
>   
>   Server implementations SHOULD document their syntax for query strings
> -and semantics for resulting allocation contexts in a document like this
> +and semantics for resulting metadata contexts in a document like this
>   one.
>   
>   ### Transmission phase
> @@ -1066,7 +1070,7 @@ valid may depend on negotiation during the handshake phase.
>      flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>      `EOVERFLOW` error chunk, if the request length is too large.
>   - bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> -  set, the client is interested in only one extent per allocation
> +  set, the client is interested in only one extent per metadata
>     context.
>   
>   ##### Structured reply flags
> @@ -1462,17 +1466,22 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>       represents a series of consecutive block descriptors where the sum
>       of the lengths of the descriptors MUST not be greater than the
>       length of the original request. This chunk type MUST appear at most
> -    once per allocation ID in a structured reply. Valid as a reply to
> +    once per metadata ID in a structured reply. Valid as a reply to
>       `NBD_CMD_BLOCK_STATUS`.
>   
>       Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> -    allocation context ID, except if the semantics of particular
> -    allocation contexts mean that the information for one allocation
> -    context is implied by the information for another.
> +    metadata context ID, except if the semantics of particular
> +    metadata contexts mean that the information for one active metadata
> +    context is implied by the information for another; e.g., if a
> +    particular metadata context can only have meaning for extents where
> +    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
> +    context, servers MAY omit the relevant chunks for that context if
> +    they already sent an extent with the `NBD_STATE_HOLE` flag set in
> +    reply to the same `NBD_CMD_BLOCK_STATUS` command.

Hmm, stop. Are you saying that the server may omit some status
descriptors for some context? But how? We have no 'offset' field in the
status descriptor. Or can we omit a _context_ if _all_ descriptors of
the allocation context are holes?

Anyway, I'm still against this paragraph. These 8 lines actually say "the
server may omit contexts if it wants". I can always explain that the
semantics of some context imply the metadata (indeed, I can simply
introduce such semantics), but that is no different from "I want to omit
this". In this spec there is no way to negotiate context semantics, so the
client does not actually know them. We just hope that both client and
server are managed by some layer which knows what it is all about.


Also: if this is all about "metadata", the name NBD_CMD_BLOCK_STATUS
becomes not very descriptive... Maybe we should move to something like
just NBD_CMD_GET_METADATA )

>   
>       The payload starts with:
>   
> -        * 32 bits, allocation context ID
> +        * 32 bits, metadata context ID
>   
>       and is followed by a list of one or more descriptors, each with this
>       layout:
> @@ -1493,7 +1502,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>   * `NBD_CMD_BLOCK_STATUS`
>   
>       A block status query request. Length and offset define the range of
> -    interest. Clients SHOULD NOT use this request unless allocation
> +    interest. Clients MUST NOT use this request unless metadata
>       contexts have been negotiated, which in turn requires the client to
>       first negotiate structured replies. For a successful return, the
>       server MUST use a structured reply, containing at least one chunk of
> @@ -1533,7 +1542,7 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>           or the server could not otherwise determine its status.  Note
>           that the use of `NBD_CMD_TRIM` is related to this status, but
>           that the server MAY report a hole even where trim has not been
> -        requested, and also that a server MAY report allocation even
> +        requested, and also that a server MAY report metadata even
>           where a trim has been requested.
>         - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
>           all zeroes; if clear, the block contents are not known.  Note
> @@ -1547,25 +1556,25 @@ unless the client also negotiates the `STRUCTURED_REPLY` extension.
>           been written; if clear, the block represents a portion of the
>           file that is dirty, or where the server could not otherwise
>           determine its status. The server MUST NOT set this bit for
> -        allocation context 0, where it has no meaning.
> +        the "BASE:allocation" context, where it has no meaning.
>         - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
> -	portion of the file that is "active" in the given allocation
> -	context. The server MUST NOT set this bit for allocation context
> -	0, where it has no meaning.
> +        portion of the file that is "active" in the given metadata
> +        context. The server MUST NOT set this bit for the
> +        "BASE:allocation" context, where it has no meaning.
>   
>       The exact semantics of what it means for a block to be "clean" or
> -    "active" at a given allocation context is not defined by this
> +    "active" at a given metadata context is not defined by this
>       specification, except that the default in both cases should be to
> -    clear the bit. That is, when the allocation context does not have
> +    clear the bit. That is, when the metadata context does not have
>       knowledge of the relevant status for the given extent, or when the
> -    allocation context does not assign any meaning to it, the bits
> +    metadata context does not assign any meaning to it, the bits
>       should be cleared.

And again: if we say that this is user-defined metadata, and it is not
defined in this spec, why define undefined flags? I propose to remove from
here all attempts to define the internals of user-defined metadata; then,
when we end up with a clean and simple spec of the feature, we can add a
separate paragraph defining a metadata context for dirty bitmaps, like
this:

====
Metadata contexts for NBD_CMD_...., whose names do not start with
"BASE:", are defined by third-party tools; but, to avoid conflicts and to
have common documentation, it is recommended to publish their names and
short descriptions here.

QEMU_DIRTY_BITMAP:<name>  - a family of dirty-bitmap contexts, defined by
Qemu. A dirty bitmap is ....; it is used for ..., by some-company.
                         status extent flags:
                                  bit 0: 0 - means dirty  [If you want,
but I prefer 0 - clean, 1 - dirty]
                                           1 - means clean
                                  bits 1-31: reserved, always zero.
RANDOM_DATA  - just random data
                         status extent flags:
                                 all 32 bits: random data
...
====

>   
>       A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>       set and `NBD_STATE_ZERO` clear.
>   
>   A client MAY close the connection if it detects that the server has
> -sent an invalid chunks (such as lengths in the
> +sent an invalid chunk (such as lengths in the
>   `NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
>   The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
>   request including one or more sectors beyond the size of the device.
> @@ -1574,7 +1583,7 @@ The extension adds the following new command flag:
>   
>   - `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
>     SHOULD be set to 1 if the client wants to request information for only
> -  one extent per allocation context.
> +  one extent per metadata context.
>   
>   ## About this file
>   
>


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-11-29 12:57 ` [Qemu-devel] " Alex Bligh
  2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
@ 2016-12-01 23:42   ` John Snow
  2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
  2016-12-02 18:45     ` Alex Bligh
  1 sibling, 2 replies; 37+ messages in thread
From: John Snow @ 2016-12-01 23:42 UTC (permalink / raw)
  To: Alex Bligh, Vladimir Sementsov-Ogievskiy
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	mpa, Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

Hi Alex, let me try my hand at clarifying some points...

On 11/29/2016 07:57 AM, Alex Bligh wrote:
> Vladimir,
> 
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
> 
> Rather than comment on many of the points individually, the root
> of my confusion, and to some extent my discomfort with this
> proposal, is 'who owns the meaning of the bitmaps'.
> 
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read-only disks, but the text talks about:
> 

I sent an earlier email, in response to Wouter, about our exact "goal"
with this spec extension; it is a little more of an overview, but I will
try to address your questions specifically here.

> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> 
> How can data be 'dirty' if it is static and unchangeable? (I thought)
> 

In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What
happens in this scenario would be that a snapshot of the data at the
time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is
RO. While a hypothetical backup client is busy copying data out of this
temporary view, new writes are coming in to the drive, but are not being
exposed through the NBD export.

(This goes into QEMU-specifics, but those new writes are dirtying a
version of the bitmap not intended to be exposed via the NBD channel.
NBD gets effectively a snapshot of both the bitmap AND the data.)

> I now think what you are talking about is backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.
> 
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
> 

You know, this is a fair point. We have not (to my knowledge) yet
carefully considered the exact bitmap management scenario when NBD is
involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not
yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial
backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has
changed and would like to use NBD to do so, in contrast to QEMU's own
facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot
('fleecing') and exports this snapshot over NBD. The bitmap is
associated with the NBD export at this point at NBD server startup. (For
the sake of QEMU discussion, maybe this command is "blockdev-fleece")

(5) At this moment, the snapshot is static and represents the data at
the time the NBD server was started. The bitmap is also forked and
represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task,
whatever it was. (Making a new incremental backup? Just taking a peek at
some changed data? who knows.)

The interesting question here is what we do with the two bitmaps at this
point. The data delta can be discarded (this was after all just a
lightweight read-only point-in-time snapshot) but the bitmap data needs
to be dealt with.

(A) In the case of "User made a new incremental backup," the bitmap that
got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

I don't advise a hybrid "User copied some data, but not all" case, where
we need to partially clear *and* merge, but conceivably this could
happen, because the things we don't want to happen always will.

At this point maybe it's becoming obvious that it would actually be very
prudent to allow the NBD client itself to inform QEMU via the NBD
protocol which extents/blocks/(etc.) it is "done" with.

Maybe it *would* actually be useful if, in NBD allowing us to add a
"dirty" bit to the specification, we allow users to clear those bits.

Then, whether the user was trying to do (A) or (B) or the unspeakable
amalgamation of both things, it's up to the user to clear the bits
desired and QEMU can do the simple task of simply always merging the
bitmap fork upon the conclusion of the NBD fleecing exercise.
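
To pin down what "fork" and "merge" mean here, a toy model; QEMU's real
dirty bitmap code is nothing this naive, and the struct and helpers are
invented for illustration:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct bitmap {
    uint64_t *bits;
    size_t    words;
};

/* Fork at NBD export start: the frozen copy keeps the accumulated
 * dirty state for the export, while the live bitmap restarts clean so
 * new guest writes keep being tracked. */
static int bitmap_fork(struct bitmap *live, struct bitmap *frozen)
{
    frozen->words = live->words;
    frozen->bits = malloc(frozen->words * sizeof(uint64_t));
    if (!frozen->bits)
        return -1;
    memcpy(frozen->bits, live->bits, frozen->words * sizeof(uint64_t));
    memset(live->bits, 0, live->words * sizeof(uint64_t));
    return 0;
}

/* Case (B) above, or a failed backup: fold the frozen bits back in,
 * as if the fork never happened. Case (A) simply frees 'frozen'. */
static void bitmap_merge(struct bitmap *live, const struct bitmap *frozen)
{
    for (size_t i = 0; i < live->words; i++)
        live->bits[i] |= frozen->bits[i];
}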

Maybe this would allow the dirty bit to have a bit more concrete meaning
for the NBD spec: "The bit stays dirty until the user clears it, and is
set when the matching block/extent/etc is written to."

With an exception that external management may cause the bits to clear.
(I.e., someone fiddles with the backing store in a way opaque to NBD,
e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
> 
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
> 
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'). If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
> 

If we can't work out a more precise, semantically meaningful spec
extension then I am very happy with simply the ability to have a user
bit, or X-QEMU bits, etc etc etc.

> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
> 
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.
> 
> Some more nits:
> 
>> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS has changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in the master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> 
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
> 
> 
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>   A: 1: server replies with status descriptors of any size, granularity
>>         is hidden from the client
>>      2: dirty/allocated requests are separate and unrelated to each
>>         other, so their granularities are not intersecting
> 
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
> 
>> 3. Q: selecting of dirty bitmap to export
>>   A: several variants:
>>      1: id of bitmap is in flags field of request
>>          pros: - simple
>>          cons: - it's a hack. flags field is for other uses.
>>                - we'll have to map bitmap names to these "ids"
>>      2: introduce extended nbd requests with variable length and exploit this
>>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>>         pros: - looks like a true way
>>         cons: - we have to create additional extension
>>               - possible we have to create a map,
>>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>>      3: external tool should select which bitmap to export. So, in the case of Qemu
>>         it should be something like qmp command block-export-dirty-bitmap.
>>         pros: - simple
>>               - we can extend it to behave like (2) later
>>         cons: - additional qmp command to implement (possibly, the lesser evil)
>>         note: Hmm, the external tool can choose between allocated/dirty data too,
>>               so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
> 
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
> 

Having missed most of the discussion on v1/v2, is it a given that we
want in-band identification of bitmaps?

I guess this might depend very heavily on the nature of the definition
of the "dirty bit" in the NBD spec.

>> 4. Q: Shouldn't get_{allocated,dirty} be separate commands?
>>   cons: Two commands with almost the same semantics and similar means?
>>   pros: However, there is a good point in separating the clearly defined,
>>         native-to-block-devices GET_BLOCK_STATUS from the user-driven and
>>         actually undefined data called 'dirtiness'.
> 
> I'm suggesting one generic 'read bitmap' command like you.
> 
>> 5. The number of status descriptors sent by the server should be restricted
>>   variants:
>>   1: just allow the server to restrict this as it wants (which was done in v3)
>>   2: (not excluding 1). The client somehow specifies the maximum number
>>      of descriptors.
>>      2.1: add a command flag which will request only one descriptor
>>           (otherwise, no restrictions from the client)
>>      2.2: again, introduce extended nbd requests, and add a field to
>>           specify this maximum
> 
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.
> 
>>
>> 6. Q: What to do with unspecified flags (in request/reply)?
>>   I think the normal variant is to make them reserved. (The server should
>>   return EINVAL if it finds unknown bits; the client should consider a reply
>>   with unknown bits an error.)
> 
> Yeah.
> 
>>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> 
> MUST NOT is what we say elsewhere I believe.
> 
>> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> 
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured  (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.
> 
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST not be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than required. However the server MUST return at
>> +    least one status descriptor
> 
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.
> 
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> 
> Surely better would be an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
> 

Anyway, I hope I am being useful and not just more confounding. It seems
to me that we're having difficulty conveying precisely what it is we're
trying to accomplish, so I hope that I am making a good effort in
elaborating on our goals/requirements.

If not, well. You've got my address for hatemail :)

Thanks,
--John


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
@ 2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
  2016-12-02 18:45     ` Alex Bligh
  1 sibling, 0 replies; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-02  9:16 UTC (permalink / raw)
  To: John Snow, Alex Bligh
  Cc: nbd-general, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	mpa, Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

02.12.2016 02:42, John Snow wrote:
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
A currently existing example: "failed incremental backup".


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
@ 2016-12-02  9:25             ` Wouter Verhelst
  0 siblings, 0 replies; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-02  9:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Stefan Hajnoczi, nbd-general, kwolf, qemu-devel, pborzenkov, den,
	mpa, pbonzini

Hi Vladimir,

On Thu, Dec 01, 2016 at 02:26:28PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 01.12.2016 13:14, Wouter Verhelst wrote:
[...]
> > -    - `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> > +    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> >         selected by the query string is added to the list of existing
> > -      allocation contexts.
> 
> If I understand correctly, it should not be 'existing' but 'exporting'.
> There are several contexts the server knows about; they definitely exist.
> Some of them may be selected (by the client) for export (to the client,
> through get_block_status).
> 
> So, what about 'list of metadata contexts to export', or something like this?

Yes, good idea. Thanks.

[...]
> > +The "BASE:allocation" metadata context is the basic "exists at all"
> > +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> > +context, this means that the given extent is not allocated in the
> > +backend storage, and that writing to the extent MAY result in the ENOSPC
> > +error. This supports sparse file semantics on the server side. If a
> > +server has only one metadata context (the default), then writing to an
> > +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> This dependency looks strange: this is user-defined metadata, so why
> does it affect allocation?

The reference nbd-server implementation (i.e., the one that accompanies
this document) has a "copy-on-write" mode in which modifications are
written to a separate file. This separate file can easily be a sparse
file.

Let's say you have an export which is fully allocated; the
BASE:allocation state would have STATE_HOLE cleared. However, when
writing to a particular block that has not yet been written to during
that session, the copy-on-write mode would have to allocate a new block
in the copy-on-write file, which may fail with ENOSPC.
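
In toy form, with invented names (the real nbd-server code differs; this
only shows where the failure comes from):

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/* 'written' is a caller-provided bitmap of blocks already present in
 * the copy-on-write diff file. */
static int cow_write_block(int diff_fd, uint8_t *written,
                           uint64_t block, const void *data)
{
    off_t off = (off_t)block * BLOCK_SIZE;

    if (!(written[block / 8] & (1u << (block % 8)))) {
        /* First write to this block in the session: the diff file
         * needs a fresh allocation, and this is what can fail with
         * ENOSPC even though the export itself is fully allocated. */
        int err = posix_fallocate(diff_fd, off, BLOCK_SIZE);
        if (err != 0)
            return -err;
        written[block / 8] |= (uint8_t)(1u << (block % 8));
    }
    if (pwrite(diff_fd, data, BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
        return -errno;
    return 0;
}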

Perhaps the above paragraph could be updated in the sense that it SHOULD
NOT fail in other cases, but that it may depend on the semantics of the
other metadata contexts, whether active (i.e., selected with
NBD_OPT_META_CONTEXT) or not.

Side note: that does mean that some metadata contexts may be specific to
particular exports (e.g., you may have a list of named snapshots to
select, and which ones would exist would depend on the specific export
chosen). I guess that means the metadata contexts should have the name
of the export, too.

> At least, 'only one' is not descriptive; it would be better to mention
> the 'BASE:allocation' name.

Yes, that would make things clearer, indeed.

> (I hope I can ask the server to export only the dirty_bitmap context,
> without exporting allocation?

That was the point of naming that context and making it selectable :-)

> In that case this 'only one' would be dirty_bitmap)
> 
> >   For all other cases, this specification requires no specific semantics
> 
> What are 'other cases'? For all other metadata contexts? Or for all
> cases when we have more than one context?

Other metadata contexts. I'll rephrase it in that sense.

> > -of allocation contexts. Implementations could support allocation
> > +of metadata contexts. Implementations could support metadata
> >   contexts with semantics like the following:
> >   
> > -- Incremental snapshots; if a block is allocated in one allocation
> > +- Incremental snapshots; if a block is allocated in one metadata
> >     context, that implies that it is also allocated in the next level up.
> >   - Various bits of data about the backend of the storage; e.g., if the
> > -  storage is written on a RAID array, an allocation context could
> > +  storage is written on a RAID array, a metadata context could
> >     return information about the redundancy level of a given extent
> >   - If the backend implements a write-through cache of some sort, or
> > -  synchronises with other servers, an allocation context could state
> > -  that an extent is "allocated" once it has reached permanent storage
> > +  synchronises with other servers, a metadata context could state
> > +  that an extent is "active" once it has reached permanent storage
> >     and/or is synchronized with other servers.
> 
> Incremental snapshots sound strange to me. Snapshots are just
> snapshots. A backup may be incremental, but it is not about snapshots.
> I think this example can safely be deleted from the spec.

Yeah, I'll just remove all the examples; they're not really critical,
anyway, and might indeed confuse people.

[...]
> >       Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> > -    allocation context ID, except if the semantics of particular
> > -    allocation contexts mean that the information for one allocation
> > -    context is implied by the information for another.
> > +    metadata context ID, except if the semantics of particular
> > +    metadata contexts mean that the information for one active metadata
> > +    context is implied by the information for another; e.g., if a
> > +    particular metadata context can only have meaning for extents where
> > +    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
> > +    context, servers MAY omit the relevant chunks for that context if
> > +    they already sent an extent with the `NBD_STATE_HOLE` flag set in
> > +    reply to the same `NBD_CMD_BLOCK_STATUS` command.
> 
> Hmm, stop. Are you saying that the server may omit some status
> descriptors for some context? But how? We have no 'offset' field in
> the status descriptor.

Ah. Yes. I'm an idiot :-)
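
Right -- in the draft a status descriptor is just a (length, flags)
pair, and extents are implicitly contiguous from the requested offset,
so an omitted descriptor would silently shift everything after it. A
sketch (the struct follows the draft's wire layout; the walker is
merely illustrative):

    #include <inttypes.h>
    #include <stdio.h>

    /* One status descriptor: 32-bit length plus 32-bit status flags,
     * big-endian on the wire. Note: no offset field. */
    struct nbd_block_descriptor {
        uint32_t length;
        uint32_t flags;   /* e.g. NBD_STATE_HOLE, NBD_STATE_ZERO */
    };

    /* The client reconstructs offsets cumulatively, so descriptor N
     * starts exactly where descriptor N-1 ended. */
    static void walk_extents(uint64_t req_offset,
                             const struct nbd_block_descriptor *d,
                             size_t n)
    {
        uint64_t off = req_offset;
        for (size_t i = 0; i < n; i++) {
            printf("extent %" PRIu64 "+%" PRIu32 " flags 0x%" PRIx32 "\n",
                   off, d[i].length, d[i].flags);
            off += d[i].length;
        }
    }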

> Or can we omit a _context_ if _all_ descriptors of that allocation
> context are holes?
> 
> Anyway, I'm still against this paragraph. These 8 lines actually say
> "the server may omit contexts if it wants". I can always claim that
> the semantics of some context mean that its metadata is implied
> (actually, I can just introduce such semantics), but that does not
> differ from "I want to omit this". In this spec there is no way to
> negotiate context semantics, so the client actually doesn't know
> them. We just hope that both client and server are managed by some
> layer which knows what it is all about.

Yes, I suppose you're right. I'll remove it. It never hurts to send too
much information, and otherwise clients might have to become "too smart"
to understand things properly.

> Also: if it is all about "metadata", the name NBD_CMD_BLOCK_STATUS
> becomes not very descriptive... Maybe we should move to something like
> just NBD_CMD_GET_METADATA )

I don't think that's critical (and I'd rather not rename the branch ;-P )

[...]
> >       The exact semantics of what it means for a block to be "clean" or
> > -    "active" at a given allocation context is not defined by this
> > +    "active" at a given metadata context is not defined by this
> >       specification, except that the default in both cases should be to
> > -    clear the bit. That is, when the allocation context does not have
> > +    clear the bit. That is, when the metadata context does not have
> >       knowledge of the relevant status for the given extent, or when the
> > -    allocation context does not assign any meaning to it, the bits
> > +    metadata context does not assign any meaning to it, the bits
> >       should be cleared.
> 
> And again: if we said that it is user-defined metadata, and it is not
> defined in this spec, why define undefined flags? I propose to remove
> all attempts to define the internals of user-defined metadata here,
> and then, when we finish up with a clean and simple spec of the
> feature, we can add a separate paragraph which defines a metadata
> context for dirty bitmaps like this:

Good point. I'll leave the "bits should be cleared" bit in there,
though.

> ====
> Metadata contexts for NBD_CMD_...., with names not starting with
> "BASE:", are defined by third-party tools, but to avoid conflicts and
> to have common documentation it is recommended to publish their names
> and short descriptions here.
> 
> QEMU_DIRTY_BITMAP:<name> - a family of dirty-bitmap contexts, defined
> by Qemu. A dirty bitmap is ...., It is used for ..., by some-company.
>     status extent flags:
>         bit 0: 0 - means dirty  [if you want,
>                                  but I prefer 0 - clean, 1 - dirty]
>                1 - means clean
>         bits 1-31: reserved, always zero.
> RANDOM_DATA - just random data
>     status extent flags:
>         all 32 bits: random data
> ....
> ====

I don't think semantics of third-party implementations' metadata modes
should be part of this document, but I would be happy to add a link to a
document describing such semantics.

Thanks for your review!

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-01 23:42   ` [Qemu-devel] " John Snow
  2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
@ 2016-12-02 18:45     ` Alex Bligh
  2016-12-02 20:39       ` John Snow
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
  1 sibling, 2 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-02 18:45 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, Vladimir Sementsov-Ogievskiy, nbd-general,
	Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini

John,

>> +Some storage formats and operations over such formats express a
>> +concept of data dirtiness. Whether the operation is block device
>> +mirroring, incremental block device backup or any other operation with
>> +a concept of data dirtiness, they all share a need to provide a list
>> +of ranges that this particular operation treats as dirty.
>> 
>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>> 
> 
> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
> the VM and cause the bitmap that QEMU manages to become dirty.
> 
> We intend to expose the ability to fleece dirty blocks via NBD. What
> happens in this scenario would be that a snapshot of the data at the
> time of the request is exported over NBD in a read-only manner.
> 
> In this way, the drive itself is R/W, but the "view" of it from NBD is
> RO. While a hypothetical backup client is busy copying data out of this
> temporary view, new writes are coming in to the drive, but are not being
> exposed through the NBD export.
> 
> (This goes into QEMU-specifics, but those new writes are dirtying a
> version of the bitmap not intended to be exposed via the NBD channel.
> NBD gets effectively a snapshot of both the bitmap AND the data.)

Thanks. That makes sense - or enough sense for me to carry on commenting!

>> I now think what you are talking about is backing up a *snapshot* of a disk
>> that's running, where the disk itself was not connected using NBD? IE it's
>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>> an opaque state represented in a bitmap, which is binary metadata
>> at some particular level of granularity. It might as well be 'happiness'
>> or 'is coloured blue'. The NBD server would (normally) have no way of
>> manipulating this bitmap.
>> 
>> In previous comments, I said 'how come we can set the dirty bit through
>> writes but can't clear it?'. This (my statement) is now I think wrong,
>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>> state of the bitmap comes from whatever sets the bitmap which is outside
>> the scope of this protocol to transmit it.
>> 
> 
> You know, this is a fair point. We have not (to my knowledge) yet
> carefully considered the exact bitmap management scenario when NBD is
> involved in retrieving dirty blocks.
> 
> Humor me for a moment while I talk about a (completely hypothetical, not
> yet fully discussed) workflow for how I envision this feature.
> 
> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> backup is made, etc.
> 
> (2) As writes come in, QEMU's bitmap is dirtied.
> 
> (3) The user decides they want to root around to see what data has
> changed and would like to use NBD to do so, in contrast to QEMU's own
> facilities for dumping dirty blocks.
> 
> (4) A command is issued that creates a temporary, lightweight snapshot
> ('fleecing') and exports this snapshot over NBD. The bitmap is
> associated with the NBD export at this point at NBD server startup. (For
> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> 
> (5) At this moment, the snapshot is static and represents the data at
> the time the NBD server was started. The bitmap is also forked and
> represents only this snapshot. The live data and bitmap continue to change.
> 
> (6) Dirty blocks are queried and copied out via NBD.
> 
> (7) The user closes the NBD instance upon completion of their task,
> whatever it was. (Making a new incremental backup? Just taking a peek at
> some changed data? who knows.)
> 
> The point that's interesting here is what do we do with the two bitmaps
> at this point? The data delta can be discarded (this was after all just
> a lightweight read-only point-in-time snapshot) but the bitmap data
> needs to be dealt with.
> 
> (A) In the case of "User made a new incremental backup," the bitmap that
> got forked off to serve the NBD read should be discarded.
> 
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
> 
> I don't advise a hybrid "User copied some data, but not all" case,
> where we need to partially clear *and* merge, but conceivably this
> could happen, because the things we don't want to happen always will.
> 
> At this point maybe it's becoming obvious that actually it would be very
> prudent to allow the NBD client itself to inform QEMU via the NBD
> protocol which extents/blocks/(etc) it is "done" with.
> 
> Maybe it *would* actually be useful if, in adding a "dirty" bit to the
> NBD specification, we also allow users to clear those bits.
> 
> Then, whether the user was trying to do (A) or (B) or the unspeakable
> amalgamation of both things, it's up to the user to clear the bits
> desired and QEMU can do the simple task of simply always merging the
> bitmap fork upon the conclusion of the NBD fleecing exercise.
> 
> Maybe this would allow the dirty bit to have a bit more concrete meaning
> for the NBD spec: "The bit stays dirty until the user clears it, and is
> set when the matching block/extent/etc is written to."
> 
> With an exception that external management may cause the bits to clear.
> (I.e., someone fiddles with the backing store in a way opaque to NBD,
> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

There is currently one possible "I'm done with the entire bitmap"
signal, which is closing the connection. This has two obvious
problems. Firstly, if used, it discards the entire bitmap (not
individual bits).
Secondly, it makes recovery from a broken TCP session difficult
(as either you treat a dirty close as meaning the bitmap needs
to hang around, in which case you have a garbage collection issue,
or you treat it as needing to drop the bitmap, in which case you
can't recover).

I think in your plan the block status doesn't change once the bitmap
is forked. In that case, adding some command (optional) to change
the status of the bitmap (or simply to set a given extent to status X)
would be reasonable. Of course whether it's supported could be dependent
on the bitmap.
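
Concretely, such an optional command might carry a payload like this
(purely hypothetical -- nothing of the sort is in the draft, and every
name here is invented):

    #include <stdint.h>

    /* Hypothetical request payload for setting/clearing status bits
     * over one extent of one negotiated metadata context. */
    struct nbd_cmd_set_status {
        uint64_t offset;      /* start of the extent */
        uint32_t length;      /* length of the extent */
        uint32_t context_id;  /* which negotiated metadata context */
        uint32_t flags;       /* new status bits, e.g. "dirty" cleared */
    };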

> Having missed most of the discussion on v1/v2, is it a given that we
> want in-band identification of bitmaps?
> 
> I guess this might depend very heavily on the nature of the definition
> of the "dirty bit" in the NBD spec.

I don't think it's a given. I think Wouter & I came up with it at
the same time as a way to abstract the bitmap/extent concept and
remove the need to specify a dirty bit at all (well, that's my excuse
anyway).

> Anyway, I hope I am being useful and just not more confounding. It seems
> to me that we're having difficulty conveying precisely what it is we're
> trying to accomplish, so I hope that I am making a good effort in
> elaborating on our goals/requirements.

Yes absolutely. I think part of the challenge is that you are quite
reasonably coming at it from the point of view of qemu's particular
need, and I'm coming at it from 'what should the nbd protocol look
like in general' position, having done lots of work on the protocol
docs (though I'm an occasional qemu contributor). So there's necessarily
a gap of approach to be bridged.

I'm overdue on a review of Wouter's latest patch (partly because I need
to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
but I think it's a bridge worth building.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 18:45     ` Alex Bligh
@ 2016-12-02 20:39       ` John Snow
  2016-12-03 11:08         ` Alex Bligh
                           ` (2 more replies)
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
  1 sibling, 3 replies; 37+ messages in thread
From: John Snow @ 2016-12-02 20:39 UTC (permalink / raw)
  To: Alex Bligh
  Cc: Vladimir Sementsov-Ogievskiy, nbd-general, Kevin Wolf,
	Stefan Hajnoczi, qemu-devel, Markus Pargmann,
	Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst, Paolo Bonzini



On 12/02/2016 01:45 PM, Alex Bligh wrote:
> John,
> 
[...]
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!
> 

Whew! I'm glad.

[...]
> 
> There is currently one possible "I'm done with the entire bitmap"
> signal, which is closing the connection. This has two obvious
> problems. Firstly, if used, it discards the entire bitmap (not
> individual bits).
> Secondly, it makes recovery from a broken TCP session difficult
> (as either you treat a dirty close as meaning the bitmap needs
> to hang around, in which case you have a garbage collection issue,
> or you treat it as needing to drop the bitmap, in which case you
> can't recover).
> 

In my mind, I wasn't treating closing the connection as the end of the
point-in-time snapshot; that would be stopping the export.

I wouldn't advocate for a control channel (QEMU, here) clearing the
bitmap just because a client disappeared.

Either:

(A) QEMU clears the bitmap because the NBD export was *stopped*, or
(B) QEMU, acting as the NBD server, clears the bitmap as instructed by
the NBD client, if we admit a provision to clear bits from the NBD
protocol itself.

I don't think there's room for the NBD server (QEMU) deciding to clear
bits based on connection status. It has to be an explicit decision --
either via NBD or QMP.

> I think in your plan the block status doesn't change once the bitmap
> is forked. In that case, adding some command (optional) to change
> the status of the bitmap (or simply to set a given extent to status X)
> would be reasonable. Of course whether it's supported could be dependent
> on the bitmap.
> 

What I describe as "forking" was kind of a bad description. What really
happens when we have a divergence is that the bitmap with data is split
into two bitmaps that are related:

- A new bitmap is created and takes over for the old bitmap. This new
bitmap is empty. It records writes on the live version of the data.
- The old bitmap as it existed remains in a read-only state, and
describes some point-in-time snapshot view of the data.

In the case of an incremental backup, once we've made a backup of the
data, that read-only bitmap can actually be discarded without further
thought.

In the case of a failed incremental backup, or in the case of "I just
wanted to look and see what has changed, but wasn't prepared to reset
the counter yet," this bitmap gets merged back with the live bitmap as
if nothing ever happened.
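
In pseudo-C, the lifecycle I have in mind is roughly this (a sketch
only -- not QEMU's actual bitmap API):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct Bitmap {
        uint64_t *bits;
        size_t    nwords;
        bool      readonly;
    } Bitmap;

    /* Divergence: freeze the old bitmap as the point-in-time view and
     * return a fresh, empty bitmap that records live writes. */
    static Bitmap *bitmap_split(Bitmap *old)
    {
        Bitmap *live = calloc(1, sizeof(*live));
        live->nwords = old->nwords;
        live->bits = calloc(live->nwords, sizeof(uint64_t));
        old->readonly = true;   /* describes only the snapshot now */
        return live;
    }

    /* Failed backup / "just looking": OR the frozen bits back into
     * the live bitmap, as if nothing ever happened. */
    static void bitmap_merge(Bitmap *live, const Bitmap *frozen)
    {
        for (size_t i = 0; i < live->nwords; i++)
            live->bits[i] |= frozen->bits[i];
    }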

ANYWAY, allowing the NBD client to request bits be cleared has an
obvious use case even for QEMU, IMO -- which is, the NBD client itself
gains the ability to, without relying on a control plane to the server,
decide for itself if it is going to "make a backup" or "just look around."

The client gains the ability to leave the bitmap alone (QEMU will
re-merge it later once the snapshot is closed) or the ability to clear
it ("I made my backup, we're done with this.")

That usefulness would allow us to have an explicit dirty bit mechanism
directly in NBD, IMO, because:

(1) A RW NBD server has enough information to mark a bit dirty
(2) Since there exists an in-spec mechanism to reset the bitmap, the
dirty bit is meaningful to the server and the client

>> Having missed most of the discussion on v1/v2, is it a given that we
>> want in-band identification of bitmaps?
>>
>> I guess this might depend very heavily on the nature of the definition
>> of the "dirty bit" in the NBD spec.
> 
> I don't think it's a given. I think Wouter & I came up with it at
> the same time as a way to abstract the bitmap/extent concept and
> remove the need to specify a dirty bit at all (well, that's my excuse
> anyway).
> 

OK. We do certainly support multiple bitmaps being active at a time in
QEMU, but I had personally always envisioned that you'd associate them
one-at-a-time when starting the NBD export of a particular device.

I don't have a use case in my head where two distinct bitmaps being
exposed simultaneously offer any particular benefit, but maybe there is
something. I'm sure there is.

I will leave this aspect of it more to you NBD folks. I think QEMU could
cope with either.

(Vladimir, am I wrong? Do you have thoughts on this in particular? I
haven't thought through this aspect of it very much.)

>> Anyway, I hope I am being useful and just not more confounding. It seems
>> to me that we're having difficulty conveying precisely what it is we're
>> trying to accomplish, so I hope that I am making a good effort in
>> elaborating on our goals/requirements.
> 
> Yes absolutely. I think part of the challenge is that you are quite
> reasonably coming at it from the point of view of qemu's particular
> need, and I'm coming at it from 'what should the nbd protocol look
> like in general' position, having done lots of work on the protocol
> docs (though I'm an occasional qemu contributor). So there's necessarily
> a gap of approach to be bridged.
> 

Yeah, I understand quite well that we need to make sure the NBD spec is
sane and useful in a QEMU-agnostic way, so my goal here is just to help
elucidate our needs to enable you to reach a good consensus.

> I'm overdue on a review of Wouter's latest patch (partly because I need
> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> but I think it's a bridge worth building.
> 

Same. Thank you for your patience!

Cheers,
--js

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 20:39       ` John Snow
@ 2016-12-03 11:08         ` Alex Bligh
  2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
  2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-03 11:08 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, Vladimir Sementsov-Ogievskiy, nbd-general,
	Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini


> On 2 Dec 2016, at 20:39, John Snow <jsnow@redhat.com> wrote:
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The obvious case is an allocation bitmap and a dirty bitmap.

It's possible one might want more than one dirty bitmap at once.
Perhaps two sorts of backup, or perhaps live migration of storage
backed by NBD, or perhaps inspecting the *state* of live migration
via NBD and a bitmap, or perhaps determining which extents in
a QCOW image are in that image itself (as opposed to the image
on which it is based).

I tried to pick some QEMU-like ones, but I am sure there are
examples that would work outside of QEMU.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 20:39       ` John Snow
  2016-12-03 11:08         ` Alex Bligh
@ 2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
  2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2 siblings, 0 replies; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-05  8:36 UTC (permalink / raw)
  To: John Snow, Alex Bligh
  Cc: nbd-general, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini

02.12.2016 23:39, John Snow wrote:
[...]
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
>
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.
>
> I will leave this aspect of it more to you NBD folks. I think QEMU could
> cope with either.
>
> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
> haven't thought through this aspect of it very much.)

I'm OK with either too.

Yes, with online external backup (fleecing), the bitmap is already
selected in Qemu. And this is the most interesting case, anyway.

For offline external backup, when we have an RO disk, it may have
several bitmaps, for example for different backup frequencies, and it
may not be bad for the client to have the ability to choose. But
again, we can select the exported bitmap through QMP (even the same
fleecing scheme would be OK, with the overhead of creating an empty
delta).


-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 20:39       ` John Snow
  2016-12-03 11:08         ` Alex Bligh
  2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
@ 2016-12-06 13:32         ` Wouter Verhelst
  2016-12-06 16:39           ` John Snow
  2 siblings, 1 reply; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-06 13:32 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, qemu-devel, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Markus Pargmann,
	Denis V. Lunev

Hi John

Sorry for the late reply; the weekend was busy, and so was Monday.

On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
[...]
> ANYWAY, allowing the NBD client to request bits be cleared has an
> obvious use case even for QEMU, IMO -- which is, the NBD client itself
> gains the ability to, without relying on a control plane to the server,
> decide for itself if it is going to "make a backup" or "just look around."
> 
> The client gains the ability to leave the bitmap alone (QEMU will
> re-merge it later once the snapshot is closed) or the ability to clear
> it ("I made my backup, we're done with this.")
> 
> That usefulness would allow us to have an explicit dirty bit mechanism
> directly in NBD, IMO, because:
> 
> (1) A RW NBD server has enough information to mark a bit dirty
> (2) Since there exists an in-spec mechanism to reset the bitmap, the
> dirty bit is meaningful to the server and the client

While I can see that the ability to manipulate metadata might have
advantages for certain use cases, I don't think that the ability to
*inspect* metadata should require the ability to manipulate it in any
way.

So I'd like to finish the block_status extension before moving on to
manipulation :)

> >> Having missed most of the discussion on v1/v2, is it a given that we
> >> want in-band identification of bitmaps?
> >>
> >> I guess this might depend very heavily on the nature of the definition
> >> of the "dirty bit" in the NBD spec.
> > 
> > I don't think it's a given. I think Wouter & I came up with it at
> > the same time as a way to abstract the bitmap/extent concept and
> > remove the need to specify a dirty bit at all (well, that's my excuse
> > anyway).
> > 
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The ability to do something does not in any way imply the requirement to
do the same :-)

The idea is that the client negotiates one or more forms of metadata
information from the server that it might be interested in, and then
asks the server for that information for a given extent in which it
has an interest.

The protocol spec does not define what that metadata is (beyond the "is
allocated" one that we define in the spec currently, and possibly
something else in the future). So if qemu only cares about just one type
of metadata, there's no reason why it should *have* to export more than
one type.
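
As a toy model, a client that selected two contexts would see one
NBD_REPLY_TYPE_BLOCK_STATUS chunk per context for a single
NBD_CMD_BLOCK_STATUS request -- something like this (the IDs and the
flag meanings are made up; the draft leaves both to negotiation and to
the context's own definition):

    #include <inttypes.h>
    #include <stdio.h>

    /* Context IDs are assigned during negotiation; values invented. */
    enum { CTX_BASE_ALLOCATION = 1, CTX_QEMU_DIRTY_BITMAP = 2 };

    struct chunk {            /* one NBD_REPLY_TYPE_BLOCK_STATUS chunk */
        uint32_t context_id;
        uint32_t length;      /* single-extent chunk, for brevity */
        uint32_t flags;       /* meaning depends on the context */
    };

    int main(void)
    {
        struct chunk reply[] = {
            { CTX_BASE_ALLOCATION,   65536, 0 },  /* allocated */
            { CTX_QEMU_DIRTY_BITMAP, 65536, 1 },  /* dirty, say */
        };
        for (size_t i = 0; i < sizeof(reply) / sizeof(reply[0]); i++)
            printf("ctx %" PRIu32 ": len %" PRIu32 " flags %" PRIu32 "\n",
                   reply[i].context_id, reply[i].length, reply[i].flags);
        return 0;
    }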

[...]
> Yeah, I understand quite well that we need to make sure the NBD spec is
> sane and useful in a QEMU-agnostic way, so my goal here is just to help
> elucidate our needs to enable you to reach a good consensus.

Right, that's why I was reluctant to merge the original spec as it
stood.

> > I'm overdue on a review of Wouter's latest patch (partly because I need
> > to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> > but I think it's a bridge worth building.
> > 
> 
> Same. Thank you for your patience!

I can do some updates given a few of the suggestions that were made on
this list (no guarantee when that will happen), but if people are
interested in reviewing things in the meantime, be my guest...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-12-06 16:39           ` John Snow
  0 siblings, 0 replies; 37+ messages in thread
From: John Snow @ 2016-12-06 16:39 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	Stefan Hajnoczi, qemu-devel, Pavel Borzenkov,
	Alex Bligh, Denis V. Lunev, Markus Pargmann, Paolo Bonzini



On 12/06/2016 08:32 AM, Wouter Verhelst wrote:
> Hi John
> 
> Sorry for the late reply; the weekend was busy, and so was Monday.
> 

No problems.

> On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
>> On 12/02/2016 01:45 PM, Alex Bligh wrote:
>>> John,
>>>
>>>>> +Some storage formats and operations over such formats express a
>>>>> +concept of data dirtiness. Whether the operation is block device
>>>>> +mirroring, incremental block device backup or any other operation with
>>>>> +a concept of data dirtiness, they all share a need to provide a list
>>>>> +of ranges that this particular operation treats as dirty.
>>>>>
>>>>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>>>>>
>>>>
>>>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
>>>> the VM and cause the bitmap that QEMU manages to become dirty.
>>>>
>>>> We intend to expose the ability to fleece dirty blocks via NBD. What
>>>> happens in this scenario would be that a snapshot of the data at the
>>>> time of the request is exported over NBD in a read-only manner.
>>>>
>>>> In this way, the drive itself is R/W, but the "view" of it from NBD is
>>>> RO. While a hypothetical backup client is busy copying data out of this
>>>> temporary view, new writes are coming in to the drive, but are not being
>>>> exposed through the NBD export.
>>>>
>>>> (This goes into QEMU-specifics, but those new writes are dirtying a
>>>> version of the bitmap not intended to be exposed via the NBD channel.
>>>> NBD gets effectively a snapshot of both the bitmap AND the data.)
>>>
>>> Thanks. That makes sense - or enough sense for me to carry on commenting!
>>>
>>
>> Whew! I'm glad.
>>
>>>>> I now think what you are talking about backing up a *snapshot* of a disk
>>>>> that's running, where the disk itself was not connected using NBD? IE it's
>>>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>>>>> an opaque state represented in a bitmap, which is binary metadata
>>>>> at some particular level of granularity. It might as well be 'happiness'
>>>>> or 'is coloured blue'. The NBD server would (normally) have no way of
>>>>> manipulating this bitmap.
>>>>>
>>>>> In previous comments, I said 'how come we can set the dirty bit through
>>>>> writes but can't clear it?'. This (my statement) is now I think wrong,
>>>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>>>>> state of the bitmap comes from whatever sets the bitmap which is outside
>>>>> the scope of this protocol to transmit it.
>>>>>
>>>>
>>>> You know, this is a fair point. We have not (to my knowledge) yet
>>>> carefully considered the exact bitmap management scenario when NBD is
>>>> involved in retrieving dirty blocks.
>>>>
>>>> Humor me for a moment while I talk about a (completely hypothetical, not
>>>> yet fully discussed) workflow for how I envision this feature.
>>>>
>>>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
>>>> backup is made, etc.
>>>>
>>>> (2) As writes come in, QEMU's bitmap is dirtied.
>>>>
>>>> (3) The user decides they want to root around to see what data has
>>>> changed and would like to use NBD to do so, in contrast to QEMU's own
>>>> facilities for dumping dirty blocks.
>>>>
>>>> (4) A command is issued that creates a temporary, lightweight snapshot
>>>> ('fleecing') and exports this snapshot over NBD. The bitmap is
>>>> associated with the NBD export at this point at NBD server startup. (For
>>>> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
>>>>
>>>> (5) At this moment, the snapshot is static and represents the data at
>>>> the time the NBD server was started. The bitmap is also forked and
>>>> represents only this snapshot. The live data and bitmap continue to change.
>>>>
>>>> (6) Dirty blocks are queried and copied out via NBD.
>>>>
>>>> (7) The user closes the NBD instance upon completion of their task,
>>>> whatever it was. (Making a new incremental backup? Just taking a peek at
>>>> some changed data? who knows.)
>>>>
>>>> The point that's interesting here is what do we do with the two bitmaps
>>>> at this point? The data delta can be discarded (this was after all just
>>>> a lightweight read-only point-in-time snapshot) but the bitmap data
>>>> needs to be dealt with.
>>>>
>>>> (A) In the case of "User made a new incremental backup," the bitmap that
>>>> got forked off to serve the NBD read should be discarded.
>>>>
>>>> (B) In the case of "User just wanted to look around," the bitmap should
>>>> be merged back into the bitmap it was forked from.
>>>>
>>>> I don't advise a hybrid where "User copied some data, but not all" where
>>>> we need to partially clear *and* merge, but conceivably this could
>>>> happen, because the things we don't want to happen always will.
>>>>
>>>> At this point maybe it's becoming obvious that actually it would be very
>>>> prudent to allow the NBD client itself to inform QEMU via the NBD
>>>> protocol which extents/blocks/(etc) that it is "done" with.
>>>>
>>>> Maybe it *would* actually be useful if, in adding a "dirty" bit to
>>>> the NBD specification, we allow users to clear those bits.
>>>>
>>>> Then, whether the user was trying to do (A) or (B) or the unspeakable
>>>> amalgamation of both things, it's up to the user to clear the bits
>>>> desired and QEMU can do the simple task of simply always merging the
>>>> bitmap fork upon the conclusion of the NBD fleecing exercise.
>>>>
>>>> Maybe this would allow the dirty bit to have a bit more concrete meaning
>>>> for the NBD spec: "The bit stays dirty until the user clears it, and is
>>>> set when the matching block/extent/etc is written to."
>>>>
>>>> With an exception that external management may cause the bits to clear.
>>>> (I.e., someone fiddles with the backing store in a way opaque to NBD,
>>>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
>>>
>>> There is currently one possible "I've done with the entire bitmap"
>>> signal, which is closing the connection. This has two obvious
>>> problems. Firstly if used, it discards the entire bitmap (not bits).
>>> Secondly, it makes recovery from a broken TCP session difficult
>>> (as either you treat a dirty close as meaning the bitmap needs
>>> to hang around, in which case you have a garbage collection issue,
>>> or you treat it as needing to drop the bitmap, in which case you
>>> can't recover).
>>>
>>
>> In my mind, I wasn't treating closing the connection as the end of the
>> point-in-time snapshot; that would be stopping the export.
>>
>> I wouldn't advocate for a control channel (QEMU, here) clearing the
>> bitmap just because a client disappeared.
>>
>> Either:
>>
>> (A) QEMU clears the bitmap because the NBD export was *stopped*, or
>> (B) QEMU, acting as the NBD server, clears the bitmap as instructed by
>> the NBD client, if we admit a provision to clear bits from the NBD
>> protocol itself.
>>
>> I don't think there's room for the NBD server (QEMU) deciding to clear
>> bits based on connection status. It has to be an explicit decision --
>> either via NBD or QMP.
>>
>>> I think in your plan the block status doesn't change once the bitmap
>>> is forked. In that case, adding some command (optional) to change
>>> the status of the bitmap (or simply to set a given extent to status X)
>>> would be reasonable. Of course whether it's supported could be dependent
>>> on the bitmap.
>>>
>>
>> What I describe as "forking" was kind of a bad description. What really
>> happens when we have a divergence is that the bitmap with data is split
>> into two bitmaps that are related:
>>
>> - A new bitmap is created and takes over for the old bitmap. This new
>> bitmap is empty. It records writes on the live version of the data.
>> - The old bitmap as it existed remains in a read-only state, and
>> describes some point-in-time snapshot view of the data.
>>
>> In the case of an incremental backup, once we've made a backup of the
>> data, that read-only bitmap can actually be discarded without further
>> thought.
>>
>> In the case of a failed incremental backup, or in the case of "I just
>> wanted to look and see what has changed, but wasn't prepared to reset
>> the counter yet," this bitmap gets merged back with the live bitmap as
>> if nothing ever happened.
>>
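
As a rough sketch of that split/merge lifecycle (hypothetical types and
helper names for illustration only; this is not QEMU's actual
dirty-bitmap API):

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model: a 64-block "bitmap" held in one word. */
    typedef uint64_t Bitmap;

    /* At export start: freeze the current bitmap to describe the
     * snapshot, and start an empty one that records writes to the
     * live data. */
    static Bitmap export_start(Bitmap *live)
    {
        Bitmap frozen = *live;
        *live = 0;
        return frozen;
    }

    /* At export end: a completed backup discards the frozen bits; an
     * abandoned one merges them back as if nothing had happened. */
    static void export_end(Bitmap *live, Bitmap frozen, bool backed_up)
    {
        if (!backed_up) {
            *live |= frozen;  /* union of the dirty bits */
        }
    }
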
>> ANYWAY, allowing the NBD client to request bits be cleared has an
>> obvious use case even for QEMU, IMO -- which is, the NBD client itself
>> gains the ability to, without relying on a control plane to the server,
>> decide for itself if it is going to "make a backup" or "just look around."
>>
>> The client gains the ability to leave the bitmap alone (QEMU will
>> re-merge it later once the snapshot is closed) or the ability to clear
>> it ("I made my backup, we're done with this.")
>>
>> That usefulness would allow us to have an explicit dirty bit mechanism
>> directly in NBD, IMO, because:
>>
>> (1) A RW NBD server has enough information to mark a bit dirty
>> (2) Since there exists an in-spec mechanism to reset the bitmap, the
>> dirty bit is meaningful to the server and the client
> 
> While I can see that the ability to manipulate metadata might have
> advantages for certain use cases, I don't think that the ability to
> *inspect* metadata should require the ability to manipulate it in any
> way.
> 
> So I'd like to finish the block_status extension before moving on to
> manipulation :)
> 

Understood. By admitting that the manipulation of bits may have a
purpose in NBD, I was trying to pin down the exact meaning of the
dirty bit.

It had seemed to me that not specifying (or disallowing) the
manipulation of those bits from within NBD necessarily meant that their
meaning existed entirely out-of-spec for NBD, which could be a show-stopper.

So I was attempting to show that by allowing their manipulation in NBD,
they'd have full in-spec meaning. It all depends on what exactly we name
those bits and how you'd like to define their meaning. There are many
ways we can expose this information in a useful manner, so this was just
another option.

>>>> Having missed most of the discussion on v1/v2, is it a given that we
>>>> want in-band identification of bitmaps?
>>>>
>>>> I guess this might depend very heavily on the nature of the definition
>>>> of the "dirty bit" in the NBD spec.
>>>
>>> I don't think it's a given. I think Wouter & I came up with it at
>>> the same time as a way to abstract the bitmap/extent concept and
>>> remove the need to specify a dirty bit at all (well, that's my excuse
>>> anyway).
>>>
>>
>> OK. We do certainly support multiple bitmaps being active at a time in
>> QEMU, but I had personally always envisioned that you'd associate them
>> one-at-a-time when starting the NBD export of a particular device.
>>
>> I don't have a use case in my head where two distinct bitmaps being
>> exposed simultaneously offer any particular benefit, but maybe there is
>> something. I'm sure there is.
> 
> The ability to do something does not in any way imply the requirement to
> do the same :-)
> 

hence the ask. It's not something QEMU currently needs, but I am a bad
psychic.

> The idea is that the client negotiates one or more forms of metadata
> information from the server that it might be interested in, and then
> asks the server that information for a given extent where it has
> interest.
> 

via the NBD protocol, you mean?

> The protocol spec does not define what that metadata is (beyond the "is
> allocated" one that we define in the spec currently, and possibly
> something else in the future). So if qemu only cares about just one type
> of metadata, there's no reason why it should *have* to export more than
> one type.
> 
>> I will leave this aspect of it more to you NBD folks. I think QEMU could
>> cope with either.
>>
>> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
>> haven't thought through this aspect of it very much.)
>>
>>>> Anyway, I hope I am being useful and just not more confounding. It seems
>>>> to me that we're having difficulty conveying precisely what it is we're
>>>> trying to accomplish, so I hope that I am making a good effort in
>>>> elaborating on our goals/requirements.
>>>
>>> Yes absolutely. I think part of the challenge is that you are quite
>>> reasonably coming at it from the point of view of qemu's particular
>>> need, and I'm coming at it from 'what should the nbd protocol look
>>> like in general' position, having done lots of work on the protocol
>>> docs (though I'm an occasional qemu contributor). So there's necessarily
>>> a gap of approach to be bridged.
>>>
>>
>> Yeah, I understand quite well that we need to make sure the NBD spec is
>> sane and useful in a QEMU-agnostic way, so my goal here is just to help
>> elucidate our needs to enable you to reach a good consensus.
> 
> Right, that's why I was reluctant to merge the original spec as it
> stood.
> 
>>> I'm overdue on a review of Wouter's latest patch (partly because I need
>>> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
>>> but I think it's a bridge worth building.
>>>
>>
>> Same. Thank you for your patience!
> 
> I can do some updates given a few of the suggestions that were made on
> this list (no guarantee when that will happen), but if people are
> interested in reviewing things in the mean time, be my guest...
> 

I'll take a look at your revision(s), thanks.

--js


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-02 18:45     ` Alex Bligh
  2016-12-02 20:39       ` John Snow
@ 2016-12-08  3:39       ` Alex Bligh
  2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
  2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  1 sibling, 2 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-08  3:39 UTC (permalink / raw)
  To: John Snow
  Cc: Alex Bligh, Vladimir Sementsov-Ogievskiy, nbd-general,
	Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini


> On 2 Dec 2016, at 18:45, Alex Bligh <alex@alex.org.uk> wrote:
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!


I finally had some time to go through this extension in detail. Rather
than comment on all the individual patches, I squashed them into a single
commit, did a 'git format-patch' on it, and commented on that.

diff --git a/doc/proto.md b/doc/proto.md
index c443494..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -871,6 +869,50 @@ of the newstyle negotiation.

    Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).

+- `NBD_OPT_META_CONTEXT` (10)
+
+    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
+    followed by an `NBD_REP_ACK`. If a server replies to such a request
+    with no error message, clients

"*the* server" / "*the* cient"

Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
client querying the server.

+    MAY send NBD_CMD_BLOCK_STATUS
+    commands during the transmission phase.

Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
transmission phase."

+
+    If the query string is syntactically invalid, the server SHOULD send
+    `NBD_REP_ERR_INVALID`.

'MUST send' else it implies sending nothing is permissible.

+    If the query string is syntactically valid
+    but finds no metadata contexts, the server MUST send a single
+    reply of type `NBD_REP_ACK`.
+
+    This option MUST NOT be requested unless structured replies have

Active voice better:

"The client MUST NOT send this option unless" ...

+    been negotiated first. If a client attempts to do so, a server
+    SHOULD send `NBD_REP_ERR_INVALID`.
+
+    Data:
+    - 32 bits, type
+    - String, query to select a subset of the available metadata
+      contexts. If this is not specified (i.e., length is 4 and no
+      command is sent), then the server MUST send all the metadata
+      contexts it knows about. If specified, this query string MUST
+      start with a name that uniquely identifies a server
+      implementation; e.g., the reference implementation that
+      accompanies this document would support query strings starting
+      with 'nbd-server:'

Why not just define the format of a metadata-context string to be
of the form '<namespace>:<name>' (perhaps this definition goes
elsewhere), and here just say the returned list is a left-match
of all the available metadata-contexts, i.e. all those metadata
contexts whose names either start with or consist entirely of the
specified string. If an empty string is specified, all metadata
contexts are returned.
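
In code, such a left-match is plain prefix matching; a minimal sketch
(illustrative only, not from any implementation; query_matches is a
made-up name):

    #include <stdbool.h>
    #include <string.h>

    /* True if the context name starts with (or equals) the query.
     * An empty query matches every available context. */
    static bool query_matches(const char *query, const char *name)
    {
        return strncmp(name, query, strlen(query)) == 0;
    }

With that, query_matches("BASE:", "BASE:allocation") and
query_matches("BASE:allocation", "BASE:allocation") both hold.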

+
+    The type may be one of:
+    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
+      selected by the query string is returned to the client without
+      changing any state (i.e., this does not add metadata contexts
+      for further usage).

Somewhere it should say this list is returned by sending
zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

+    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
+      selected by the query string is added to the list of existing
+      metadata contexts.
+    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
+      selected by the query string is removed from the list of used
+      metadata contexts. Servers SHOULD NOT reuse existing metadata
+      context IDs.
+
+    The syntax of the query string is not specified, except that
+    implementations MUST support adding and removing individual metadata
+    contexts by simply listing their names.

This seems slightly over complicated. Rather than have a list held
by the server of active metadata contexts, we could simply have
two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
NBD_OPT_SET_META_CONTEXTS (which simply sets a list). Then no
need for 'type', and _ADD_ and _DEL_.

#### Option reply types

These values are used in the "reply type" field, sent by the server
@@ -882,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
    information is available, or when sending data related to the option
    (in the case of `NBD_OPT_LIST`) has finished. No data.

...

+- `NBD_REP_META_CONTEXT` (4)
+
+    A description of a metadata context. Data:
+
+    - 32 bits, NBD metadata context ID.
+    - String, name of the metadata context. This is not required to be
+      a human-readable string, but it MUST be valid UTF-8 data.

I would suggest putting in a length of the string before the string,
which will allow us to expand this later to add fields if necessary.
This seems to be what we are doing elsewhere.
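
On the wire, the suggested layout might then look like this (a
hypothetical C rendering, keeping the usual NBD convention that all
integers are big-endian):

    #include <stdint.h>

    /* Hypothetical NBD_REP_META_CONTEXT payload with a length-prefixed
     * name. The name is UTF-8 and not NUL-terminated; later revisions
     * could append further fields after it. */
    struct nbd_rep_meta_context {
        uint32_t context_id; /* NBD metadata context ID        */
        uint32_t name_len;   /* bytes of name that follow      */
        /* followed by name_len bytes of metadata context name */
    };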

+
+    This specification declares one metadata context. It is called
+    "BASE:allocation" and contains the basic "exists at all" context.
+
There are a number of error reply types, all of which are denoted by
having bit 31 set. All error replies MAY have some data set, in which
case that data is an error message string suitable for display to the user.
@@ -938,15 +991,48 @@ case that data is an error message string suitable for display to the user.

...

+##### Metadata contexts
+

We're missing some explanation as to what these 'metadata contexts'
things actually are. I suggest:

A metadata context represents metadata concerning the selected
export in the form of a list of extents, each with a status. The
meaning of the metadata and the status is dependent upon the
context. Each metadata context has a name which is colon separated,
in the form '<namespace>:<name>'. Namespaces that start with "X-"
are vendor dependent extensions. The 'BASE' namespace represents
metadata contexts defined within this document. Other namespaces
may be registered by reference within this document.

+The "BASE:allocation" 

Backticks: `BASE:allocation`

+ metadata context is the basic "exists at all" metadata context.

Disagree. You're saying that if a server supports metadata contexts
at all, it must support this one. Why? It might just want to do the
backup thing. There's no reason to make this compulsory.

+If an extent is marked with `NBD_STATE_HOLE` at that
+context, this means that the given extent is not allocated in the
+backend storage, and that writing to the extent MAY result in the ENOSPC
+error. This supports sparse file semantics on the server side. If a
+server has only one metadata context (the default), then writing to an
+extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.

I don't understand the purpose of the last sentence here. If a server
does not support the 'BASE:allocation' metadata context then writing
to any extent can fail with ENOSPC. What's the significance of having
exactly one metadata context?

I think the last sentence is probably meant to read something like:

If a server supports the "BASE:allocation" metadata context, then
writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
with ENOSPC.

+For all other cases, this specification requires no specific semantics
+of metadata contexts. Implementations could support metadata
+contexts with semantics like the following:
+
+- Incremental snapshots; if a block is allocated in one metadata
+  context, that implies that it is also allocated in the next level up.
+- Various bits of data about the backend of the storage; e.g., if the
+  storage is written on a RAID array, a metadata context could
+  return information about the redundancy level of a given extent
+- If the backend implements a write-through cache of some sort, or
+  synchronises with other servers, a metadata context could state
+  that an extent is "active" once it has reached permanent storage
+  and/or is synchronized with other servers.
+
+The only requirement of a metadata context is that it MUST be
+representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
+
+Likewise, the syntax of query strings is not specified by this document.
+
+Server implementations SHOULD document their syntax for query strings
+and semantics for resulting metadata contexts in a document like this
+one.

This will need slight tweaking with the namespace thing. Happy to
have a go if that works.

+
### Transmission phase

#### Flag fields
@@ -983,6 +1069,9 @@ valid may depend on negotiation during the handshake phase.
   content chunk in reply.  MUST NOT be set unless the transmission
   flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
   `EOVERFLOW` error chunk, if the request length is too large.
+- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
+  set, the client is interested in only one extent per metadata
+  context.

##### Structured reply flags

@@ -1051,6 +1140,10 @@ interpret the "length" bytes of payload.
  64 bits: offset (unsigned)  
  32 bits: hole size (unsigned, MUST be nonzero)  

+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+  Defined by the experimental extension `BLOCK_STATUS`; see below.
+
All error chunk types have bit 15 set, and begin with the same
*error*, *message length*, and optional *message* fields as
`NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
@@ -1085,7 +1178,7 @@ remaining structured fields at the end.
  were sent earlier in the structured reply, the server SHOULD NOT
  send multiple distinct offsets that lie within the bounds of a
  single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.

  The payload is structured as:

@@ -1259,6 +1352,11 @@ The following request types exist:

    Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).

+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    Defined by the experimental `BLOCK_STATUS` extension; see below.

This is wrong, because the way we document extensions is that in the branch,
it's as if the relevant extension is no longer experimental. So remove
'experimental'.

+
* Other requests

    Some third-party implementations may require additional protocol
@@ -1345,6 +1443,148 @@ written as branches which can be merged into master if and
when those extensions are promoted to the normative version
of the document in the master branch.

+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often needed to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.
+
+To provide such a class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states.  This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.

'unless the client and server negotiate'

+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST not be greater than the
+    length of the original request.

I'm a bit unhappy with this. The way structured replies work, the
length of the replies is meant to add up to the length requested. I
think implementations might write a reply processor and want to assume
this. I know that the idea here is that the server can effectively
'give up' sending stuff - though with structured replies to be honest
that's less of an issue. If we really need this ability, why not
allow the server to send a final chunk that's in "don't know" state?

+ This chunk type MUST appear at most
+    once per metadata ID in a structured reply. Valid as a reply to
+    `NBD_CMD_BLOCK_STATUS`.
+
+    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
+    metadata context ID, except if the semantics of particular
+    metadata contexts mean that the information for one active metadata
+    context is implied by the information for another; e.g., if a
+    particular metadata context can only have meaning for extents where
+    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
+    context, servers MAY omit the relevant chunks for that context if
+    they already sent an extent with the `NBD_STATE_HOLE` flag set in
+    reply to the same `NBD_CMD_BLOCK_STATUS` command.
+
+    The payload starts with:
+
+        * 32 bits, metadata context ID
+
+    and is followed by a list of one or more descriptors, each with this
+    layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
+    then every reply chunk MUST NOT contain more than one descriptor.

Given you've defined length as '4+(a positive multiple of 8)'
this suggests that you mean 'exactly one' here. One of those
is wrong.
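
Putting the payload layout above into code, a client-side parse might
look like this (a sketch only; the parse_block_status helper and its
error handling are illustrative):

    #include <arpa/inet.h>  /* ntohl */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Parse one NBD_REPLY_TYPE_BLOCK_STATUS payload: a 32-bit metadata
     * context ID followed by (length, status) pairs, big-endian. */
    static int parse_block_status(const uint8_t *buf, uint32_t len)
    {
        uint32_t context_id, i;

        if (len < 12 || (len - 4) % 8 != 0) {
            return -1;  /* MUST be 4 + a positive multiple of 8 */
        }
        memcpy(&context_id, buf, 4);
        context_id = ntohl(context_id);

        for (i = 4; i < len; i += 8) {
            uint32_t extent_len, flags;

            memcpy(&extent_len, buf + i, 4);
            memcpy(&flags, buf + i + 4, 4);
            printf("context %u: %u bytes, status 0x%x\n",
                   context_id, ntohl(extent_len), ntohl(flags));
        }
        return 0;
    }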

+    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
+    its request, the server MAY return less descriptors in the reply

s/less/fewer/

+    than would be required to fully specify the whole range of requested
+    information to the client, if the number of descriptors would be
+    over 16 otherwise and looking up the information would be too
+    resource-intensive for the server.

Seems a bit odd we permit this sort of rate limiting but don't
e.g. rate-limit read.

+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range of
+    interest. Clients MUST NOT use this request unless metadata
+    contexts have been negotiated, which in turn requires the client to
+    first negotiate structured replies. For a successful return, the
+    server MUST use a structured reply, containing at least one chunk of
+    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
+    of the file starting from specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST not be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than required. However the server MUST return at
+    least one status descriptor.  The server SHOULD use different
+    *status* values between consecutive descriptors,

Why? This seems like a needless restriction.

+     and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception).

Why 512 bytes as opposed to 'minimum block size' (or is it because
that is also an experimental extension)?

+  The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk.

We should probably point out that you can return an error half way
through - that being the point of structured replies.

+  However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.

What's the point of this? This appears to say that a server can lie
and return everything as not a hole, and not zero! Surely we're
already covered from the DoS angle?

+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
+    return the status of the device,

status of the metadata context

+ where the status field of each
+    descriptor is determined by the following bits (all combinations of
+    these bits are possible):

In my mind these status bits are defined entirely by the metadata
context, and the definitions below apply only to `BASE:allocation`

+
+      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report metadata even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.

So the above two are `BASE:allocation` only, but ...

+      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status. The server MUST NOT set this bit for
+        the "BASE:allocation" context, where it has no meaning.
+      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
+        portion of the file that is "active" in the given metadata
+        context. The server MUST NOT set this bit for the
+        "BASE:allocation" context, where it has no meaning.
+
+    The exact semantics of what it means for a block to be "clean" or
+    "active" at a given metadata context is not defined by this
+    specification, except that the default in both cases should be to
+    clear the bit. That is, when the metadata context does not have
+    knowledge of the relevant status for the given extent, or when the
+    metadata context does not assign any meaning to it, the bits
+    should be cleared.

... all the above should go as these describe the QEMU incremental
backup thing, which we agreed, I think, not to describe here.

+
+    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
+    set and `NBD_STATE_ZERO` clear.

That makes no sense, as normal data has both these bits clear! This
also implies that to comply with this SHOULD, a client needs to
request block status before any read, which is ridiculous. This
should be dropped.

+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunk (such as lengths in the
+`NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).

I agree with the above, but this goes counter to the text above allowing
the server to return lengths that do not sum to the requested length.
Also the expression we use elsewhere is 'terminates transmission'
for a close during the transmission phase.

+The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
+request including one or more sectors beyond the size of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request information for only
+  one extent per metadata context.
+

Already handled above - this is a duplication

## About this file

This file tries to document the NBD protocol as it is currently


-- 
Alex Bligh


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
@ 2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
  2016-12-08 14:13           ` Alex Bligh
  2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  1 sibling, 1 reply; 37+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-12-08  6:58 UTC (permalink / raw)
  To: Alex Bligh, John Snow
  Cc: nbd-general, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	Markus Pargmann, Pavel Borzenkov, Denis V. Lunev,
	Wouter Verhelst, Paolo Bonzini

08.12.2016 06:39, Alex Bligh wrote:

[...]

>> are vendor dependent extensions. The 'BASE' namespace represents
>> metadata contexts defined within this document. Other namespaces
>> may be registered by reference within this document.
>>
>> +The "BASE:allocation"
>>
>> Backticks: `BASE:allocation`

An idea: let's not use uppercase. Why shout the namespace? 'base' and
'x-' would be better, I think. BASE and X- will provoke all user-defined
namespaces to be uppercase too, and a lot of uppercase will creep into
the code =(


-- 
Best regards,
Vladimir


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
  2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
@ 2016-12-08  9:44         ` Wouter Verhelst
  2016-12-08 14:40           ` Alex Bligh
  1 sibling, 1 reply; 37+ messages in thread
From: Wouter Verhelst @ 2016-12-08  9:44 UTC (permalink / raw)
  To: Alex Bligh
  Cc: John Snow, nbd-general, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-devel, Paolo Bonzini, Pavel Borzenkov,
	Stefan Hajnoczi, Markus Pargmann, Denis V. Lunev

On Thu, Dec 08, 2016 at 03:39:19AM +0000, Alex Bligh wrote:
> 
> > On 2 Dec 2016, at 18:45, Alex Bligh <alex@alex.org.uk> wrote:
> > 
> > Thanks. That makes sense - or enough sense for me to carry on commenting!
> 
> 
> I finally had some time to go through this extension in detail. Rather
> than comment on all the individual patches, I squashed them into a single
> commit, did a 'git format-patch' on it, and commented on that.
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index c443494..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -871,6 +869,50 @@ of the newstyle negotiation.
> 
>     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 
> +- `NBD_OPT_META_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients
> 
> "*the* server" / "*the* cient"
> 
> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
> client querying the server.

I don't think that's necessarily a good idea. I think if the server
replies with NBD_REP_ERR_INVALID, that means it understands the option
but the user sent something invalid and now we haven't selected any
contexts. That means the server won't be able to provide any metadata
during transmission, either.

Perhaps it should be made clearer that once contexts have been selected
(even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.

> +    MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.
> 
> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
> the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
> transmission phase."

That would be counter to the above.

> +    If the query string is syntactically invalid, the server SHOULD send
> +    `NBD_REP_ERR_INVALID`.
> 
> 'MUST send' else it implies sending nothing is permissible.

Yes, I had already fixed that locally (but didn't push yet, because that
patch is... big, and needs some rework. I'll look at it again later,
incorporating your comments)

> +    If the query string is syntactically valid
> +    but finds no metadata contexts, the server MUST send a single
> +    reply of type `NBD_REP_ACK`.
> +
> +    This option MUST NOT be requested unless structured replies have
> 
> Active voice better:
> 
> "The client MUST NOT send this option unless" ...

Right.

> +    been negotiated first. If a client attempts to do so, a server
> +    SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +    Data:
> +    - 32 bits, type
> +    - String, query to select a subset of the available metadata
> +      contexts. If this is not specified (i.e., length is 4 and no
> +      command is sent), then the server MUST send all the metadata
> +      contexts it knows about. If specified, this query string MUST
> +      start with a name that uniquely identifies a server
> +      implementation; e.g., the reference implementation that
> +      accompanies this document would support query strings starting
> +      with 'nbd-server:'
> 
> Why not just define the format of a metadata-context string to be
> of the form '<namespace>:<name>' (perhaps this definition goes
> elsewhere), and here just say the returned list is a left-match
> of all the available metadata-contexts, i.e. all those metadata
> contexts whose names either start with or consist entirely of the
> specified string. If an empty string is specified, all metadata
> contexts are returned.

I also want to make it possible for an implementation to define its own
syntax. Say, a "last-snapshot" thing, or something that says
"diff:snapshot1-snapshot2", or whatever.

> +    The type may be one of:
> +    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
> +      selected by the query string is returned to the client without
> +      changing any state (i.e., this does not add metadata contexts
> +      for further usage).
> 
> Somewhere it should say this list is returned by sending
> zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

We do that above already.

> +    - `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> +      selected by the query string is added to the list of existing
> +      metadata contexts.
> +    - `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
> +      selected by the query string is removed from the list of used
> +      metadata contexts. Servers SHOULD NOT reuse existing metadata
> +      context IDs.
> +
> +    The syntax of the query string is not specified, except that
> +    implementations MUST support adding and removing individual metadata
> +    contexts by simply listing their names.
> 
> This seems slightly over complicated. Rather than have a list held
> by the server of active metadata contexts, we could simply have
> two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
> NBD_OPT_SET_META_CONTEXTS (which simply sets a list). Then no
> need for 'type', and _ADD_ and _DEL_.

Hrm. Probably better, yes.

> #### Option reply types
> 
> These values are used in the "reply type" field, sent by the server
> @@ -882,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
>     information is available, or when sending data related to the option
>     (in the case of `NBD_OPT_LIST`) has finished. No data.
> 
> ....
> 
> +- `NBD_REP_META_CONTEXT` (4)
> +
> +    A description of a metadata context. Data:
> +
> +    - 32 bits, NBD metadata context ID.
> +    - String, name of the metadata context. This is not required to be
> +      a human-readable string, but it MUST be valid UTF-8 data.
> 
> I would suggest putting in a length of the string before the string,
> which will allow us to expand this later to add fields if necessary.
> This seems to be what we are doing elsewhere.

True enough.

> +    This specification declares one metadata context. It is called
> +    "BASE:allocation" and contains the basic "exists at all" context.
> +
> There are a number of error reply types, all of which are denoted by
> having bit 31 set. All error replies MAY have some data set, in which
> case that data is an error message string suitable for display to the user.
> @@ -938,15 +991,48 @@ case that data is an error message string suitable for display to the user.
> 
> ....
> 
> +##### Metadata contexts
> +
> 
> We're missing some explanation as to what these 'metadata contexts'
> things actually are. I suggest:
> 
> A metadata context represents metadata concerning the selected
> export in the form of a list of extents, each with a status. The
> meaning of the metadata and the status is dependent upon the
> context. Each metadata context has a name which is colon separated,
> in the form '<namespace>:<name>'. Namespaces that start with "X-"
> are vendor dependent extensions.

No, I wouldn't do that, since by definition, every namespace is
vendor-dependent.

Maybe we could ask that people who want to implement their own metadata
context type (rather than be compatible with an existing one) should
register their namespace somewhere, though.

> The 'BASE' namespace represents metadata contexts defined within this
> document. Other namespaces may be registered by reference within this
> document.
> 
> +The "BASE:allocation" 
> 
> Backticks: `BASE:allocation`

Right.

> + metadata context is the basic "exists at all" metadata context.
> 
> Disagree. You're saying that if a server supports metadata contexts
> at all, it must support this one.

No, I'm trying to say that this metadata context exposes whether the
*block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
probably clarify that wording then, if you misunderstood it in that way.

No server is REQUIRED to implement it (although we might want to make it a SHOULD)

> Why? It might just want to do the backup thing. There's no reason to
> make this compulsory.
> 
> +If an extent is marked with `NBD_STATE_HOLE` at that
> +context, this means that the given extent is not allocated in the
> +backend storage, and that writing to the extent MAY result in the ENOSPC
> +error. This supports sparse file semantics on the server side. If a
> +server has only one metadata context (the default), then writing to an
> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> I don't understand the purpose of the last sentence here. If a server
> does not support the 'BASE:allocation' metadata context then writing
> to any extent can fail with ENOSPC. What's the significance of having
> exactly one metadata context?

Yes, Vladimir also made that comment, and I've been modifying the text a
bit to that effect.

> I think the last sentence is probably meant to read something like:
> 
> If a server supports the "BASE:allocation" metadata context, then
> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
> with ENOSPC.

No, it can't.

Other metadata contexts may change the semantics to the extent that even
if the block is allocated at the BASE:allocation context, it would still
need space on another plane. In that case, writing to an area which
is not STATE_HOLE might still fail.

> +For all other cases, this specification requires no specific semantics
> +of metadata contexts. Implementations could support metadata
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one metadata
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, a metadata context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, a metadata context could state
> +  that an extent is "active" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of a metadata context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting metadata contexts in a document like this
> +one.
> 
> This will need slight tweaking with the namespace thing. Happy to
> have a go if that works.

Sure.

> ### Transmission phase
> 
> #### Flag fields
> @@ -983,6 +1069,9 @@ valid may depend on negotiation during the handshake phase.
>    content chunk in reply.  MUST NOT be set unless the transmission
>    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>    `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per metadata
> +  context.
> 
> ##### Structured reply flags
> 
> @@ -1051,6 +1140,10 @@ interpret the "length" bytes of payload.
>   64 bits: offset (unsigned)  
>   32 bits: hole size (unsigned, MUST be nonzero)  
> 
> +- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> +
> +  Defined by the experimental extension `BLOCK_STATUS`; see below.
> +
> All error chunk types have bit 15 set, and begin with the same
> *error*, *message length*, and optional *message* fields as
> `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
> @@ -1085,7 +1178,7 @@ remaining structured fields at the end.
>   were sent earlier in the structured reply, the server SHOULD NOT
>   send multiple distinct offsets that lie within the bounds of a
>   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> -  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
> +  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
> 
>   The payload is structured as:
> 
> @@ -1259,6 +1352,11 @@ The following request types exist:
> 
>     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).
> 
> +* `NBD_CMD_BLOCK_STATUS` (7)
> +
> +    Defined by the experimental `BLOCK_STATUS` extension; see below.
> 
> This is wrong, because the way we document extensions is that in the branch,
> it's as if the relevant extension is no longer experimental. So remove
> 'experimental'.
> 
> +
> * Other requests
> 
>     Some third-party implementations may require additional protocol
> @@ -1345,6 +1443,148 @@ written as branches which can be merged into master if and
> when those extensions are promoted to the normative version
> of the document in the master branch.
> 
> +### `BLOCK_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to
> +query the status of a particular range and read only those blocks of
> +data that are actually present on the block device.
> +
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> +
> +To provide such a class of information, the `BLOCK_STATUS` extension
> +adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
> +ranges with their respective states.  This extension is not available
> +unless the client also negotiates the `STRUCTURED_REPLY` extension.
> 
> 'unless the client and server negotiate'
> 
> +
> +* `NBD_REPLY_TYPE_BLOCK_STATUS`
> +
> +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
> +    represents a series of consecutive block descriptors where the sum
> +    of the lengths of the descriptors MUST not be greater than the
> +    length of the original request.
> 
> I'm a bit unhappy with this. The way structured replies work, the
> length of the replies is meant to add up to the length requested. I
> think implementations might write a reply processor and want to assume
> this. I know that the idea here is that the server can effectively
> 'give up' sending stuff - though with structured replies to be honest
> that's less of an issue. If we really need this ability, why not
> allow the server to send a final chunk that's in "don't know" state?

Not a bad idea.

> + This chunk type MUST appear at most
> +    once per metadata ID in a structured reply. Valid as a reply to
> +    `NBD_CMD_BLOCK_STATUS`.
> +
> +    Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> +    metadata context ID, except if the semantics of particular
> +    metadata contexts mean that the information for one active metadata
> +    context is implied by the information for another; e.g., if a
> +    particular metadata context can only have meaning for extents where
> +    the `NBD_STATE_HOLE` flag is cleared on the "BASE:allocation"
> +    context, servers MAY omit the relevant chunks for that context if
> +    they already sent an extent with the `NBD_STATE_HOLE` flag set in
> +    reply to the same `NBD_CMD_BLOCK_STATUS` command.
> +
> +    The payload starts with:
> +
> +        * 32 bits, metadata context ID
> +
> +    and is followed by a list of one or more descriptors, each with this
> +    layout:
> +
> +        * 32 bits, length (unsigned, MUST NOT be zero)
> +        * 32 bits, status flags
> +
> +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
> +    then every reply chunk MUST NOT contain more than one descriptor.
> 
> Given you've defined length as '4+(a positive multiple of 8)'
> this suggests that you mean 'exactly one' here. One of those
> is wrong.

That would make it clearer, indeed (although "1" is "not more than one",
and "8" is also "a positive multiple of 8", so it's not *wrong*, per se,
it's just confusing)

> +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> +    its request, the server MAY return less descriptors in the reply
> 
> s/less/fewer/
> 
> +    than would be required to fully specify the whole range of requested
> +    information to the client, if the number of descriptors would be
> +    over 16 otherwise and looking up the information would be too
> +    resource-intensive for the server.
> 
> Seems a bit odd we permit this sort of rate limiting but don't
> e.g. rate-limit read.

Yes, I suppose you're right.

> +* `NBD_CMD_BLOCK_STATUS`
> +
> +    A block status query request. Length and offset define the range of
> +    interest. Clients MUST NOT use this request unless metadata
> +    contexts have been negotiated, which in turn requires the client to
> +    first negotiate structured replies. For a successful return, the
> +    server MUST use a structured reply, containing at least one chunk of
> +    type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file starting from specified *offset*, and the sum of the
> +    *length* fields of each descriptor MUST not be greater than the
> +    overall *length* of the request. This means that the server MAY
> +    return less data than required. However the server MUST return at
> +    least one status descriptor.  The server SHOULD use different
> +    *status* values between consecutive descriptors,
> 
> Why? This seems like a needless restriction.
> 
> +     and SHOULD use
> +    descriptor lengths that are an integer multiple of 512 bytes where
> +    possible (the first and last descriptor of an unaligned query being
> +    the most obvious places for an exception).
> 
> Why 512 bytes as opposed to 'minimum block size' (or is it because
> that is also an experimental extension)?

Yes, and this extension does not depend on that one; hence it is
also a SHOULD and not a MUST.

> +  The status flags are
> +    intentionally defined so that a server MAY always safely report a
> +    status of 0 for any block, although the server SHOULD return
> +    additional status values when they can be easily detected.
> +
> +    If an error occurs, the server SHOULD set the appropriate error
> +    code in the error field of either a simple reply or an error
> +    chunk.
> 
> We should probably point out that you can return an error half way
> through - that being the point of structured replies.

Right.

(also, I think it MUST NOT send a simple reply, but I'll fix that up
separately)

> +  However, if the error does not involve invalid usage (such
> +    as a request beyond the bounds of the file), a server MAY reply
> +    with a single block status descriptor with *length* matching the
> +    requested length, and *status* of 0 rather than reporting the
> +    error.
> 
> What's the point of this? This appears to say that a server can lie
> and return everything as not a hole, and not zero! Surely we're
> already covered from the DoS angle?

I'm not sure; I believe that wording came from the original patch on
which I based my work.

> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
> +    return the status of the device,
> 
> status of the metadata context

No, status of the device. A metadata context *describes* status; it
*isn't* one.

Perhaps "status of the device as per the given metadata context", but
hey.

> + where the status field of each
> +    descriptor is determined by the following bits (all combinations of
> +    these bits are possible):
> 
> In my mind these status bits are defined entirely by the metadata
> context, and the definitions below apply only to `BASE:allocation`

Yes, Vladimir made a similar observation, and my WIP patch has that too.

> +
> +      - `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole
> +        (and future writes to that area may cause fragmentation or
> +        encounter an `ENOSPC` error); if clear, the block is allocated
> +        or the server could not otherwise determine its status.  Note
> +        that the use of `NBD_CMD_TRIM` is related to this status, but
> +        that the server MAY report a hole even where trim has not been
> +        requested, and also that a server MAY report metadata even
> +        where a trim has been requested.
> +      - `NBD_STATE_ZERO` (bit 1): if set, the block contents read as
> +        all zeroes; if clear, the block contents are not known.  Note
> +        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
> +        status, but that the server MAY report zeroes even where write
> +        zeroes has not been requested, and also that a server MAY
> +        report unknown content even where write zeroes has been
> +        requested.
> 
> So the above two are `BASE:allocation` only, but ...
> 
> +      - `NBD_STATE_CLEAN` (bit 2): if set, the block represents a
> +        portion of the file that is still clean because it has not
> +        been written; if clear, the block represents a portion of the
> +        file that is dirty, or where the server could not otherwise
> +        determine its status. The server MUST NOT set this bit for
> +        the "BASE:allocation" context, where it has no meaning.
> +      - `NBD_STATE_ACTIVE` (bit 3): if set, the block represents a
> +        portion of the file that is "active" in the given metadata
> +        context. The server MUST NOT set this bit for the
> +        "BASE:allocation" context, where it has no meaning.
> +
> +    The exact semantics of what it means for a block to be "clean" or
> +    "active" at a given metadata context is not defined by this
> +    specification, except that the default in both cases should be to
> +    clear the bit. That is, when the metadata context does not have
> +    knowledge of the relevant status for the given extent, or when the
> +    metadata context does not assign any meaning to it, the bits
> +    should be cleared.
> 
> .... all the above should go as these describe the QEMU incremental
> backup thing, which we agreed, I think, not to describe here.

Right.

> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
> +    set and `NBD_STATE_ZERO` clear.
> 
> That makes no sense, as normal data has both these bits clear! This
> also implies that to comply with this SHOULD, a client needs to
> request block status before any read, which is ridiculous. This
> should be dropped.

No, it should not, although it may need rewording. It clarifies that
having STATE_HOLE set (i.e., there's no valid data in the given range)
and STATE_ZERO clear (i.e., we don't assert that it would read as
all-zeroes) is not an invalid thing for a server to set. The spec here
clarifies what a client should do with that information if it gets it
(i.e., "don't read it, it doesn't contain anything interesting").

> +A client MAY close the connection if it detects that the server has
> +sent an invalid chunk (such as lengths in the
> +`NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
> 
> I agree with the above, but this goes counter to the text above allowing
> the server to return lengths that do not sum to the requested length.
> Also the expression we use elsewhere is 'terminates transmission'
> for a close during the transmission phase.

Yes, indeed. I think I've been convinced that allowing the server to
send less data than was requested is indeed a bad idea (except when the
REQ_ONE bit is set), so I'll drop that, too.

> +The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
> +request including one or more sectors beyond the size of the device.
> +
> +The extension adds the following new command flag:
> +
> +- `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request information for only
> +  one extent per metadata context.
> +
> 
> Already handled above - this is a duplication

Yes.
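
Server side, the two quoted requirements reduce to something like the
sketch below; the handler shape and the flag's bit position are
illustrative only, not taken from any implementation:

    #include <errno.h>
    #include <limits.h>
    #include <stdint.h>

    #define NBD_CMD_FLAG_REQ_ONE (1 << 3)   /* illustrative bit position */

    /* Sketch: fail requests that run past the export with EINVAL, and
     * honour REQ_ONE by emitting at most one extent per context. */
    static int block_status_precheck(uint64_t offset, uint32_t length,
                                     uint16_t flags, uint64_t export_size,
                                     unsigned *max_extents)
    {
        if (offset + length < offset || offset + length > export_size)
            return EINVAL;
        *max_extents = (flags & NBD_CMD_FLAG_REQ_ONE) ? 1 : UINT_MAX;
        return 0;
    }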

My WIP patch moves this out from the (older) "BLOCK_STATUS extension"
section and into the main body of the spec. It also makes a few changes
in wording as per what Vladimir suggested, and I was working on an
NBD_OPT_LIST_META_CONTEXT rather than an NBD_OPT_META_CONTEXT
negotiation option, with the idea that I'd add an OPT_ADD_META_CONTEXT
and an OPT_DEL_META_CONTEXT later. Your idea of using a SET has merit
though, so I'll update it to that effect.

It already removes the two bits that BASE:allocation doesn't use, and
makes a few other changes as well. I haven't had the time to finish it
and send it out for review though, but I'll definitely include your
comments now.

Regards,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
@ 2016-12-08 14:13           ` Alex Bligh
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-08 14:13 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Alex Bligh, John Snow, nbd-general, Kevin Wolf,
	Stefan Hajnoczi, qemu-devel, Markus Pargmann,
	Pavel Borzenkov, Denis V. Lunev, Wouter Verhelst, Paolo Bonzini


> On 8 Dec 2016, at 06:58, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> wrote:
> 
> An idea: let's not use uppercase. Why shout the namespace? 'base' and 'x-' would be better, I think. BASE and X- will provoke all user-defined namespaces to be uppercase too, and a lot of uppercase will creep into the code =(

I agree

-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-12-08 14:40           ` Alex Bligh
  2016-12-08 15:59             ` Eric Blake
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Bligh @ 2016-12-08 14:40 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Alex Bligh, John Snow, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, qemu-devel, Paolo Bonzini,
	Pavel Borzenkov, Stefan Hajnoczi, Markus Pargmann,
	Denis V. Lunev

Wouter,

>> +- `NBD_OPT_META_CONTEXT` (10)
>> +
>> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
>> +    with no error message, clients
>> 
>> "*the* server" / "*the* cient"
>> 
>> Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
>> client querying the server.
> 
> I don't think that's necessarily a good idea. I think if the server
> replies with NBD_REP_ERR_INVALID, that means it understands the option
> but the user sent something invalid and now we haven't selected any
> contexts. That means the server won't be able to provide any metadata
> during transmission, either.
> 
> Perhaps it should be made clearer that once contexts have been selected
> (even if errors occurred later on), NBD_CMD_BLOCK_STATUS MAY be used.
> 
>> +    MAY send NBD_CMD_BLOCK_STATUS
>> +    commands during the transmission phase.
>> 
>> Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
>> the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
>> transmission phase."
> 
> That would run counter to the above.

My main point is that the current text does not prohibit sending
NBD_CMD_BLOCK_STATUS if it hasn't been established that the server
supports it.
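
In code terms, the gating being argued for is roughly this sketch
(names are hypothetical; the assumption is one NBD_REP_META_CONTEXT
reply per selected context, followed by NBD_REP_ACK):

    #include <stdbool.h>

    /* Sketch: a client treats NBD_CMD_BLOCK_STATUS as usable only once
     * negotiation has actually selected at least one context. */
    struct nbd_client_state {
        unsigned n_meta_contexts;    /* NBD_REP_META_CONTEXT replies seen */
        bool block_status_allowed;
    };

    static void meta_context_done(struct nbd_client_state *c, bool acked)
    {
        /* NBD_REP_ERR_UNSUP (or zero contexts selected) means the client
         * must not send NBD_CMD_BLOCK_STATUS during transmission. */
        c->block_status_allowed = acked && c->n_meta_contexts > 0;
    }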

>> Why not just define the format of a metadata-context string to be
>> of the form '<namespace>:<name>' (perhaps this definition goes
>> elsewhere), and here just say the returned list is a left-match
>> of all the available metadata-contexts, i.e. all those metadata
>> contexts whose names either start or consist entirely of the
>> specified string. If an empty string is specified, all metadata
>> contexts are returned.
> 
> I also want to make it possible for an implementation to define its own
> syntax. Say, a "last-snapshot" thing, or something that says
> "diff:snapshot1-snapshot2", or whatever.

That's a good idea, but doesn't preclude using a colon as a namespace
separator.
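
The left-match rule is a one-liner; a sketch, assuming the
'<namespace>:<name>' form:

    #include <stdbool.h>
    #include <string.h>

    /* Sketch: does context name 'ctx' (e.g. "base:allocation") match
     * client query 'query'?  An empty query matches every context;
     * otherwise the query must be a prefix of, or equal to, the name. */
    static bool context_matches(const char *ctx, const char *query)
    {
        return strncmp(ctx, query, strlen(query)) == 0;
    }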

>> +    The type may be one of:
>> +    - `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
>> +      selected by the query string is returned to the client without
>> +      changing any state (i.e., this does not add metadata contexts
>> +      for further usage).
>> 
>> Somewhere it should say this list is returned by sending
>> zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.
> 
> We do that above already.

Must have missed it.

>> A metadata context represents metadata concerning the selected
>> export in the form of a list of extents, each with a status. The
>> meaning of the metadata and the status is dependent upon the
>> context. Each metadata context has a name which is colon separated,
>> in the form '<namespace>:<name>'. Namespaces that start with "X-"
>> are vendor dependent extensions.
> 
> No, I wouldn't do that, since by definition, every namespace is
> vendor-dependent.
> 
> Maybe we could ask that people who want to implement their own metadata
> context type (rather than be compatible with an existing one) should
> register their namespace somewhere, though.

What I'm trying to say is that there should be three types of namespace:

* "BASE" (or better "base" as Vladimir points out) which is defined within
  the document, i.e. it will say "base:allocated" does X, "base:foo" does
  Y.
* Registered, e.g. "qemu", where the document would say: "The following
  is a list of registered namespaces: ... qemu: to qemu.org" or whatever.
* Unregistered, e.g. "X-Alex-Experiment" where the document merely mentions
  that namespaces beginning with "X-" can be used by anyone, like X- headers
  in SMTP and HTTP.

The purpose of distinguishing between registered and unregistered is
that otherwise we might get two vendors with a product called "fastnbd"
(or whatever) who both pick the same namespace.

I suppose an alternative would be to go the Java-ish way and suggest
people use a domain name (so 'qemu' would be 'qemu.org:whatever').

>> + metadata context is the basic "exists at all" metadata context.
>> 
>> Disagree. You're saying that if a server supports metadata contexts
>> at all, it must support this one.
> 
> No, I'm trying to say that this metadata context exposes whether the
> *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
> probably clarify that wording then, if you misunderstood it in that way.

Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage
space'. Or 'is not a hole'?

> No server MUST implement it (although we might want to make it a SHOULD)

I don't see why it should even be a 'SHOULD', to be honest. An NBD
server cooked up for a specific purpose, or with a backend that
can't provide that information (or where there is never a hole),
shouldn't be criticised.

>> I think the last sentence is probably meant to read something like:
>> 
>> If a server supports the "BASE:allocation" metadata context, then
>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>> with ENOSPC.
> 
> No, it can't.
> 
> Other metadata contexts may change the semantics to the extent that if
> the block is allocated at the BASE:allocation context, it would still
> need space on another plane. In that case, writing to an area which
> is not STATE_HOLE might still fail.

I understand the point (though I'm slightly dubious about it given
the discussion we had with the WRITE_ZEROES extension, where I
thought we agreed that the purpose of actually writing zeroes to
disk was to avoid ENOSPC errors). However, if that is the case, I'm
not sure what you're trying to say in the original text.

>> Why 512 bytes as opposed to 'minimum block size' (or is it because
>> that is also an experimental extension)?
> 
> Yes, and this extension does not depend on that one. Hence why it is
> also a SHOULD and not a MUST.

OK. As a separate discussion I think we should talk about promoting
WRITE_ZEROES and INFO. The former has a reference implementation,
I think Eric did a qemu implementation, and I did a gonbdserver
implementation. The latter I believe lacks a reference implementation.

>> +  The status flags are
>> +    intentionally defined so that a server MAY always safely report a
>> +    status of 0 for any block, although the server SHOULD return
>> +    additional status values when they can be easily detected.
>> +
>> +    If an error occurs, the server SHOULD set the appropriate error
>> +    code in the error field of either a simple reply or an error
>> +    chunk.
>> 
>> We should probably point out that you can return an error half way
>> through - that being the point of structured replies.
> 
> Right.
> 
> (also, I think it MUST NOT send a simple reply, but I'll fix that up
> separately)

A simple error reply would normally be permitted.
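
A server-side sketch of the mid-reply error case; the chunk-emitting
helpers and the status lookup are hypothetical, declared here only so
the sketch is self-contained:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct nbd_conn;   /* opaque connection handle (hypothetical) */
    int  lookup_status(uint64_t off, uint32_t max,
                       uint32_t *len, uint32_t *flags);
    void send_status_chunk(struct nbd_conn *c, uint64_t handle,
                           uint32_t len, uint32_t flags, bool final);
    void send_error_chunk(struct nbd_conn *c, uint64_t handle,
                          int err, bool final);

    /* Sketch: with structured replies, a server that fails part-way
     * through can send the chunks it did compute, then finish the
     * reply with an error chunk instead of the remaining data. */
    static void reply_block_status(struct nbd_conn *conn, uint64_t handle,
                                   uint64_t offset, uint32_t length)
    {
        uint32_t done = 0;
        while (done < length) {
            uint32_t ext_len, ext_flags;
            if (lookup_status(offset + done, length - done,
                              &ext_len, &ext_flags) < 0 || ext_len == 0) {
                send_error_chunk(conn, handle, EIO, true /* final */);
                return;
            }
            done += ext_len;
            send_status_chunk(conn, handle, ext_len, ext_flags,
                              done == length /* final */);
        }
    }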

>> +  However, if the error does not involve invalid usage (such
>> +    as a request beyond the bounds of the file), a server MAY reply
>> +    with a single block status descriptor with *length* matching the
>> +    requested length, and *status* of 0 rather than reporting the
>> +    error.
>> 
>> What's the point of this? This appears to say that a server can lie
>> and return everything as not a hole, and not zero! Surely we're
>> already covered from the DoS angle?
> 
> I'm not sure, I believe that wording came from the original patch on
> which I based my work.

Sure, I think it's left over detritus.

>> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
>> +    return the status of the device,
>> 
>> status of the metadata context
> 
> No, status of the device. A metadata context *describes* status, it
> *isn't* one.
> 
> Perhaps "status of the device as per the given metadata context", but
> hey.

It's actually 'device' I'm arguing with. Server side we refer to
them as 'exports'. Often they aren't files at all; in gonbdserver,
for instance, it might be a Ceph connection.

'return the status of the relevant portion of the export (as per the
given metadata context)'?

>> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>> +    set and `NBD_STATE_ZERO` clear.
>> 
>> That makes no sense, as normal data has both these bits clear! This
>> also implies that to comply with this SHOULD, a client needs to
>> request block status before any read, which is ridiculous. This
>> should be dropped.
> 
> No, it should not, although it may need rewording. It clarifies that
> having STATE_HOLE set (i.e., there's no valid data in the given range)
> and STATE_ZERO clear (i.e., we don't assert that it would read as
> all-zeroes) is not an invalid thing for a server to set. The spec here
> clarifies what a client should do with that information if it gets it
> (i.e., "don't read it, it doesn't contain anything interesting").

That's fair enough until the last bit in brackets. Rather than saying
a client SHOULD NOT read it, it should simply say that a read on
such areas will succeed but the data read is undefined (and may
not be stable).

> My WIP patch moves this out from the (older) "BLOCK_STATUS extension"
> section and into the main body of the spec. It also makes a few changes
> in wording as per what Vladimir suggested, and I was working on an
> NBD_OPT_LIST_META_CONTEXT rather than an NBD_OPT_META_CONTEXT
> negotiation option, with the idea that I'd add an OPT_ADD_META_CONTEXT
> and an OPT_DEL_META_CONTEXT later. Your idea of using a SET has merit
> though, so I'll update it to that effect.
> 
> It already removes the two bits that BASE:allocation doesn't use, and
> makes a few other changes as well. I haven't had the time to finish it
> and send it out for review though, but I'll definitely include your
> comments now.

Thanks.

-- 
Alex Bligh


* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08 14:40           ` Alex Bligh
@ 2016-12-08 15:59             ` Eric Blake
  2016-12-08 16:03               ` Alex Bligh
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Blake @ 2016-12-08 15:59 UTC (permalink / raw)
  To: Alex Bligh, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	Stefan Hajnoczi, qemu-devel, Pavel Borzenkov,
	Denis V. Lunev, Markus Pargmann, Paolo Bonzini, John Snow


On 12/08/2016 08:40 AM, Alex Bligh wrote:

>>> + metadata context is the basic "exists at all" metadata context.
>>>
>>> Disagree. You're saying that if a server supports metadata contexts
>>> at all, it must support this one.
>>
>> No, I'm trying to say that this metadata context exposes whether the
>> *block* exists at all (i.e., it exports NBD_STATE_HOLE). I should
>> probably clarify that wording then, if you misunderstood it in that way.
> 
> Ah. Perhaps 'exists at all' itself is misleading. 'Occupies storage
> space'. Or 'is not a hole'?
> 
>> No server MUST implement it (although we might want to make it a SHOULD)
> 
> Don't see why it should even be a 'SHOULD' to be honest. An nbd
> server cooked up for a specific purpose, or with a backend that
> can't provide that (or where there is never a hole) shouldn't
> be criticised.
> 
>>> I think the last sentence is probably meant to read something like:
>>>
>>> If a server supports the "BASE:allocation" metadata context, then
>>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>>> with ENOSPC.
>>
>> No, it can't.
>>
>> Other metadata contexts may change the semantics to the extent that if
>> the block is allocated at the BASE:allocation context, it would still
>> need space on another plane. In that case, writing to an area which
>> is not STATE_HOLE might still fail.

Not just that, but it is ALWAYS permissible to report NBD_STATE_HOLE as
clear (just not always optimal) - to allow servers that can't determine
sparseness information, but DO know how to communicate extents to the
client.  Yes, it is boring to communicate a single extent saying that
the entire file is data, and clients can't optimize their usage in that
case, but it should always be considered semantically correct to do so,
since the presence and knowledge of holes is merely an optimization
opportunity, not a data correctness issue.
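
Put as code: a server with no sparseness information can always fall
back to one conservative extent (a sketch, reusing the hypothetical
send_status_chunk helper from the sketch earlier in the thread):

    /* Sketch: report the whole requested range as a single extent with
     * status 0 (no HOLE, no ZERO) -- always correct, never optimal. */
    static void reply_all_data(struct nbd_conn *conn, uint64_t handle,
                               uint32_t length)
    {
        send_status_chunk(conn, handle, length, 0, true /* final */);
    }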


>>> Why 512 bytes as opposed to 'minimum block size' (or is it because
>>> that is also an experimental extension)?
>>
>> Yes, and this extension does not depend on that one. Hence why it is
>> also a SHOULD and not a MUST.
> 
> OK. As a separate discussion I think we should talk about promoting
> WRITE_ZEROES and INFO. The former has a reference implementation,
> I think Eric did a qemu implementation, and I did a gonbdserver
> implementation. The latter I believe lacks a reference implementation.

Yes, I still have reference qemu patches for INFO; they did not make it
into qemu 2.8 (while WRITE_ZEROES did), but should make it into 2.9.

I also hope to get structured reads into qemu 2.9, but that's a bigger
task, as I don't have reference patches complete yet.  On the other
hand, since BLOCK_STATUS depends on structured reply, I have all the
more reason to complete it soon.

>>> +    A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>>> +    set and `NBD_STATE_ZERO` clear.
>>>
>>> That makes no sense, as normal data has both these bits clear! This
>>> also implies that to comply with this SHOULD, a client needs to
>>> request block status before any read, which is ridiculous. This
>>> should be dropped.
>>
>> No, it should not, although it may need rewording. It clarifies that
>> having STATE_HOLE set (i.e., there's no valid data in the given range)
>> and STATE_ZERO clear (i.e., we don't assert that it would read as
>> all-zeroes) is not an invalid thing for a server to set. The spec here
>> clarifies what a client should do with that information if it gets it
>> (i.e., "don't read it, it doesn't contain anything interesting").
> 
> That's fair enough until the last bit in brackets. Rather than saying
> a client SHOULD NOT read it, it should simply say that a read on
> such areas will succeed but the data read is undefined (and may
> not be stable).

We should use similar wording to whatever we already say about what a
client would see when reading data cleared by NBD_CMD_TRIM.  After all,
the status of STATE_HOLE set and STATE_ZERO clear is what you logically
get when TRIM cannot guarantee reads-as-zero.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-12-08 15:59             ` Eric Blake
@ 2016-12-08 16:03               ` Alex Bligh
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Bligh @ 2016-12-08 16:03 UTC (permalink / raw)
  To: Eric Blake
  Cc: Alex Bligh, Wouter Verhelst, nbd-general, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	qemu-devel, Pavel Borzenkov, Denis V. Lunev, Markus Pargmann,
	Paolo Bonzini, John Snow



> On 8 Dec 2016, at 15:59, Eric Blake <eblake@redhat.com> wrote:
> 
> We should use similar wording to whatever we already say about what a
> client would see when reading data cleared by NBD_CMD_TRIM.  After all,
> the status of STATE_HOLE set and STATE_ZERO clear is what you logically
> get when TRIM cannot guarantee reads-as-zero.

Yes. It was actually exactly that discussion I was trying to remember.

--
Alex Bligh



end of thread

Thread overview: 37+ messages
2016-11-25 11:28 [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension Vladimir Sementsov-Ogievskiy
2016-11-25 14:02 ` Stefan Hajnoczi
2016-11-27 19:17 ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-11-28 11:19   ` Stefan Hajnoczi
2016-11-28 17:33     ` Wouter Verhelst
2016-11-29  9:17       ` Stefan Hajnoczi
2016-11-29 10:50       ` Wouter Verhelst
2016-11-29 12:41         ` Vladimir Sementsov-Ogievskiy
2016-11-29 13:08           ` Wouter Verhelst
2016-11-29 13:07         ` Alex Bligh
2016-12-01 10:14         ` Wouter Verhelst
2016-12-01 11:26           ` Vladimir Sementsov-Ogievskiy
2016-12-02  9:25             ` Wouter Verhelst
2016-11-28 23:15   ` John Snow
2016-11-29 10:18   ` Kevin Wolf
2016-11-29 11:34     ` Vladimir Sementsov-Ogievskiy
2016-11-30 10:41   ` Sergey Talantov
2016-11-29 12:57 ` [Qemu-devel] " Alex Bligh
2016-11-29 14:36   ` Vladimir Sementsov-Ogievskiy
2016-11-29 14:52     ` Alex Bligh
2016-11-29 15:07       ` Vladimir Sementsov-Ogievskiy
2016-11-29 15:17         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-12-01 23:42   ` [Qemu-devel] " John Snow
2016-12-02  9:16     ` Vladimir Sementsov-Ogievskiy
2016-12-02 18:45     ` Alex Bligh
2016-12-02 20:39       ` John Snow
2016-12-03 11:08         ` Alex Bligh
2016-12-05  8:36         ` Vladimir Sementsov-Ogievskiy
2016-12-06 13:32         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-12-06 16:39           ` John Snow
2016-12-08  3:39       ` [Qemu-devel] " Alex Bligh
2016-12-08  6:58         ` Vladimir Sementsov-Ogievskiy
2016-12-08 14:13           ` Alex Bligh
2016-12-08  9:44         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-12-08 14:40           ` Alex Bligh
2016-12-08 15:59             ` Eric Blake
2016-12-08 16:03               ` Alex Bligh
