All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
@ 2016-04-04 16:39 Eric Blake
  2016-04-04 18:06 ` [Qemu-devel] [Nbd] " Alex Bligh
                   ` (4 more replies)
  0 siblings, 5 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-04 16:39 UTC (permalink / raw)
  To: nbd-general
  Cc: Kevin Wolf, qemu-devel, Paolo Bonzini, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev, Wouter Verhelst

With the availability of sparse storage formats, it is often needed
to query status of a particular range and read only those blocks of
data that are actually present on the block device.

To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.

There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors).  Future NBD extensions may add commands
to control dirtiness through NBD.

Since NBD protocol has no notion of block size, and to mimic SCSI
"GET LBA STATUS" command more closely, it has been chosen to return
a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
instead of a bitmap.

CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
---

v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html

Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather larger rebase; I've also changed things
to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
modes to be determined by boolean flags (rather than by fixed
values of the 16-bit flags field), changed the reply status fields
to be bitwise-or values (with a default of 0 always being sane),
and changed the descriptor layout to drop an offset but to include
a 32-bit status so that the descriptor is nicely 8-byte aligned
without padding.

 doc/proto.md | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 151 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index fb97217..d280bbd 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -296,6 +296,8 @@ immediately after the handshake flags field in oldstyle negotiation:
   `WRITE_ZEROES` extension; see below.
 - bit 7, `NBD_FLAG_SEND_DF`; defined by the experimental `STRUCTURED_REPLY`
   extension; see below.
+- bit 8, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
+  `BLOCK_STATUS` extension; see below.

 Clients SHOULD ignore unknown flags.

@@ -583,6 +585,10 @@ The following request types exist:

     Defined by the experimental `WRITE_ZEROES` extension; see below.

+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    Defined by the experimental `BLOCK_STATUS` extension; see below.
+
 * Other requests

     Some third-party implementations may require additional protocol
@@ -922,7 +928,8 @@ error, and alters the reply to the `NBD_CMD_READ` request.
       were sent earlier in the structured reply, the server SHOULD NOT
       send multiple distinct offsets that lie within the bounds of a
       single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-      `NBD_CMD_WRITE`, `NBD_CMD_WRITE_ZEROES`, and `NBD_CMD_TRIM`.
+      `NBD_CMD_WRITE`, `NBD_CMD_WRITE_ZEROES`, `NBD_CMD_TRIM`, and
+      `NBD_CMD_BLOCK_STATUS`.

       The payload is structured as:

@@ -965,6 +972,10 @@ error, and alters the reply to the `NBD_CMD_READ` request.
       64 bits: offset (unsigned)  
       32 bits: hole size (unsigned, MUST be nonzero)  

+    - `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+      Defined by the experimental extension `BLOCK_STATUS`; see below.
+
 * `NBD_OPT_STRUCTURED_REPLY`

     The client wishes to use structured replies during the
@@ -1094,6 +1105,145 @@ error, and alters the reply to the `NBD_CMD_READ` request.
     support unfragmented reads in which the client's request length
     does not exceed 65,536 bytes.

+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often needed to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.
+
+To provide such class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states.  This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.
+
+* `NBD_FLAG_SEND_BLOCK_STATUS`
+
+    The server SHOULD set this transmission flag to 1 if structured
+    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
+    request is supported.
+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be a positive integer multiple of 8.  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors equals the length of the
+    original request.  This chunk type MUST appear at most once in a
+    structured reply. Valid as a reply to `NBD_CMD_BLOCK_STATUS.
+
+    The payload is structured as a list of one or more descriptors,
+    each with this layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    The definition of the status flags is determined based on the
+    flags present in the original request.
+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range
+    of interest. Clients SHOULD NOT use this request unless the server
+    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
+    in turn requires the client to first negotiate structured replies.
+    For a successful return, the server MUST use a structured reply,
+    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
+    of the file, and the sum of the *length* fields of each descriptor
+    MUST match the overall *length* of the request.  The server SHOULD
+    use different *status* values between consecutive descriptors, and
+    SHOULD use descriptor lengths that are an integer multiple of 512
+    bytes where possible (the first and last descriptor of an
+    unaligned query being the most obvious places for an exception).
+    The status flags are intentionally defined so that a server MAY
+    always safely report a status of 0 for any block, although the
+    server SHOULD return additional status values when they can be
+    easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk.  However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.
+
+    The type of information requested by the client is determined by
+    the request flags, as follows:
+
+    1. Block provisioning status
+
+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
+    provisioning status of the device, where the status field of each
+    descriptor is determined by the following bits (all four
+    combinations of these two bits are possible):
+
+      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report allocation even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.
+
+    The client SHOULD NOT read from an area that has both
+    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
+
+    2. Block dirtiness status
+
+    This command is meant to operate in tandem with other (non-NBD)
+    channels to the server.  Generally, a "dirty" block is a block
+    that has been written to by someone, but the exact meaning of "has
+    been written" is left to the implementation.  For example, a
+    virtual machine monitor could provide a (non-NBD) command to start
+    tracking blocks written by the virtual machine.  A backup client
+    can then connect to an NBD server provided by the virtual machine
+    monitor and use `NBD_CMD_BLOCK_STATUS` with the
+    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
+    blocks that the virtual machine has changed.
+
+    An implementation that doesn't track the "dirtiness" state of
+    blocks MUST either fail this command with `EINVAL`, or mark all
+    blocks as dirty in the descriptor that it returns.  Upon receiving
+    an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
+    status of the device, where the status field of each descriptor is
+    determined by the following bit:
+
+      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status.
+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunks (such as lengths in the
+`NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
+The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
+request including one or more sectors beyond the size of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request dirtiness status
+  rather than provisioning status.
+
 ## About this file

 This file tries to document the NBD protocol as it is currently
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
@ 2016-04-04 18:06 ` Alex Bligh
  2016-04-04 19:34   ` Eric Blake
  2016-04-04 22:40 ` Wouter Verhelst
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 18:06 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini


On 4 Apr 2016, at 17:39, Eric Blake <eblake@redhat.com> wrote:

> +    This command is meant to operate in tandem with other (non-NBD)
> +    channels to the server.  Generally, a "dirty" block is a block
> +    that has been written to by someone, but the exact meaning of "has
> +    been written" is left to the implementation.  For example, a
> +    virtual machine monitor could provide a (non-NBD) command to start
> +    tracking blocks written by the virtual machine.  A backup client
> +    can then connect to an NBD server provided by the virtual machine
> +    monitor and use `NBD_CMD_BLOCK_STATUS` with the
> +    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
> +    blocks that the virtual machine has changed.
> +
> +    An implementation that doesn't track the "dirtiness" state of
> +    blocks MUST either fail this command with `EINVAL`, or mark all
> +    blocks as dirty in the descriptor that it returns.  Upon receiving
> +    an `NBD_CMD_BLOCK_STATUS` command with the flag
> +    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
> +    status of the device, where the status field of each descriptor is
> +    determined by the following bit:
> +
> +      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
> +        portion of the file that is still clean because it has not
> +        been written; if clear, the block represents a portion of the
> +        file that is dirty, or where the server could not otherwise
> +        determine its status.

A couple of questions:

1. I am not sure that the block dirtiness and the zero/allocation/hole thing
   always have the same natural blocksize. It's pretty easy to imagine
   a server whose natural blocksize is a disk sector (and can therefore
   report presence of zeroes to that resolution) but where 'dirtiness'
   was maintained independently at a less fine-grained level. Maybe
   that suggests 2 commands would be useful.

2. Given the communication is out of band, how is it realistically
   possible to sync this backup? You'll ask for all the dirty blocks,
   but whilst the command is being executed (as well as immediately
   after the reply) further blocks may be dirtied. So your reply
   always overestimates what is clean (probably the wrong way around).
   Furthermore, the next time you do a 'backup', you don't know whether
   the blocks were dirty as they were dirty on the previous backup,
   or because they were dirty on this backup.

If I was designing a backup protocol (off the top of my head) I'd
make all commands return a monotonic 64 bit counter of the number of
writes to the disk since some arbitrary time, and provide a 'GETDIRTY'
command that returned all blocks with a monotonic counter greater than that.
That way I could precisely get the writes that were executed since
any particular read. You'd allow it to be 'slack' and include things
in that list that might not have changed (i.e. false positives) but
not false negatives.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 18:06 ` [Qemu-devel] [Nbd] " Alex Bligh
@ 2016-04-04 19:34   ` Eric Blake
  2016-04-04 19:54     ` Denis V. Lunev
                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-04 19:34 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 5503 bytes --]

On 04/04/2016 12:06 PM, Alex Bligh wrote:
> 
> On 4 Apr 2016, at 17:39, Eric Blake <eblake@redhat.com> wrote:
> 
>> +    This command is meant to operate in tandem with other (non-NBD)
>> +    channels to the server.  Generally, a "dirty" block is a block
>> +    that has been written to by someone, but the exact meaning of "has
>> +    been written" is left to the implementation.  For example, a
>> +    virtual machine monitor could provide a (non-NBD) command to start
>> +    tracking blocks written by the virtual machine.  A backup client
>> +    can then connect to an NBD server provided by the virtual machine
>> +    monitor and use `NBD_CMD_BLOCK_STATUS` with the
>> +    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
>> +    blocks that the virtual machine has changed.
>> +
>> +    An implementation that doesn't track the "dirtiness" state of
>> +    blocks MUST either fail this command with `EINVAL`, or mark all
>> +    blocks as dirty in the descriptor that it returns.  Upon receiving
>> +    an `NBD_CMD_BLOCK_STATUS` command with the flag
>> +    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
>> +    status of the device, where the status field of each descriptor is
>> +    determined by the following bit:
>> +
>> +      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
>> +        portion of the file that is still clean because it has not
>> +        been written; if clear, the block represents a portion of the
>> +        file that is dirty, or where the server could not otherwise
>> +        determine its status.
> 
> A couple of questions:
> 
> 1. I am not sure that the block dirtiness and the zero/allocation/hole thing
>    always have the same natural blocksize. It's pretty easy to imagine
>    a server whose natural blocksize is a disk sector (and can therefore
>    report presence of zeroes to that resolution) but where 'dirtiness'
>    was maintained independently at a less fine-grained level. Maybe
>    that suggests 2 commands would be useful.

In fact, qemu does just that with qcow2 images - the user can request a
dirtiness granularity that is much larger than cluster granularity
(where clusters are the current limitation on reporting holes, but where
Kevin Wolf has an idea about a potential qcow2 extension that would even
let us report holes at a sector granularity).

Nothing requires the two uses to report at the same granularity.  THe
NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
as it sees fit (so it could report holes at a 4k granularity, but
dirtiness only at a 64k granularity) - all that matters is that when all
the descriptors have been sent, they total up to the length of the
original client request.  So by itself, granularity does not require
another command.

> 
> 2. Given the communication is out of band, how is it realistically
>    possible to sync this backup? You'll ask for all the dirty blocks,
>    but whilst the command is being executed (as well as immediately
>    after the reply) further blocks may be dirtied. So your reply
>    always overestimates what is clean (probably the wrong way around).
>    Furthermore, the next time you do a 'backup', you don't know whether
>    the blocks were dirty as they were dirty on the previous backup,
>    or because they were dirty on this backup.

You are correct that as a one-way operation, querying dirtiness is not
very useful if there is not a way to mark something clean, or if
something else can be dirtying things in parallel.  But that doesn't
mean the command is not useful - if the NBD server is exporting a file
as read-only, where nothing else can be dirtying it in parallel, then a
single pass over the dirty information is sufficient to learn what
portions of the file to copy out.

At this point, I was just trying to rebase the proposal as originally
made by Denis and Pavel; perhaps they will have more insight on how they
envisioned using the command, or on whether we should try harder to make
this more of a two-way protocol (where the client can tell the server
when to mark something as clean, or when to start tracking whether
something is dirty).

> 
> If I was designing a backup protocol (off the top of my head) I'd
> make all commands return a monotonic 64 bit counter of the number of
> writes to the disk since some arbitrary time, and provide a 'GETDIRTY'
> command that returned all blocks with a monotonic counter greater than that.
> That way I could precisely get the writes that were executed since
> any particular read. You'd allow it to be 'slack' and include things
> in that list that might not have changed (i.e. false positives) but
> not false negatives.

Yes, that might work as an implementation - but there's the question of
whether other implementations would also work.  We want the protocol to
describe the concept, and not be too heavily tied to one particular
implementation.

The documentation is also trying to be very straightforward that asking
about dirtiness requires out-of-band coordination, and that a server can
just blindly report everything as dirty if there is no better thing to
report.  So anyone actually making use of this command already has to be
aware of the out-of-band coordination needed to make it useful.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:34   ` Eric Blake
@ 2016-04-04 19:54     ` Denis V. Lunev
  2016-04-04 20:03       ` Alex Bligh
  2016-04-04 23:08       ` Wouter Verhelst
  2016-04-04 19:58     ` Alex Bligh
  2016-04-05 13:38     ` Paolo Bonzini
  2 siblings, 2 replies; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 19:54 UTC (permalink / raw)
  To: Eric Blake, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 10:34 PM, Eric Blake wrote:
> On 04/04/2016 12:06 PM, Alex Bligh wrote:
>> On 4 Apr 2016, at 17:39, Eric Blake <eblake@redhat.com> wrote:
>>
>>> +    This command is meant to operate in tandem with other (non-NBD)
>>> +    channels to the server.  Generally, a "dirty" block is a block
>>> +    that has been written to by someone, but the exact meaning of "has
>>> +    been written" is left to the implementation.  For example, a
>>> +    virtual machine monitor could provide a (non-NBD) command to start
>>> +    tracking blocks written by the virtual machine.  A backup client
>>> +    can then connect to an NBD server provided by the virtual machine
>>> +    monitor and use `NBD_CMD_BLOCK_STATUS` with the
>>> +    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
>>> +    blocks that the virtual machine has changed.
>>> +
>>> +    An implementation that doesn't track the "dirtiness" state of
>>> +    blocks MUST either fail this command with `EINVAL`, or mark all
>>> +    blocks as dirty in the descriptor that it returns.  Upon receiving
>>> +    an `NBD_CMD_BLOCK_STATUS` command with the flag
>>> +    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
>>> +    status of the device, where the status field of each descriptor is
>>> +    determined by the following bit:
>>> +
>>> +      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
>>> +        portion of the file that is still clean because it has not
>>> +        been written; if clear, the block represents a portion of the
>>> +        file that is dirty, or where the server could not otherwise
>>> +        determine its status.
>> A couple of questions:
>>
>> 1. I am not sure that the block dirtiness and the zero/allocation/hole thing
>>     always have the same natural blocksize. It's pretty easy to imagine
>>     a server whose natural blocksize is a disk sector (and can therefore
>>     report presence of zeroes to that resolution) but where 'dirtiness'
>>     was maintained independently at a less fine-grained level. Maybe
>>     that suggests 2 commands would be useful.
> In fact, qemu does just that with qcow2 images - the user can request a
> dirtiness granularity that is much larger than cluster granularity
> (where clusters are the current limitation on reporting holes, but where
> Kevin Wolf has an idea about a potential qcow2 extension that would even
> let us report holes at a sector granularity).
>
> Nothing requires the two uses to report at the same granularity.  THe
> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
> as it sees fit (so it could report holes at a 4k granularity, but
> dirtiness only at a 64k granularity) - all that matters is that when all
> the descriptors have been sent, they total up to the length of the
> original client request.  So by itself, granularity does not require
> another command.
exactly!


>> 2. Given the communication is out of band, how is it realistically
>>     possible to sync this backup? You'll ask for all the dirty blocks,
>>     but whilst the command is being executed (as well as immediately
>>     after the reply) further blocks may be dirtied. So your reply
>>     always overestimates what is clean (probably the wrong way around).
>>     Furthermore, the next time you do a 'backup', you don't know whether
>>     the blocks were dirty as they were dirty on the previous backup,
>>     or because they were dirty on this backup.
> You are correct that as a one-way operation, querying dirtiness is not
> very useful if there is not a way to mark something clean, or if
> something else can be dirtying things in parallel.  But that doesn't
> mean the command is not useful - if the NBD server is exporting a file
> as read-only, where nothing else can be dirtying it in parallel, then a
> single pass over the dirty information is sufficient to learn what
> portions of the file to copy out.
>
> At this point, I was just trying to rebase the proposal as originally
> made by Denis and Pavel; perhaps they will have more insight on how they
> envisioned using the command, or on whether we should try harder to make
> this more of a two-way protocol (where the client can tell the server
> when to mark something as clean, or when to start tracking whether
> something is dirty).
for now and for QEMU we want this to expose accumulated dirtiness
of the block device, which is collected by the server. Yes, this requires
external coordination. May be this COULD be the part of the protocol,
but QEMU will not use that part of the protocol.

saying about dirtiness, we would soon come to the fact, that
we can have several dirtiness states regarding different
lines of incremental backups. This complexity is hidden
inside QEMU and it would be very difficult to publish and
reuse it.


>> If I was designing a backup protocol (off the top of my head) I'd
>> make all commands return a monotonic 64 bit counter of the number of
>> writes to the disk since some arbitrary time, and provide a 'GETDIRTY'
>> command that returned all blocks with a monotonic counter greater than that.
>> That way I could precisely get the writes that were executed since
>> any particular read. You'd allow it to be 'slack' and include things
>> in that list that might not have changed (i.e. false positives) but
>> not false negatives.
> Yes, that might work as an implementation - but there's the question of
> whether other implementations would also work.  We want the protocol to
> describe the concept, and not be too heavily tied to one particular
> implementation.
>
> The documentation is also trying to be very straightforward that asking
> about dirtiness requires out-of-band coordination, and that a server can
> just blindly report everything as dirty if there is no better thing to
> report.  So anyone actually making use of this command already has to be
> aware of the out-of-band coordination needed to make it useful.
>
yes, and this approach is perfect. If there is no information about
dirtiness, we should report this as all dirty. Though this information
could be type-specific.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:34   ` Eric Blake
  2016-04-04 19:54     ` Denis V. Lunev
@ 2016-04-04 19:58     ` Alex Bligh
  2016-04-04 20:04       ` Denis V. Lunev
  2016-04-04 20:22       ` Eric Blake
  2016-04-05 13:38     ` Paolo Bonzini
  2 siblings, 2 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 19:58 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1236 bytes --]

Eric,

> Nothing requires the two uses to report at the same granularity.  THe
> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
> as it sees fit (so it could report holes at a 4k granularity, but
> dirtiness only at a 64k granularity) - all that matters is that when all
> the descriptors have been sent, they total up to the length of the
> original client request.  So by itself, granularity does not require
> another command.

Sure, but given you can't report dirtiness without also reporting
allocation, if they are are at different blocksize I'd rather they
were in different commands, because otherwise the code to report
block size needs to work at two different granularities.

> At this point, I was just trying to rebase the proposal as originally
> made by Denis and Pavel; perhaps they will have more insight on how they
> envisioned using the command, or on whether we should try harder to make
> this more of a two-way protocol (where the client can tell the server
> when to mark something as clean, or when to start tracking whether
> something is dirty).

I'm not sure it needs to be a two way protocol (see my strawman) but
I'd like to know it's at least possibly useful.

--
Alex Bligh





[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:54     ` Denis V. Lunev
@ 2016-04-04 20:03       ` Alex Bligh
  2016-04-04 20:08         ` Denis V. Lunev
  2016-04-05 14:15         ` Paolo Bonzini
  2016-04-04 23:08       ` Wouter Verhelst
  1 sibling, 2 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 20:03 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Wouter Verhelst, Paolo Bonzini


On 4 Apr 2016, at 20:54, Denis V. Lunev <den@openvz.org> wrote:

> for now and for QEMU we want this to expose accumulated dirtiness
> of the block device, which is collected by the server. Yes, this requires
> external coordination. May be this COULD be the part of the protocol,
> but QEMU will not use that part of the protocol.
> 
> saying about dirtiness, we would soon come to the fact, that
> we can have several dirtiness states regarding different
> lines of incremental backups. This complexity is hidden
> inside QEMU and it would be very difficult to publish and
> reuse it.

So qemu-nbdserver has some idea of 'dirtiness', and you want
to publish that, and the consumer is qemu (and only qemu),
because only qemu knows what 'dirtiness' means in the
sense the server provides it?

I've nothing against helping qemu out here (having
contributed to qemu myself), but perhaps it might be better then,
rather than call it 'dirtiness' to make it some named attribute
for private client/server communication.

This again makes me think this should be a different
command from something which is obviously useful and
comprehensible to more than one server/client (i.e.
allocation).

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:58     ` Alex Bligh
@ 2016-04-04 20:04       ` Denis V. Lunev
  2016-04-04 20:08         ` Alex Bligh
  2016-04-04 20:22       ` Eric Blake
  1 sibling, 1 reply; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 20:04 UTC (permalink / raw)
  To: Alex Bligh, Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 10:58 PM, Alex Bligh wrote:
> Eric,
>
>> Nothing requires the two uses to report at the same granularity.  THe
>> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
>> as it sees fit (so it could report holes at a 4k granularity, but
>> dirtiness only at a 64k granularity) - all that matters is that when all
>> the descriptors have been sent, they total up to the length of the
>> original client request.  So by itself, granularity does not require
>> another command.
> Sure, but given you can't report dirtiness without also reporting
> allocation, if they are are at different blocksize I'd rather they
> were in different commands, because otherwise the code to report
> block size needs to work at two different granularities.
>
'dirty' could come after the data was 'trimmed' from the region!
thus we could have dirty unallocated data.

>> At this point, I was just trying to rebase the proposal as originally
>> made by Denis and Pavel; perhaps they will have more insight on how they
>> envisioned using the command, or on whether we should try harder to make
>> this more of a two-way protocol (where the client can tell the server
>> when to mark something as clean, or when to start tracking whether
>> something is dirty).
> I'm not sure it needs to be a two way protocol (see my strawman) but
> I'd like to know it's at least possibly useful.
>
> --
> Alex Bligh
>
>
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:04       ` Denis V. Lunev
@ 2016-04-04 20:08         ` Alex Bligh
  2016-04-04 20:13           ` Denis V. Lunev
  2016-04-04 20:26           ` Eric Blake
  0 siblings, 2 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 20:08 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Wouter Verhelst, Paolo Bonzini


On 4 Apr 2016, at 21:04, Denis V. Lunev <den@openvz.org> wrote:

>> Sure, but given you can't report dirtiness without also reporting
>> allocation, if they are are at different blocksize I'd rather they
>> were in different commands, because otherwise the code to report
>> block size needs to work at two different granularities.
>> 
> 'dirty' could come after the data was 'trimmed' from the region!
> thus we could have dirty unallocated data.

Let me rephrase.

Under the current proposal it is not possible to report whether or
not a region is dirty without also reporting whether or not it
is allocated. As these two concepts exist at potentially
different block sizes, the code to support reporting on allocation
must now be able to run both at the allocation blocksize and
the dirtiness blocksize, which is going to be a pain.

If these were two different commands, they could each run at their
natural block size.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:03       ` Alex Bligh
@ 2016-04-04 20:08         ` Denis V. Lunev
  2016-04-04 20:34           ` Eric Blake
  2016-04-05 14:15         ` Paolo Bonzini
  1 sibling, 1 reply; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 20:08 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 11:03 PM, Alex Bligh wrote:
> On 4 Apr 2016, at 20:54, Denis V. Lunev <den@openvz.org> wrote:
>
>> for now and for QEMU we want this to expose accumulated dirtiness
>> of the block device, which is collected by the server. Yes, this requires
>> external coordination. May be this COULD be the part of the protocol,
>> but QEMU will not use that part of the protocol.
>>
>> saying about dirtiness, we would soon come to the fact, that
>> we can have several dirtiness states regarding different
>> lines of incremental backups. This complexity is hidden
>> inside QEMU and it would be very difficult to publish and
>> reuse it.
> So qemu-nbdserver has some idea of 'dirtiness', and you want
> to publish that, and the consumer is qemu (and only qemu),
> because only qemu knows what 'dirtiness' means in the
> sense the server provides it?
>
> I've nothing against helping qemu out here (having
> contributed to qemu myself), but perhaps it might be better then,
> rather than call it 'dirtiness' to make it some named attribute
> for private client/server communication.
>
> This again makes me think this should be a different
> command from something which is obviously useful and
> comprehensible to more than one server/client (i.e.
> allocation).
>
original design of this command has used 16 number
to specify the NUMBER of the bitmap which was
exported by the server.

We have reserved number 0 for 'used' bitmap, i.e.
bitmap of allocated blocks and number 1 for 'dirty'
bitmap. Though we can skip specification of the
belonging of any number except '0' and put them
to server-client negotiations. Or we could reserve
'1' for dirtiness state as server-client agrees and
allow other applications to register their own bitmaps
as the deserve to.

Why not to do things this original way?

Den

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:08         ` Alex Bligh
@ 2016-04-04 20:13           ` Denis V. Lunev
  2016-04-04 20:15             ` Alex Bligh
  2016-04-04 20:26           ` Eric Blake
  1 sibling, 1 reply; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 20:13 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 11:08 PM, Alex Bligh wrote:
> On 4 Apr 2016, at 21:04, Denis V. Lunev <den@openvz.org> wrote:
>
>>> Sure, but given you can't report dirtiness without also reporting
>>> allocation, if they are are at different blocksize I'd rather they
>>> were in different commands, because otherwise the code to report
>>> block size needs to work at two different granularities.
>>>
>> 'dirty' could come after the data was 'trimmed' from the region!
>> thus we could have dirty unallocated data.
> Let me rephrase.
>
> Under the current proposal it is not possible to report whether or
> not a region is dirty without also reporting whether or not it
> is allocated. As these two concepts exist at potentially
> different block sizes, the code to support reporting on allocation
> must now be able to run both at the allocation blocksize and
> the dirtiness blocksize, which is going to be a pain.
>
> If these were two different commands, they could each run at their
> natural block size.
>
could you look into V1 of this?

As far as I remember that text we have had a number in request
specifying which bitmap to query and the server should reply with one
bitmap at a time.

Would this work for you?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:13           ` Denis V. Lunev
@ 2016-04-04 20:15             ` Alex Bligh
  2016-04-04 20:27               ` Denis V. Lunev
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 20:15 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Wouter Verhelst, Paolo Bonzini


On 4 Apr 2016, at 21:13, Denis V. Lunev <den@openvz.org> wrote:

> As far as I remember that text we have had a number in request
> specifying which bitmap to query and the server should reply with one
> bitmap at a time.
> 
> Would this work for you?

I think that would be much better, yes, though I'd suggest the
bitmap had an ID other than a number 0..15 so other people
can use it too.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:58     ` Alex Bligh
  2016-04-04 20:04       ` Denis V. Lunev
@ 2016-04-04 20:22       ` Eric Blake
  1 sibling, 0 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-04 20:22 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Denis V. Lunev, Wouter Verhelst,
	Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3742 bytes --]

On 04/04/2016 01:58 PM, Alex Bligh wrote:
> Eric,
> 
>> Nothing requires the two uses to report at the same granularity.  THe
>> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
>> as it sees fit (so it could report holes at a 4k granularity, but
>> dirtiness only at a 64k granularity) - all that matters is that when all
>> the descriptors have been sent, they total up to the length of the
>> original client request.  So by itself, granularity does not require
>> another command.
> 
> Sure, but given you can't report dirtiness without also reporting
> allocation, if they are are at different blocksize I'd rather they
> were in different commands, because otherwise the code to report
> block size needs to work at two different granularities.

We already state the client has to make two queries.  The client is NOT
querying block size, but a map of contiguous bytes of the file with the
same properties.

As an example usage,

C: NBD_CMD_BLOCK_SIZE: length 128k, offset 0, flags 0
S: NBD_REPLY_TYPE_BLOCK_SIZE: length 32 (4 descriptors):
     length 16384, status 0
     length 16384, status NBD_STATE_HOLE|NBD_STATE_ZERO
     length 81920, status 0
     length 16384, status NBD_STATE_ZERO
C: NBD_CMD_BLOCK_SIZE: length 128k, offset 0, flags NBD_CMD_FLAG_DIRTY
S: NBD_REPLY_TYPE_BLOCK_SIZE: length 16 (2 descriptors):
     length 65536, status 0
     length 65536, status NBD_STATE_CLEAN

would be a perfectly valid sequence of replies.  It tells the client
nothing about the guest block size (that information would be the same
whether the guest uses 512-byte, 1024-byte, or 4096-byte sectors?).  In
fact, you can't even tell if it is possible to track dirty information
in a granularity smaller than 64k, only that the results of this command
did not have anything smaller than that.  The command intentionally
separates modes of operation (you can't query allocation and
NBD_CMD_FLAG_DIRTY at the same time), in case the server has different
block size code under the hood for the two types of queries.

> 
>> At this point, I was just trying to rebase the proposal as originally
>> made by Denis and Pavel; perhaps they will have more insight on how they
>> envisioned using the command, or on whether we should try harder to make
>> this more of a two-way protocol (where the client can tell the server
>> when to mark something as clean, or when to start tracking whether
>> something is dirty).
> 
> I'm not sure it needs to be a two way protocol (see my strawman) but
> I'd like to know it's at least possibly useful.

Tracking generation ids on every sector is one implementation, but it is
not currently used by qemu for qcow2 images.  So forcing the
implementation to be exposed by having NBD track dirty information by
generation id would require qemu to start tracking new information that
it does not currently have.  Qemu's current implementation of dirty
information is a bitmap with no generation id, so the idea is that the
NBD command for exposing dirty sectors should likewise be no more
specific than a listing of run-length-encoded alternations between "this
part of the file is clean" and "this part of the file is dirty".

Even for an implementation that DOES track generation id on every
sector, you still have the case that you need out-of-band communication
to make use of it.  So the client would first send an out-of-band
message "set generation 123 as my line in the sand", then use NBD to
query dirty sectors (which returns clean for all sectors with id < 123,
and dirty for all sectors >= 123).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:08         ` Alex Bligh
  2016-04-04 20:13           ` Denis V. Lunev
@ 2016-04-04 20:26           ` Eric Blake
  2016-04-04 21:07             ` Alex Bligh
  1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2016-04-04 20:26 UTC (permalink / raw)
  To: Alex Bligh, Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

On 04/04/2016 02:08 PM, Alex Bligh wrote:
> 
> On 4 Apr 2016, at 21:04, Denis V. Lunev <den@openvz.org> wrote:
> 
>>> Sure, but given you can't report dirtiness without also reporting
>>> allocation, if they are are at different blocksize I'd rather they
>>> were in different commands, because otherwise the code to report
>>> block size needs to work at two different granularities.
>>>
>> 'dirty' could come after the data was 'trimmed' from the region!
>> thus we could have dirty unallocated data.
> 
> Let me rephrase.
> 
> Under the current proposal it is not possible to report whether or
> not a region is dirty without also reporting whether or not it
> is allocated.

Huh? The current proposal _requires_ these to be separate queries.  You
cannot query dirtiness at the same time as allocation, because the value
of NBD_CMD_FLAG_DIRTY is distinct between the two queries.

> As these two concepts exist at potentially
> different block sizes, the code to support reporting on allocation
> must now be able to run both at the allocation blocksize and
> the dirtiness blocksize, which is going to be a pain.

No, the code for reporting allocation does NOT have to be run at the
same time as when reporting dirtiness.

> 
> If these were two different commands, they could each run at their
> natural block size.

Both modes of operation already can run at their natural block size.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:15             ` Alex Bligh
@ 2016-04-04 20:27               ` Denis V. Lunev
  2016-04-04 20:45                 ` Eric Blake
  0 siblings, 1 reply; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 20:27 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 11:15 PM, Alex Bligh wrote:
> On 4 Apr 2016, at 21:13, Denis V. Lunev <den@openvz.org> wrote:
>
>> As far as I remember that text we have had a number in request
>> specifying which bitmap to query and the server should reply with one
>> bitmap at a time.
>>
>> Would this work for you?
> I think that would be much better, yes, though I'd suggest the
> bitmap had an ID other than a number 0..15 so other people
> can use it too.
>
bitmap requires to negotiate granularity which is
not that easy thing.

If we have different granularities for 'dirty' and 'allocated'
bitmaps and we can report this using this proto and
can not do this cleanly with bitmap approach assuming
that we will add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) state

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:08         ` Denis V. Lunev
@ 2016-04-04 20:34           ` Eric Blake
  2016-04-04 21:06             ` Denis V. Lunev
  2016-04-04 21:12             ` Alex Bligh
  0 siblings, 2 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-04 20:34 UTC (permalink / raw)
  To: Denis V. Lunev, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2929 bytes --]

On 04/04/2016 02:08 PM, Denis V. Lunev wrote:
>>
>> This again makes me think this should be a different
>> command from something which is obviously useful and
>> comprehensible to more than one server/client (i.e.
>> allocation).
>>
> original design of this command has used 16 number
> to specify the NUMBER of the bitmap which was
> exported by the server.

The original design abused the 16-bit 'flags' field of each command to
instead try and treat that value as a bitmap number, instead of a
bitwise-or'd set of flags.  That was one of the complaints against v1,
and was fixed in v2 by having a single boolean flag, NBD_CMD_FLAG_DIRTY,
which was off for (default) allocation queries, and set for dirtiness
queries.  We can add other flags for any other type of queries, and the
principle of each query being a run-length-encoded listing still applies.

> 
> We have reserved number 0 for 'used' bitmap, i.e.
> bitmap of allocated blocks and number 1 for 'dirty'
> bitmap. Though we can skip specification of the
> belonging of any number except '0' and put them
> to server-client negotiations. Or we could reserve
> '1' for dirtiness state as server-client agrees and
> allow other applications to register their own bitmaps
> as the deserve to.
> 
> Why not to do things this original way?

If you want to encode particular ids, you should do so in a separate
field, and not overload the 'flags' field.

As it is, we don't have structured writes - right now, you can write a
wire sniffer for the client side, where all commands except
NBD_CMD_WRITE are fixed size, and NBD_CMD_WRITE describes its own size
via its length field; the extension NBD_CMD_WRITE_ZEROES still fits into
this scheme.  All NBD implementations have to supply NBD_CMD_WRITE, but
any extension commands do NOT have to be universal.  Writing a wire
sniffer that special-cases NBD_CMD_WRITE is easy (since that command
will always exist), but writing a wire sniffer that special-cases
arbitrary commands, particularly where those arbitrary commands do not
also self-describe the length of the command, is hard.  We can't
overload the flags field to say which bitmap id to grab, but we also
can't arbitrarily add 4 bytes to the command size when the command is
NBD_CMD_BLOCK_STATUS (because wire sniffers that don't know about
NBD_CMD_BLOCK_STATUS wouldn't know to expect those four bytes to be part
of the current packet rather than starting a new packet).

The recent work on structured reads made it possible for an arbitrary
wire sniffer to gracefully skip over the variable-length return size
reply to NBD_CMD_BLOCK_STATUS, and any other extension command that we
might add later.  But right now, I'm not seeing a compelling reason to
add structured commands to the NBD protocol.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:27               ` Denis V. Lunev
@ 2016-04-04 20:45                 ` Eric Blake
  2016-04-04 21:04                   ` Denis V. Lunev
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Blake @ 2016-04-04 20:45 UTC (permalink / raw)
  To: Denis V. Lunev, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1966 bytes --]

On 04/04/2016 02:27 PM, Denis V. Lunev wrote:
> On 04/04/2016 11:15 PM, Alex Bligh wrote:
>> On 4 Apr 2016, at 21:13, Denis V. Lunev <den@openvz.org> wrote:
>>
>>> As far as I remember that text we have had a number in request
>>> specifying which bitmap to query and the server should reply with one
>>> bitmap at a time.
>>>
>>> Would this work for you?
>> I think that would be much better, yes, though I'd suggest the
>> bitmap had an ID other than a number 0..15 so other people
>> can use it too.
>>
> bitmap requires to negotiate granularity which is
> not that easy thing.
> 
> If we have different granularities for 'dirty' and 'allocated'
> bitmaps and we can report this using this proto and
> can not do this cleanly with bitmap approach assuming
> that we will add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) state

I'm not sure what you are trying to propose adding here.  'state' is a
bitmap - it is either a representation of two bits of information
(NBD_CMD_FLAG_DIRTY was clear, so state is the bitwise OR of
NBD_STATE_HOLE and NBD_STATE_ZERO), or the representation of one bit of
information (NBD_CMD_FLAG_DIRTY was set, so state is the bitwise OR of
NBD_STATE_CLEAN).

I'm not sure where you are reading into this that granularity has to be
negotiated.  The client never mentions granularity; and the server is
perfectly free to report data in descriptors as large or as small as it
wants (although I did document that the server SHOULD stick to
descriptors that are at least 512 bytes at a time, and SHOULD coalese
descriptors so that two consecutive descriptors have distinct state values).

Whether something is allocated or not has no direct bearing on whether
it is dirty or not; and it is feasible that a server could report the
act of NBD_CMD_TRIM as causing a portion of the file to become dirty.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:45                 ` Eric Blake
@ 2016-04-04 21:04                   ` Denis V. Lunev
  2016-04-04 21:12                     ` Alex Bligh
  2016-04-04 21:17                     ` Eric Blake
  0 siblings, 2 replies; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 21:04 UTC (permalink / raw)
  To: Eric Blake, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 11:45 PM, Eric Blake wrote:
> On 04/04/2016 02:27 PM, Denis V. Lunev wrote:
>> On 04/04/2016 11:15 PM, Alex Bligh wrote:
>>> On 4 Apr 2016, at 21:13, Denis V. Lunev <den@openvz.org> wrote:
>>>
>>>> As far as I remember that text we have had a number in request
>>>> specifying which bitmap to query and the server should reply with one
>>>> bitmap at a time.
>>>>
>>>> Would this work for you?
>>> I think that would be much better, yes, though I'd suggest the
>>> bitmap had an ID other than a number 0..15 so other people
>>> can use it too.
>>>
>> bitmap requires to negotiate granularity which is
>> not that easy thing.
>>
>> If we have different granularities for 'dirty' and 'allocated'
>> bitmaps and we can report this using this proto and
>> can not do this cleanly with bitmap approach assuming
>> that we will add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) state
> I'm not sure what you are trying to propose adding here.  'state' is a
> bitmap - it is either a representation of two bits of information
> (NBD_CMD_FLAG_DIRTY was clear, so state is the bitwise OR of
> NBD_STATE_HOLE and NBD_STATE_ZERO), or the representation of one bit of
> information (NBD_CMD_FLAG_DIRTY was set, so state is the bitwise OR of
> NBD_STATE_CLEAN).
sorry for vague description. I have messed up.

In v1 we have had 'status' field, which can have the
following values for dirty request:

+      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
+      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.

in the extent structure:

+        * 64 bits, offset (unsigned)
+        * 32 bits, length (unsigned)
+        * 16 bits, status (unsigned)

with an additional NBD_STATE_DIRTY_HOLE or (DIRTY_DEALLOCATED)
we could report the entire state using one query. The user could be
able to read entire state which is useful for backup at once.

Your current proposal is more tricky and it was misunderstood by Alex:

+        * 32 bits, status flags

and you describe flags as

+      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report allocation even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.

+      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status.

and opinion of Alex was that all 3 bits could be set in reply to 
NBD_CMD_BLOCK_STATUS
with NBD_CMD_FLAG_STATUS_DIRTY set.

This confused him. This confuses me too.

If allocated state is not replied to command with NBD_CMD_FLAG_STATUS_DIRTY
then why to have different meaning of bits.


> I'm not sure where you are reading into this that granularity has to be
> negotiated.  The client never mentions granularity; and the server is
> perfectly free to report data in descriptors as large or as small as it
> wants (although I did document that the server SHOULD stick to
> descriptors that are at least 512 bytes at a time, and SHOULD coalese
> descriptors so that two consecutive descriptors have distinct state values).
>
> Whether something is allocated or not has no direct bearing on whether
> it is dirty or not; and it is feasible that a server could report the
> act of NBD_CMD_TRIM as causing a portion of the file to become dirty.
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:34           ` Eric Blake
@ 2016-04-04 21:06             ` Denis V. Lunev
  2016-04-04 21:12             ` Alex Bligh
  1 sibling, 0 replies; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 21:06 UTC (permalink / raw)
  To: Eric Blake, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/04/2016 11:34 PM, Eric Blake wrote:
> On 04/04/2016 02:08 PM, Denis V. Lunev wrote:
>>> This again makes me think this should be a different
>>> command from something which is obviously useful and
>>> comprehensible to more than one server/client (i.e.
>>> allocation).
>>>
>> original design of this command has used 16 number
>> to specify the NUMBER of the bitmap which was
>> exported by the server.
> The original design abused the 16-bit 'flags' field of each command to
> instead try and treat that value as a bitmap number, instead of a
> bitwise-or'd set of flags.  That was one of the complaints against v1,
> and was fixed in v2 by having a single boolean flag, NBD_CMD_FLAG_DIRTY,
> which was off for (default) allocation queries, and set for dirtiness
> queries.  We can add other flags for any other type of queries, and the
> principle of each query being a run-length-encoded listing still applies.
>
>> We have reserved number 0 for 'used' bitmap, i.e.
>> bitmap of allocated blocks and number 1 for 'dirty'
>> bitmap. Though we can skip specification of the
>> belonging of any number except '0' and put them
>> to server-client negotiations. Or we could reserve
>> '1' for dirtiness state as server-client agrees and
>> allow other applications to register their own bitmaps
>> as the deserve to.
>>
>> Why not to do things this original way?
> If you want to encode particular ids, you should do so in a separate
> field, and not overload the 'flags' field.
>
> As it is, we don't have structured writes - right now, you can write a
> wire sniffer for the client side, where all commands except
> NBD_CMD_WRITE are fixed size, and NBD_CMD_WRITE describes its own size
> via its length field; the extension NBD_CMD_WRITE_ZEROES still fits into
> this scheme.  All NBD implementations have to supply NBD_CMD_WRITE, but
> any extension commands do NOT have to be universal.  Writing a wire
> sniffer that special-cases NBD_CMD_WRITE is easy (since that command
> will always exist), but writing a wire sniffer that special-cases
> arbitrary commands, particularly where those arbitrary commands do not
> also self-describe the length of the command, is hard.  We can't
> overload the flags field to say which bitmap id to grab, but we also
> can't arbitrarily add 4 bytes to the command size when the command is
> NBD_CMD_BLOCK_STATUS (because wire sniffers that don't know about
> NBD_CMD_BLOCK_STATUS wouldn't know to expect those four bytes to be part
> of the current packet rather than starting a new packet).
>
> The recent work on structured reads made it possible for an arbitrary
> wire sniffer to gracefully skip over the variable-length return size
> reply to NBD_CMD_BLOCK_STATUS, and any other extension command that we
> might add later.  But right now, I'm not seeing a compelling reason to
> add structured commands to the NBD protocol.
>
thank you for pointing this out. I think I got an idea

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:26           ` Eric Blake
@ 2016-04-04 21:07             ` Alex Bligh
  2016-04-04 21:25               ` Eric Blake
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 21:07 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Paolo Bonzini, Wouter Verhelst,
	Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 446 bytes --]


On 4 Apr 2016, at 21:26, Eric Blake <eblake@redhat.com> wrote:

> Huh? The current proposal _requires_ these to be separate queries.  You
> cannot query dirtiness at the same time as allocation, because the value
> of NBD_CMD_FLAG_DIRTY is distinct between the two queries.

OK, so you can ask for either (1) or (2), but not both. I see. I misread
that. I still think Denis's idea of selecting a bitmap to query works better.

--
Alex Bligh





[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:34           ` Eric Blake
  2016-04-04 21:06             ` Denis V. Lunev
@ 2016-04-04 21:12             ` Alex Bligh
  1 sibling, 0 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 21:12 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Paolo Bonzini, Wouter Verhelst,
	Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 981 bytes --]


On 4 Apr 2016, at 21:34, Eric Blake <eblake@redhat.com> wrote:

> The original design abused the 16-bit 'flags' field of each command to
> instead try and treat that value as a bitmap number, instead of a
> bitwise-or'd set of flags.  That was one of the complaints against v1,
> and was fixed in v2 by having a single boolean flag, NBD_CMD_FLAG_DIRTY,
> which was off for (default) allocation queries, and set for dirtiness
> queries.  We can add other flags for any other type of queries, and the
> principle of each query being a run-length-encoded listing still applies.

Well abusing flags is pretty gross.

You're multiplexing on a single flag and I didn't like that.

So I suggested multiplexing on the command and you didn't like that.

Eventually as you point out we really want to expand the command field,
and ...

> As it is, we don't have structured writes - right now, you can write a

... I agree this is almost inventing 'structured writes' :-)

--
Alex Bligh





[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 21:04                   ` Denis V. Lunev
@ 2016-04-04 21:12                     ` Alex Bligh
  2016-04-04 21:17                     ` Eric Blake
  1 sibling, 0 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 21:12 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Wouter Verhelst, Paolo Bonzini


On 4 Apr 2016, at 22:04, Denis V. Lunev <den@openvz.org> wrote:

> and opinion of Alex was that all 3 bits could be set in reply to NBD_CMD_BLOCK_STATUS
> with NBD_CMD_FLAG_STATUS_DIRTY set.
> 
> This confused him. This confuses me too.

Yeah that was precisely how I got confused.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 21:04                   ` Denis V. Lunev
  2016-04-04 21:12                     ` Alex Bligh
@ 2016-04-04 21:17                     ` Eric Blake
  2016-04-04 21:27                       ` Denis V. Lunev
  1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2016-04-04 21:17 UTC (permalink / raw)
  To: Denis V. Lunev, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2675 bytes --]

On 04/04/2016 03:04 PM, Denis V. Lunev wrote:
> In v1 we have had 'status' field, which can have the
> following values for dirty request:
> 
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> 
> in the extent structure:
> 
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)

Between v1 and v2, we dropped 64-bit offset (offset is now implied, by
adding lengths of all earlier descriptors), and widened status from 16
bits to 32 bits (so that each descriptor is now naturally 8-byte aligned
and therefore easier to make a C array).

> 
> with an additional NBD_STATE_DIRTY_HOLE or (DIRTY_DEALLOCATED)
> we could report the entire state using one query. The user could be
> able to read entire state which is useful for backup at once.
> 
> Your current proposal is more tricky and it was misunderstood by Alex:
> 
> +        * 32 bits, status flags
> 
> and you describe flags as
> 
> +      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole

> +      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as

> 
> +      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a

> and opinion of Alex was that all 3 bits could be set in reply to
> NBD_CMD_BLOCK_STATUS
> with NBD_CMD_FLAG_STATUS_DIRTY set.
> 
> This confused him. This confuses me too.

There's nothing that says that NBD_STATE_CLEAN can't be reassigned to
bit 0.  Conversely, we may want to add a future NBD_CMD_FLAG_FOO that
lets you read allocation and dirty information in the same call, in
which case having the bits be distinct will make that easier; but where
we would also make it obvious that the server is allowed to reject that
command flag as unsupported (we already state the server can reject
NBD_CMD_FLAG_STATUS_DIRTY with EINVAL as unsupported; and that it if
does not reject a dirtiness request but cannot otherwise report
anything, then the entire region is reported as dirty).

I don't have any strong opinions on whether NBD_STATE_CLEAN should
remain bit 2 or be renumbered to bit 0, although renumbering it to bit 0
would make it painfully obvious that you cannot query allocation and
dirtiness at the same time.

> 
> If allocated state is not replied to command with NBD_CMD_FLAG_STATUS_DIRTY
> then why to have different meaning of bits.

Because we still have room - no need to overlap the meaning of bits as
long as we have more bits to choose from.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 21:07             ` Alex Bligh
@ 2016-04-04 21:25               ` Eric Blake
  2016-04-04 22:06                 ` Alex Bligh
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Blake @ 2016-04-04 21:25 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Paolo Bonzini, Wouter Verhelst,
	Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1974 bytes --]

On 04/04/2016 03:07 PM, Alex Bligh wrote:
> 
> On 4 Apr 2016, at 21:26, Eric Blake <eblake@redhat.com> wrote:
> 
>> Huh? The current proposal _requires_ these to be separate queries.  You
>> cannot query dirtiness at the same time as allocation, because the value
>> of NBD_CMD_FLAG_DIRTY is distinct between the two queries.
> 
> OK, so you can ask for either (1) or (2), but not both. I see. I misread
> that. I still think Denis's idea of selecting a bitmap to query works better.

I'm still not sure I follow what you are proposing in 'selecting a
bitmap', at least not without also needing to expand the protocol to
allow for a structured command that has more than a fixed-length message
size (currently all commands except NBD_CMD_WRITE) or a self-described
size (NBD_CMD_WRITE).  Would this bitmap id be specified as an integer
(32 bits?) as an arbitrary UTF-8 string? or as something else?  And
since we _already_ documented that querying dirty information requires
out-of-band coordination, why can't the out-of-band communication convey
the bitmap selection, so that the NBD command then just reads the dirty
status of the already-selected bitmap?

It was Paolo's suggestion to document that querying dirty status is only
useful with out-of-band coordination, at which point, why bloat the NBD
protocol with details that are better left to that external
coordination?  Wouter had a valid complaint that v1 was unspecified (it
said that being "dirty" was implementation defined, but gave no other
meaning and a server could use random numbers and still be compliant);
v2 picked up Paolo's wording suggestion against v1 that tries to make it
obvious that being "dirty" is still implementation defined, but that the
definition is whatever it takes for a coordination with out-of-band
information to provide useful results.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 21:17                     ` Eric Blake
@ 2016-04-04 21:27                       ` Denis V. Lunev
  0 siblings, 0 replies; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-04 21:27 UTC (permalink / raw)
  To: Eric Blake, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Paolo Bonzini

On 04/05/2016 12:17 AM, Eric Blake wrote:
> On 04/04/2016 03:04 PM, Denis V. Lunev wrote:
>> In v1 we have had 'status' field, which can have the
>> following values for dirty request:
>>
>> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
>> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
>>
>> in the extent structure:
>>
>> +        * 64 bits, offset (unsigned)
>> +        * 32 bits, length (unsigned)
>> +        * 16 bits, status (unsigned)
> Between v1 and v2, we dropped 64-bit offset (offset is now implied, by
> adding lengths of all earlier descriptors), and widened status from 16
> bits to 32 bits (so that each descriptor is now naturally 8-byte aligned
> and therefore easier to make a C array).
>
>> with an additional NBD_STATE_DIRTY_HOLE or (DIRTY_DEALLOCATED)
>> we could report the entire state using one query. The user could be
>> able to read entire state which is useful for backup at once.
>>
>> Your current proposal is more tricky and it was misunderstood by Alex:
>>
>> +        * 32 bits, status flags
>>
>> and you describe flags as
>>
>> +      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
>> +      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
>> +      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
>> and opinion of Alex was that all 3 bits could be set in reply to
>> NBD_CMD_BLOCK_STATUS
>> with NBD_CMD_FLAG_STATUS_DIRTY set.
>>
>> This confused him. This confuses me too.
> There's nothing that says that NBD_STATE_CLEAN can't be reassigned to
> bit 0.  Conversely, we may want to add a future NBD_CMD_FLAG_FOO that
> lets you read allocation and dirty information in the same call, in
> which case having the bits be distinct will make that easier; but where
> we would also make it obvious that the server is allowed to reject that
> command flag as unsupported (we already state the server can reject
> NBD_CMD_FLAG_STATUS_DIRTY with EINVAL as unsupported; and that it if
> does not reject a dirtiness request but cannot otherwise report
> anything, then the entire region is reported as dirty).
>
> I don't have any strong opinions on whether NBD_STATE_CLEAN should
> remain bit 2 or be renumbered to bit 0, although renumbering it to bit 0
> would make it painfully obvious that you cannot query allocation and
> dirtiness at the same time.
I think it is worth to do to avoid this type of the confusion.

>> If allocated state is not replied to command with NBD_CMD_FLAG_STATUS_DIRTY
>> then why to have different meaning of bits.
> Because we still have room - no need to overlap the meaning of bits as
> long as we have more bits to choose from.
>
ok

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 21:25               ` Eric Blake
@ 2016-04-04 22:06                 ` Alex Bligh
  0 siblings, 0 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-04 22:06 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Paolo Bonzini, Wouter Verhelst,
	Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 2004 bytes --]


On 4 Apr 2016, at 22:25, Eric Blake <eblake@redhat.com> wrote:

> On 04/04/2016 03:07 PM, Alex Bligh wrote:
>> 
>> On 4 Apr 2016, at 21:26, Eric Blake <eblake@redhat.com> wrote:
>> 
>>> Huh? The current proposal _requires_ these to be separate queries.  You
>>> cannot query dirtiness at the same time as allocation, because the value
>>> of NBD_CMD_FLAG_DIRTY is distinct between the two queries.
>> 
>> OK, so you can ask for either (1) or (2), but not both. I see. I misread
>> that. I still think Denis's idea of selecting a bitmap to query works better.
> 
> I'm still not sure I follow what you are proposing in 'selecting a
> bitmap', at least not without also needing to expand the protocol to
> allow for a structured command that has more than a fixed-length message
> size (currently all commands except NBD_CMD_WRITE) or a self-described
> size (NBD_CMD_WRITE).

Whether it's an integer, or a UTF-8 string or whatever, we'd need
a way to expand the command structure.

An easy way to do that would be a command flag 'NBD_CMD_FLAG_EXTENDED_COMMAND'
which means the command was immediately followed by

0000 32-bit: command extra-length
0004 variable length data

>  Would this bitmap id be specified as an integer
> (32 bits?) as an arbitrary UTF-8 string? or as something else?

Or a 128 bit unique identifier (i.e. a packed UUID) to identify a bitmap.
That would obviate the need for a registry of such things.

That or the UTF-8 string

>  And
> since we _already_ documented that querying dirty information requires
> out-of-band coordination, why can't the out-of-band communication convey
> the bitmap selection, so that the NBD command then just reads the dirty
> status of the already-selected bitmap?

I suppose I was trying to make a single 'read bitmap' command, that
read an arbitrary bitmap. We then define one 'well known' bitmap
as 'allocation'. You can have your qemu bitmap(s) to do whatever
you want.

--
Alex Bligh





[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
  2016-04-04 18:06 ` [Qemu-devel] [Nbd] " Alex Bligh
@ 2016-04-04 22:40 ` Wouter Verhelst
  2016-04-04 23:03   ` Eric Blake
  2016-04-05  4:05 ` [Qemu-devel] " Kevin Wolf
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Wouter Verhelst @ 2016-04-04 22:40 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev, Paolo Bonzini

Hi,

Need to look into this in some detail, for which I don't have the time
(or the non-tiredness ;-) right now, but these two caught my eye:

On Mon, Apr 04, 2016 at 10:39:10AM -0600, Eric Blake wrote:
[...]
> +* `NBD_REPLY_TYPE_BLOCK_STATUS`
> +
> +    *length* MUST be a positive integer multiple of 8.  This reply
> +    represents a series of consecutive block descriptors where the sum
> +    of the lengths of the descriptors equals the length of the
> +    original request.  This chunk type MUST appear at most once in a
> +    structured reply. Valid as a reply to `NBD_CMD_BLOCK_STATUS.
> +
> +    The payload is structured as a list of one or more descriptors,
> +    each with this layout:
> +
> +        * 32 bits, length (unsigned, MUST NOT be zero)
> +        * 32 bits, status flags
> +
> +    The definition of the status flags is determined based on the
> +    flags present in the original request.

Might be a good idea to specify what a client can do with flags it
doesn't know about; ignore them, probably?
 
[...]
> +The extension adds the following new command flag:
> +
> +- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request dirtiness status
> +  rather than provisioning status.

Why only one flag here? I could imagine a client might want to query for
both states at the same time. Obviously that means a client needs to
query for *at least* one of the statuses, otherwise the server should
reply with EINVAL.

Though I'm undecided on whether a bit being set to 0 should mean "give
me that information" or whether 1 should.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 22:40 ` Wouter Verhelst
@ 2016-04-04 23:03   ` Eric Blake
  2016-04-05 13:41     ` Paolo Bonzini
  2016-04-06  5:57     ` Denis V. Lunev
  0 siblings, 2 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-04 23:03 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2773 bytes --]

On 04/04/2016 04:40 PM, Wouter Verhelst wrote:
> Hi,
> 
> Need to look into this in some detail, for which I don't have the time
> (or the non-tiredness ;-) right now, but these two caught my eye:
> 

>> +    The payload is structured as a list of one or more descriptors,
>> +    each with this layout:
>> +
>> +        * 32 bits, length (unsigned, MUST NOT be zero)
>> +        * 32 bits, status flags
>> +
>> +    The definition of the status flags is determined based on the
>> +    flags present in the original request.
> 
> Might be a good idea to specify what a client can do with flags it
> doesn't know about; ignore them, probably?

Sounds correct - so a server can subdivide into more descriptors with
additional status bits, and clients just have to ignore those extra bits
and coalesce things itself when dealing with the bits it cares about.

>  
> [...]
>> +The extension adds the following new command flag:
>> +
>> +- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
>> +  SHOULD be set to 1 if the client wants to request dirtiness status
>> +  rather than provisioning status.
> 
> Why only one flag here? I could imagine a client might want to query for
> both states at the same time. Obviously that means a client needs to
> query for *at least* one of the statuses, otherwise the server should
> reply with EINVAL.
> 
> Though I'm undecided on whether a bit being set to 0 should mean "give
> me that information" or whether 1 should.

Hmm. Based on that reading, maybe we want TWO flags:

NBD_CMD_FLAG_STATUS_ALLOCATED
NBD_CMD_FLAG_STATUS_DIRTY

and require that the client MUST set at least one flag (setting no flags
at all is boring), but similarly allow that the server MAY reject
attempts to set multiple flags with EINVAL (if there is no efficient way
to provide the information for all flags on a single pass), in which
case clients have to fall back to querying one flag at a time.

But while Alex and Denis were arguing that no one would ever query both
things at once (and therefore, it might be better to make
NBD_STATUS_HOLE and NBD_STATUS_CLEAN both be bit 0), your approach of
having two separate request flags and allowing both at once would mean
we do need to keep the status information separate (NBD_STATUS_HOLE is
bit 0, NBD_STATUS_CLEAN is bit 2).

I also see that I failed to reserve a bit for NBD_CMD_FLAG_STATUS_DIRTY.

Looks like there will still be some more conversation and at least a v3
needed, but I'll wait a couple days for that to happen so that more
reviewers can chime in, without being too tired during the review process.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:54     ` Denis V. Lunev
  2016-04-04 20:03       ` Alex Bligh
@ 2016-04-04 23:08       ` Wouter Verhelst
  2016-04-04 23:32         ` Eric Blake
  2016-04-05  7:13         ` Alex Bligh
  1 sibling, 2 replies; 59+ messages in thread
From: Wouter Verhelst @ 2016-04-04 23:08 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Paolo Bonzini

On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> saying about dirtiness, we would soon come to the fact, that
> we can have several dirtiness states regarding different
> lines of incremental backups. This complexity is hidden
> inside QEMU and it would be very difficult to publish and
> reuse it.

How about this then.

A reply to GET_BLOCK_STATUS containing chunks of this:

32-bit length
32-bit "snapshot status"
if bit 0 in the latter field is set, that means the block is allocated
  on the original device
if bit 1 is set, that means the block is allocated on the first-level
  snapshot
if bit 2 is set, that means the block is allocated on the second-level
  snapshot

etc

If all flags are cleared, that means the block is not allocated (i.e.,
is a hole) and MUST read as zeroes.

If a flag is set at a particular level X, that means the device is dirty
at the Xth-level snapshot.

If at least one flag is set for a region, that means the data may read
as "not zero".

The protocol does not define what it means to have multiple levels of
snapshots, other than:

- Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
  Xth level flag if the Yth level flag is not also cleared at the same
  time, for any Y > X
- Any write (as above) MAY set or clear multiple levels of flags at the
  same time, as long as the above holds

Having a 32-bit snapshot status field allows for 32 levels of snapshots.
We could switch length and flags to 64 bits so that things continue to
align nicely, and then we have a maximum of 64 levels of snapshots.

(I'm not going to write this up formally at this time of night, but you
get the general idea)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 23:08       ` Wouter Verhelst
@ 2016-04-04 23:32         ` Eric Blake
  2016-04-05  7:16           ` Wouter Verhelst
  2016-04-05 21:44           ` Wouter Verhelst
  2016-04-05  7:13         ` Alex Bligh
  1 sibling, 2 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-04 23:32 UTC (permalink / raw)
  To: Wouter Verhelst, Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, Alex Bligh, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3979 bytes --]

On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
>> saying about dirtiness, we would soon come to the fact, that
>> we can have several dirtiness states regarding different
>> lines of incremental backups. This complexity is hidden
>> inside QEMU and it would be very difficult to publish and
>> reuse it.
> 
> How about this then.
> 
> A reply to GET_BLOCK_STATUS containing chunks of this:
> 
> 32-bit length
> 32-bit "snapshot status"
> if bit 0 in the latter field is set, that means the block is allocated
>   on the original device
> if bit 1 is set, that means the block is allocated on the first-level
>   snapshot
> if bit 2 is set, that means the block is allocated on the second-level
>   snapshot

The idea of allocation is orthogonal from the idea of reads as zeroes.
That is, a client may usefully guarantee that something reads as zeroes,
whether or not it is allocated (but knowing whether it is a hole or
allocated will determine whether future writes to that area will cause
file system fragmentation or be at risk of ENOSPC on thin-provisioning).
 If we want to expose the notion of depth (and I'm not sure about that
yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
'allocated at depth "bit-1"' (and bit 31 as 'allocated at depth 30 or
greater).

I don't know if the idea of depth of allocation is useful enough to
expose in this manner; qemu could certainly advertise depth if the
protocol calls it out, but I'm still not sure whether knowing depth
helps any algorithms.

> 
> If all flags are cleared, that means the block is not allocated (i.e.,
> is a hole) and MUST read as zeroes.

That's too strong.  NBD_CMD_TRIM says that we can create holes whose
data does not necessarily read as zeroes (and SCSI definitely has
semantics like this - not all devices guarantee zero reads when you
UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
with the faster unmapping operation at the expense of bad reads, or
slower explicit writes).  Hence my complaint that we have to treat
'reads as zero' as an orthogonal bit to 'allocated at depth X'.

> 
> If a flag is set at a particular level X, that means the device is dirty
> at the Xth-level snapshot.
> 
> If at least one flag is set for a region, that means the data may read
> as "not zero".
> 
> The protocol does not define what it means to have multiple levels of
> snapshots, other than:
> 
> - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
>   Xth level flag if the Yth level flag is not also cleared at the same
>   time, for any Y > X
> - Any write (as above) MAY set or clear multiple levels of flags at the
>   same time, as long as the above holds
> 
> Having a 32-bit snapshot status field allows for 32 levels of snapshots.
> We could switch length and flags to 64 bits so that things continue to
> align nicely, and then we have a maximum of 64 levels of snapshots.

64 bits may not lay out as nicely (a 12-byte struct is not as efficient
to copy between the wire and a C array as a 8-byte struct).

> 
> (I'm not going to write this up formally at this time of night, but you
> get the general idea)

The idea may make it possible to expose dirty information as a layer of
depth (from the qemu perspective, each qcow2 file would occupy 2 layers
of depth: one if dirty, and another if allocated; then deeper layers are
determined by backing files).  But I'm also worried that it may be more
complicated than the original question at hand (qemu wants to know,  in
advance of a read, which portions of a file are worth reading, because
they are either allocated, or because they are dirty; but doesn't care
to what depth the server has to go to actually perform the reads).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
  2016-04-04 18:06 ` [Qemu-devel] [Nbd] " Alex Bligh
  2016-04-04 22:40 ` Wouter Verhelst
@ 2016-04-05  4:05 ` Kevin Wolf
  2016-04-05 13:43   ` Paolo Bonzini
  2016-04-05  8:51 ` Stefan Hajnoczi
  2016-04-05  9:24 ` [Qemu-devel] [Nbd] " Markus Pargmann
  4 siblings, 1 reply; 59+ messages in thread
From: Kevin Wolf @ 2016-04-05  4:05 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Paolo Bonzini, Wouter Verhelst, Denis V. Lunev

Am 04.04.2016 um 18:39 hat Eric Blake geschrieben:
> With the availability of sparse storage formats, it is often needed
> to query status of a particular range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors).  Future NBD extensions may add commands
> to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
> 
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>

I think there is one common type of request that clients can't implement
efficiently with this proposal: Give me the status and the length of the
next extent of blocks with the same status. This is essentially how
bdrv_get_block_status() works in qemu, so we want an efficient way to do
this.

With this proposal, if I understand correctly, we could either send
short NBD_CMD_BLOCK_STATUS requests and risk being inefficient because
we need many calls for a large area with the same status; or, if we send
a large NBD_CMD_BLOCK_STATUS request, we may get a ridiculously long
reply (with the corresponding overhead on the server to compose the
list) while we're only really interested in the first descriptor.

The options I can think of is adding a request field "max number of
descriptors" or a flag "only single descriptor" (with the assumption
that clients always want one or unlimited), but maybe you have a better
idea.

Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 23:08       ` Wouter Verhelst
  2016-04-04 23:32         ` Eric Blake
@ 2016-04-05  7:13         ` Alex Bligh
  1 sibling, 0 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-05  7:13 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Paolo Bonzini, Denis V. Lunev


On 5 Apr 2016, at 00:08, Wouter Verhelst <w@uter.be> wrote:

> 
> if bit 0 in the latter field is set, that means the block is allocated
>  on the original device
> if bit 1 is set, that means the block is allocated on the first-level
>  snapshot
> if bit 2 is set, that means the block is allocated on the second-level
>  snapshot

This is what I originally thought Eric was proposing. The issue
is that in common servers the 'dirtiness' is at a different
allocation / blocksize from the allocation status. So you
really want 2 different bitmaps.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 23:32         ` Eric Blake
@ 2016-04-05  7:16           ` Wouter Verhelst
  2016-04-05 21:44           ` Wouter Verhelst
  1 sibling, 0 replies; 59+ messages in thread
From: Wouter Verhelst @ 2016-04-05  7:16 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Paolo Bonzini, Denis V. Lunev

On Mon, Apr 04, 2016 at 05:32:34PM -0600, Eric Blake wrote:
> On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> > On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> >> saying about dirtiness, we would soon come to the fact, that
> >> we can have several dirtiness states regarding different
> >> lines of incremental backups. This complexity is hidden
> >> inside QEMU and it would be very difficult to publish and
> >> reuse it.
> > 
> > How about this then.
> > 
> > A reply to GET_BLOCK_STATUS containing chunks of this:
> > 
> > 32-bit length
> > 32-bit "snapshot status"
> > if bit 0 in the latter field is set, that means the block is allocated
> >   on the original device
> > if bit 1 is set, that means the block is allocated on the first-level
> >   snapshot
> > if bit 2 is set, that means the block is allocated on the second-level
> >   snapshot
> 
> The idea of allocation is orthogonal from the idea of reads as zeroes.
> That is, a client may usefully guarantee that something reads as zeroes,
> whether or not it is allocated (but knowing whether it is a hole or
> allocated will determine whether future writes to that area will cause
> file system fragmentation or be at risk of ENOSPC on thin-provisioning).
>  If we want to expose the notion of depth (and I'm not sure about that
> yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
> 'allocated at depth "bit-1"'

That works too, I suppose.

> (and bit 31 as 'allocated at depth 30 or greater).

The reason I said "the protocol does not define" was specifically to
avoid this. There is nothing that compels a server to map its
implementation-specific snapshots one-to-one on NBD protocol snapshots.
E.g., a client could ask for a snapshot out of band; the server would
then create a new snapshot on top of its existing snapshots, and map
things as: bit 0 for "reads as zeroes", bit 1 for "allocated before the
snapshot you just asked me for", folding together all allocation bits
for older snapshots, and bit 2 as "this block has been changed in the
context of the snapshot you just asked me for".

The out-of-band protocol could alternatively say "NBD snapshot level 13
is the snapshot I just created for you".

> I don't know if the idea of depth of allocation is useful enough to
> expose in this manner; qemu could certainly advertise depth if the
> protocol calls it out, but I'm still not sure whether knowing depth
> helps any algorithms.

The point is that with the proposed format, depth *can* be exposed but
doesn't *have* to be. The original proposal provided two levels of
allocation (on the original device, and on the snapshot or whatever
tracking method is used). This proposal provides 32 (if counting the "is
zeroes" bit as a separate level).

> > If all flags are cleared, that means the block is not allocated (i.e.,
> > is a hole) and MUST read as zeroes.
> 
> That's too strong.  NBD_CMD_TRIM says that we can create holes whose
> data does not necessarily read as zeroes (and SCSI definitely has
> semantics like this - not all devices guarantee zero reads when you
> UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
> with the faster unmapping operation at the expense of bad reads, or
> slower explicit writes).  Hence my complaint that we have to treat
> 'reads as zero' as an orthogonal bit to 'allocated at depth X'.

Right, so we should special-case the first bit then, as you suggest.

> > If a flag is set at a particular level X, that means the device is dirty
> > at the Xth-level snapshot.
> > 
> > If at least one flag is set for a region, that means the data may read
> > as "not zero".
> > 
> > The protocol does not define what it means to have multiple levels of
> > snapshots, other than:
> > 
> > - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
> >   Xth level flag if the Yth level flag is not also cleared at the same
> >   time, for any Y > X
> > - Any write (as above) MAY set or clear multiple levels of flags at the
> >   same time, as long as the above holds
> > 
> > Having a 32-bit snapshot status field allows for 32 levels of snapshots.
> > We could switch length and flags to 64 bits so that things continue to
> > align nicely, and then we have a maximum of 64 levels of snapshots.
> 
> 64 bits may not lay out as nicely (a 12-byte struct is not as efficient
> to copy between the wire and a C array as a 8-byte struct).

Which is why I said to switch *both* fields to 64 bits, so that it
becomes 16 bytes rather than 12 :-)

> > (I'm not going to write this up formally at this time of night, but you
> > get the general idea)
> 
> The idea may make it possible to expose dirty information as a layer of
> depth (from the qemu perspective, each qcow2 file would occupy 2 layers
> of depth: one if dirty, and another if allocated; then deeper layers are
> determined by backing files).  But I'm also worried that it may be more
> complicated than the original question at hand (qemu wants to know,  in
> advance of a read, which portions of a file are worth reading, because
> they are either allocated, or because they are dirty; but doesn't care
> to what depth the server has to go to actually perform the reads).

You just do "if ([snapshot status] >= (1 << X))" to see if any bits
beyond level X are set. That's a fairly simple operation. If a client
isn't interested in anything more than "modified beyond the original
device", it checks whether the snapshot status field has a value >= 4.
If so, it knows it must read the relevant blocks.

If the server doesn't want to expose more information than the client
would need for such an operation, it could fold all snapshots beyond the
original device into the first level snapshot (i.e., only sending
information in the first three bits of the field), and then the client
would have less information, but still enough to do what it needs to do.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
                   ` (2 preceding siblings ...)
  2016-04-05  4:05 ` [Qemu-devel] " Kevin Wolf
@ 2016-04-05  8:51 ` Stefan Hajnoczi
  2016-04-05  9:24 ` [Qemu-devel] [Nbd] " Markus Pargmann
  4 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2016-04-05  8:51 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Paolo Bonzini, Wouter Verhelst, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1002 bytes --]

On Mon, Apr 04, 2016 at 10:39:10AM -0600, Eric Blake wrote:
> +To provide such class of information, the `BLOCK_STATUS` extension
> +adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
> +ranges with their respective states.  This extension is not available
> +unless the client also negotiates the `STRUCTURED_REPLY` extension.
> +
> +* `NBD_FLAG_SEND_BLOCK_STATUS`
> +
> +    The server SHOULD set this transmission flag to 1 if structured
> +    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
> +    request is supported.
> +
> +* `NBD_REPLY_TYPE_BLOCK_STATUS`
> +
> +    *length* MUST be a positive integer multiple of 8.  This reply
> +    represents a series of consecutive block descriptors where the sum
> +    of the lengths of the descriptors equals the length of the
> +    original request.  This chunk type MUST appear at most once in a
> +    structured reply. Valid as a reply to `NBD_CMD_BLOCK_STATUS.

Missing terminating backquote on NBD_CMD_BLOCK_STATUS.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
                   ` (3 preceding siblings ...)
  2016-04-05  8:51 ` Stefan Hajnoczi
@ 2016-04-05  9:24 ` Markus Pargmann
  2016-04-05 13:50   ` Paolo Bonzini
  2016-04-05 14:14   ` Eric Blake
  4 siblings, 2 replies; 59+ messages in thread
From: Markus Pargmann @ 2016-04-05  9:24 UTC (permalink / raw)
  To: nbd-general
  Cc: Kevin Wolf, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Denis V. Lunev, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 13117 bytes --]

Hi,

On Monday 04 April 2016 10:39:10 Eric Blake wrote:
> With the availability of sparse storage formats, it is often needed
> to query status of a particular range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors).  Future NBD extensions may add commands
> to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
> 
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 
> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
> 
> Since then, we've added the STRUCTURED_REPLY extension, which
> necessitates a rather larger rebase; I've also changed things
> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
> modes to be determined by boolean flags (rather than by fixed
> values of the 16-bit flags field), changed the reply status fields
> to be bitwise-or values (with a default of 0 always being sane),
> and changed the descriptor layout to drop an offset but to include
> a 32-bit status so that the descriptor is nicely 8-byte aligned
> without padding.
> 
>  doc/proto.md | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 151 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index fb97217..d280bbd 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -296,6 +296,8 @@ immediately after the handshake flags field in oldstyle negotiation:
>    `WRITE_ZEROES` extension; see below.
>  - bit 7, `NBD_FLAG_SEND_DF`; defined by the experimental `STRUCTURED_REPLY`
>    extension; see below.
> +- bit 8, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
> +  `BLOCK_STATUS` extension; see below.
> 
>  Clients SHOULD ignore unknown flags.
> 
> @@ -583,6 +585,10 @@ The following request types exist:
> 
>      Defined by the experimental `WRITE_ZEROES` extension; see below.
> 
> +* `NBD_CMD_BLOCK_STATUS` (7)
> +
> +    Defined by the experimental `BLOCK_STATUS` extension; see below.
> +
>  * Other requests
> 
>      Some third-party implementations may require additional protocol
> @@ -922,7 +928,8 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>        were sent earlier in the structured reply, the server SHOULD NOT
>        send multiple distinct offsets that lie within the bounds of a
>        single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> -      `NBD_CMD_WRITE`, `NBD_CMD_WRITE_ZEROES`, and `NBD_CMD_TRIM`.
> +      `NBD_CMD_WRITE`, `NBD_CMD_WRITE_ZEROES`, `NBD_CMD_TRIM`, and
> +      `NBD_CMD_BLOCK_STATUS`.
> 
>        The payload is structured as:
> 
> @@ -965,6 +972,10 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>        64 bits: offset (unsigned)  
>        32 bits: hole size (unsigned, MUST be nonzero)  
> 
> +    - `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> +
> +      Defined by the experimental extension `BLOCK_STATUS`; see below.
> +
>  * `NBD_OPT_STRUCTURED_REPLY`
> 
>      The client wishes to use structured replies during the
> @@ -1094,6 +1105,145 @@ error, and alters the reply to the `NBD_CMD_READ` request.
>      support unfragmented reads in which the client's request length
>      does not exceed 65,536 bytes.
> 
> +### `BLOCK_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to
> +query the status of a particular range and read only those blocks of
> +data that are actually present on the block device.
> +
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> +
> +To provide such class of information, the `BLOCK_STATUS` extension
> +adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
> +ranges with their respective states.  This extension is not available
> +unless the client also negotiates the `STRUCTURED_REPLY` extension.
> +
> +* `NBD_FLAG_SEND_BLOCK_STATUS`
> +
> +    The server SHOULD set this transmission flag to 1 if structured
> +    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
> +    request is supported.
> +
> +* `NBD_REPLY_TYPE_BLOCK_STATUS`
> +
> +    *length* MUST be a positive integer multiple of 8.  This reply
> +    represents a series of consecutive block descriptors where the sum
> +    of the lengths of the descriptors equals the length of the
> +    original request.  This chunk type MUST appear at most once in a
> +    structured reply. Valid as a reply to `NBD_CMD_BLOCK_STATUS.
> +
> +    The payload is structured as a list of one or more descriptors,
> +    each with this layout:
> +
> +        * 32 bits, length (unsigned, MUST NOT be zero)
> +        * 32 bits, status flags
> +
> +    The definition of the status flags is determined based on the
> +    flags present in the original request.
> +
> +* `NBD_CMD_BLOCK_STATUS`
> +
> +    A block status query request. Length and offset define the range
> +    of interest. Clients SHOULD NOT use this request unless the server
> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
> +    in turn requires the client to first negotiate structured replies.
> +    For a successful return, the server MUST use a structured reply,
> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file, and the sum of the *length* fields of each descriptor
> +    MUST match the overall *length* of the request.  The server SHOULD
> +    use different *status* values between consecutive descriptors, and
> +    SHOULD use descriptor lengths that are an integer multiple of 512
> +    bytes where possible (the first and last descriptor of an
> +    unaligned query being the most obvious places for an exception).
> +    The status flags are intentionally defined so that a server MAY
> +    always safely report a status of 0 for any block, although the
> +    server SHOULD return additional status values when they can be
> +    easily detected.
> +
> +    If an error occurs, the server SHOULD set the appropriate error
> +    code in the error field of either a simple reply or an error
> +    chunk.  However, if the error does not involve invalid usage (such
> +    as a request beyond the bounds of the file), a server MAY reply
> +    with a single block status descriptor with *length* matching the
> +    requested length, and *status* of 0 rather than reporting the
> +    error.
> +
> +    The type of information requested by the client is determined by
> +    the request flags, as follows:
> +
> +    1. Block provisioning status
> +
> +    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
> +    `NBD_FLAG_STATUS_DIRTY` clear, the server MUST return the
> +    provisioning status of the device, where the status field of each
> +    descriptor is determined by the following bits (all four
> +    combinations of these two bits are possible):
> +
> +      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
> +        (and future writes to that area may cause fragmentation or
> +        encounter an `ENOSPC` error); if clear, the block is allocated
> +        or the server could not otherwise determine its status.  Note
> +        that the use of `NBD_CMD_TRIM` is related to this status, but
> +        that the server MAY report a hole even where trim has not been
> +        requested, and also that a server MAY report allocation even
> +        where a trim has been requested.
> +      - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
> +        all zeroes; if clear, the block contents are not known.  Note
> +        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
> +        status, but that the server MAY report zeroes even where write
> +        zeroes has not been requested, and also that a server MAY
> +        report unknown content even where write zeroes has been
> +        requested.
> +
> +    The client SHOULD NOT read from an area that has both
> +    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.

Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
situation and would simply read directly. To fulfill this statement we
would need to receive the block status before every read operation.

Also something that is kind of missing from the document so far is
concurrency with other NBD clients. Certainly most users do not use NBD
for concurrent access to the backend storage. But for example the
sentence above ignores the fact that another client may work on the
backend and that the state may change after some time so that it may
still be necessary to read from an area with NBD_STATE_HOLE and
NBD_STATE_ZERO set.

Also it is uncertain if these status bits may change over time through
reorganization of backend storage, for example holes may be removed in
the backend and so on. Is it safe to cache this stuff?

Until now something like READ and WRITE where somehow atomic operations
in the protocol.

Best Regards,

Markus

> +
> +    2. Block dirtiness status
> +
> +    This command is meant to operate in tandem with other (non-NBD)
> +    channels to the server.  Generally, a "dirty" block is a block
> +    that has been written to by someone, but the exact meaning of "has
> +    been written" is left to the implementation.  For example, a
> +    virtual machine monitor could provide a (non-NBD) command to start
> +    tracking blocks written by the virtual machine.  A backup client
> +    can then connect to an NBD server provided by the virtual machine
> +    monitor and use `NBD_CMD_BLOCK_STATUS` with the
> +    `NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
> +    blocks that the virtual machine has changed.
> +
> +    An implementation that doesn't track the "dirtiness" state of
> +    blocks MUST either fail this command with `EINVAL`, or mark all
> +    blocks as dirty in the descriptor that it returns.  Upon receiving
> +    an `NBD_CMD_BLOCK_STATUS` command with the flag
> +    `NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
> +    status of the device, where the status field of each descriptor is
> +    determined by the following bit:
> +
> +      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
> +        portion of the file that is still clean because it has not
> +        been written; if clear, the block represents a portion of the
> +        file that is dirty, or where the server could not otherwise
> +        determine its status.
> +
> +A client MAY close the connection if it detects that the server has
> +sent an invalid chunks (such as lengths in the
> +`NBD_REPLY_TYPE_BLOCK_STATUS` not summing up to the requested length).
> +The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
> +request including one or more sectors beyond the size of the device.
> +
> +The extension adds the following new command flag:
> +
> +- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request dirtiness status
> +  rather than provisioning status.
> +
>  ## About this file
> 
>  This file tries to document the NBD protocol as it is currently
> 

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 19:34   ` Eric Blake
  2016-04-04 19:54     ` Denis V. Lunev
  2016-04-04 19:58     ` Alex Bligh
@ 2016-04-05 13:38     ` Paolo Bonzini
  2 siblings, 0 replies; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 13:38 UTC (permalink / raw)
  To: Eric Blake, Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Denis V. Lunev


[-- Attachment #1.1: Type: text/plain, Size: 608 bytes --]



On 04/04/2016 21:34, Eric Blake wrote:
> At this point, I was just trying to rebase the proposal as originally
> made by Denis and Pavel; perhaps they will have more insight on how they
> envisioned using the command, or on whether we should try harder to make
> this more of a two-way protocol (where the client can tell the server
> when to mark something as clean, or when to start tracking whether
> something is dirty).

I don't think so, this command (if I understand correctly) is meant for
the case where the dirty bitmap is static (e.g. the underlying image is
read-only).

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 23:03   ` Eric Blake
@ 2016-04-05 13:41     ` Paolo Bonzini
  2016-04-06  5:57     ` Denis V. Lunev
  1 sibling, 0 replies; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 13:41 UTC (permalink / raw)
  To: Eric Blake, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev


[-- Attachment #1.1: Type: text/plain, Size: 961 bytes --]



On 05/04/2016 01:03, Eric Blake wrote:
> 
> But while Alex and Denis were arguing that no one would ever query both
> things at once (and therefore, it might be better to make
> NBD_STATUS_HOLE and NBD_STATUS_CLEAN both be bit 0), your approach of
> having two separate request flags and allowing both at once would mean
> we do need to keep the status information separate (NBD_STATUS_HOLE is
> bit 0, NBD_STATUS_CLEAN is bit 2).

I agree that querying both is messy.  It would add complication to the
implementation and the usecases are separate enough.

Usually you would first query for dirtiness, and then perhaps ask for
allocation status on the dirty areas.  Getting back the allocation
status on the clean areas would make the request unnecessarily larger.
In addition, querying the dirtiness status should be extremely cheap,
while querying the allocation status might be expensive depending on the
underlying storage.

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05  4:05 ` [Qemu-devel] " Kevin Wolf
@ 2016-04-05 13:43   ` Paolo Bonzini
  2016-04-07 10:38     ` Vladimir Sementsov-Ogievskiy
  2016-04-07 15:35     ` Pavel Borzenkov
  0 siblings, 2 replies; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 13:43 UTC (permalink / raw)
  To: Kevin Wolf, Eric Blake
  Cc: nbd-general, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Wouter Verhelst, Denis V. Lunev



On 05/04/2016 06:05, Kevin Wolf wrote:
> The options I can think of is adding a request field "max number of
> descriptors" or a flag "only single descriptor" (with the assumption
> that clients always want one or unlimited), but maybe you have a better
> idea.

I think a limit is better.  Even if the client is ultimately going to
process the whole file, it may take a very long time and space to
retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
a time, I think it's simpler to put a limit of 1024 descriptors or so.

Paolo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05  9:24 ` [Qemu-devel] [Nbd] " Markus Pargmann
@ 2016-04-05 13:50   ` Paolo Bonzini
  2016-04-11  5:58     ` Markus Pargmann
  2016-04-05 14:14   ` Eric Blake
  1 sibling, 1 reply; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 13:50 UTC (permalink / raw)
  To: Markus Pargmann, nbd-general
  Cc: Kevin Wolf, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Wouter Verhelst, Denis V. Lunev


[-- Attachment #1.1: Type: text/plain, Size: 1394 bytes --]



On 05/04/2016 11:24, Markus Pargmann wrote:
> Also it is uncertain if these status bits may change over time through
> reorganization of backend storage, for example holes may be removed in
> the backend and so on. Is it safe to cache this stuff?

If there's no concurrent access, it is.  Even if it is out of date, it
is still a valid representation.  For example, suppose a file has a
hole.  The client knows that *its own client* (whoever's using /dev/nbd0
for example) should write to it before relying on the contents of the
area.  This remains true if a hole becomes a non-hole.

If there's concurrent access, all bets are off.  Suppose a zero area
loses the zero flag.  Client A knows that the area remains zero unless
someone has written to it.  If *another* client B has concurrently
written something, there must be a communication mechanism by which B
tells A to invalidate the cache, or A must not cache the
information---and probably should not request it in the first place.

> Until now something like READ and WRITE where somehow atomic operations
> in the protocol.

No, they weren't.  If you have overlapping I/O from multiple clients
there's no way to know what data you will get.  You might even get old
data and new data interspersed in a single read.  There's definitely no
guarantee of atomicity in either POSIX or NBD.

Thanks,

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05  9:24 ` [Qemu-devel] [Nbd] " Markus Pargmann
  2016-04-05 13:50   ` Paolo Bonzini
@ 2016-04-05 14:14   ` Eric Blake
  2016-04-05 20:50     ` Wouter Verhelst
  1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2016-04-05 14:14 UTC (permalink / raw)
  To: Markus Pargmann, nbd-general
  Cc: Kevin Wolf, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Denis V. Lunev, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2219 bytes --]

On 04/05/2016 03:24 AM, Markus Pargmann wrote:

>> +        requested.
>> +
>> +    The client SHOULD NOT read from an area that has both
>> +    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> 
> Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
> situation and would simply read directly. To fulfill this statement we
> would need to receive the block status before every read operation.

Because we already state that for NBD_CMD_TRIM, the client SHOULD NOT
read an area where a trim was requested without an intervening write,
precisely because the server MAY (but not MUST) cause the trim to create
bogus reads for that area until another write happens.  I was just
trying to explain that the representation of HOLE and not ZERO
represents the state created by a successful NBD_CMD_TRIM that the
server honors, without being able to guarantee zero reads.

> 
> Also something that is kind of missing from the document so far is
> concurrency with other NBD clients. Certainly most users do not use NBD
> for concurrent access to the backend storage. But for example the
> sentence above ignores the fact that another client may work on the
> backend and that the state may change after some time so that it may
> still be necessary to read from an area with NBD_STATE_HOLE and
> NBD_STATE_ZERO set.

That's missing from NBD in general, and I don't think this is the patch
to add it.  We already have concurrency with self issues, because NBD
server can handle requests out of order (within the bounds of FLUSH and
FUA modifiers).

> 
> Also it is uncertain if these status bits may change over time through
> reorganization of backend storage, for example holes may be removed in
> the backend and so on. Is it safe to cache this stuff?

If the client is the only thing modifying the drive, maybe we want to
make that additional constraint on the server.  But how best to word it,
or is it too tight of a specification?

> 
> Until now something like READ and WRITE where somehow atomic operations
> in the protocol.

Not really.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 20:03       ` Alex Bligh
  2016-04-04 20:08         ` Denis V. Lunev
@ 2016-04-05 14:15         ` Paolo Bonzini
  2016-04-05 15:01           ` Alex Bligh
  1 sibling, 1 reply; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 14:15 UTC (permalink / raw)
  To: Alex Bligh, Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst



On 04/04/2016 22:03, Alex Bligh wrote:
> So qemu-nbdserver has some idea of 'dirtiness', and you want
> to publish that, and the consumer is qemu (and only qemu),
> because only qemu knows what 'dirtiness' means in the
> sense the server provides it?

The consumer is not QEMU; the consumer is backup software which is not
part of QEMU (and could even be proprietary, ugh).

Paolo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 14:15         ` Paolo Bonzini
@ 2016-04-05 15:01           ` Alex Bligh
  2016-04-05 15:23             ` Paolo Bonzini
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Bligh @ 2016-04-05 15:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Wouter Verhelst, Denis V. Lunev


On 5 Apr 2016, at 15:15, Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> On 04/04/2016 22:03, Alex Bligh wrote:
>> So qemu-nbdserver has some idea of 'dirtiness', and you want
>> to publish that, and the consumer is qemu (and only qemu),
>> because only qemu knows what 'dirtiness' means in the
>> sense the server provides it?
> 
> The consumer is not QEMU; the consumer is backup software which is not
> part of QEMU (and could even be proprietary, ugh).

I'm missing how the ugh-author can write the software without knowing
exactly what the bit does. Or are you saying "that's a matter
for the qemu spec, not the nbd spec"?

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 15:01           ` Alex Bligh
@ 2016-04-05 15:23             ` Paolo Bonzini
  2016-04-05 15:27               ` Alex Bligh
  0 siblings, 1 reply; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 15:23 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Denis V. Lunev



On 05/04/2016 17:01, Alex Bligh wrote:
>>> >> So qemu-nbdserver has some idea of 'dirtiness', and you want
>>> >> to publish that, and the consumer is qemu (and only qemu),
>>> >> because only qemu knows what 'dirtiness' means in the
>>> >> sense the server provides it?
>> > 
>> > The consumer is not QEMU; the consumer is backup software which is not
>> > part of QEMU (and could even be proprietary, ugh).
> I'm missing how the ugh-author can write the software without knowing
> exactly what the bit does. Or are you saying "that's a matter
> for the qemu spec, not the nbd spec"?

Yes, that's it.  NBD defines a safe default and a general idea of what
it should be used for.

Paolo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 15:23             ` Paolo Bonzini
@ 2016-04-05 15:27               ` Alex Bligh
  2016-04-05 15:31                 ` Paolo Bonzini
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Bligh @ 2016-04-05 15:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Wouter Verhelst, Denis V. Lunev


On 5 Apr 2016, at 16:23, Paolo Bonzini <pbonzini@redhat.com> wrote:

>> I'm missing how the ugh-author can write the software without knowing
>> exactly what the bit does. Or are you saying "that's a matter
>> for the qemu spec, not the nbd spec"?
> 
> Yes, that's it.  NBD defines a safe default and a general idea of what
> it should be used for.

OK, so my argument then is we should not have a special case
'only Qemu knows what this does' bit if we can avoid at, and
instead have a general system of 'proprietary bits' (unfortunate
word but perhaps 'non-NBD').

Of course this is made harder as the NBD_CMD_ size is not
extensible (today).

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 15:27               ` Alex Bligh
@ 2016-04-05 15:31                 ` Paolo Bonzini
  0 siblings, 0 replies; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-05 15:31 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Wouter Verhelst, Denis V. Lunev



On 05/04/2016 17:27, Alex Bligh wrote:
>>> I'm missing how the ugh-author can write the software without knowing
>>> exactly what the bit does. Or are you saying "that's a matter
>>> for the qemu spec, not the nbd spec"?
> > 
> > Yes, that's it.  NBD defines a safe default and a general idea of what
> > it should be used for.
>
> OK, so my argument then is we should not have a special case
> 'only Qemu knows what this does' bit if we can avoid at, and
> instead have a general system of 'proprietary bits' (unfortunate
> word but perhaps 'non-NBD').

I think it is generic enough.  Ideally the backup software wouldn't see
the QEMU-specific bits, those would be provided by a management layer
(such as OpenStack/Virtuozzo/oVirt/ProxMox).  They would just know to
copy the dirty bits.

Paolo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 14:14   ` Eric Blake
@ 2016-04-05 20:50     ` Wouter Verhelst
  2016-04-11  6:07       ` Markus Pargmann
  0 siblings, 1 reply; 59+ messages in thread
From: Wouter Verhelst @ 2016-04-05 20:50 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Paolo Bonzini, Markus Pargmann, Denis V. Lunev

On Tue, Apr 05, 2016 at 08:14:01AM -0600, Eric Blake wrote:
> On 04/05/2016 03:24 AM, Markus Pargmann wrote:
> 
> >> +        requested.
> >> +
> >> +    The client SHOULD NOT read from an area that has both
> >> +    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> > 
> > Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
> > situation and would simply read directly. To fulfill this statement we
> > would need to receive the block status before every read operation.
> 
> Because we already state that for NBD_CMD_TRIM, the client SHOULD NOT
> read an area where a trim was requested without an intervening write,

Actually, we state that a client SHOULD NOT *make any assumption* about
the state (emphasis added). There's a difference.

If the client wants to write, there is no problem (but he'll probably
get garbage).

[...]
> > Also something that is kind of missing from the document so far is
> > concurrency with other NBD clients. Certainly most users do not use NBD
> > for concurrent access to the backend storage. But for example the
> > sentence above ignores the fact that another client may work on the
> > backend and that the state may change after some time so that it may
> > still be necessary to read from an area with NBD_STATE_HOLE and
> > NBD_STATE_ZERO set.
> 
> That's missing from NBD in general, and I don't think this is the patch
> to add it.  We already have concurrency with self issues, because NBD
> server can handle requests out of order (within the bounds of FLUSH and
> FUA modifiers).

I don't think NBD *needs* to add much in the way of concurrency, other
than write barriers etc; but having an explicit message "some other
process just modified offset X length Y" seems out of scope for NBD.

> > Also it is uncertain if these status bits may change over time through
> > reorganization of backend storage, for example holes may be removed in
> > the backend and so on. Is it safe to cache this stuff?
> 
> If the client is the only thing modifying the drive, maybe we want to
> make that additional constraint on the server.  But how best to word it,
> or is it too tight of a specification?

I think it's perfectly fine for the protocol to assume in general that
the client is the only writer, and that no other writers are
concurrently accessing the same device. Anything else would make the
protocol much more complex; and one of the features of NBD is, I think,
the fact that the protocol is not *too* complex.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 23:32         ` Eric Blake
  2016-04-05  7:16           ` Wouter Verhelst
@ 2016-04-05 21:44           ` Wouter Verhelst
  1 sibling, 0 replies; 59+ messages in thread
From: Wouter Verhelst @ 2016-04-05 21:44 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Paolo Bonzini, Denis V. Lunev

On Mon, Apr 04, 2016 at 05:32:34PM -0600, Eric Blake wrote:
> On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> > On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> >> saying about dirtiness, we would soon come to the fact, that
> >> we can have several dirtiness states regarding different
> >> lines of incremental backups. This complexity is hidden
> >> inside QEMU and it would be very difficult to publish and
> >> reuse it.
> > 
> > How about this then.
> > 
> > A reply to GET_BLOCK_STATUS containing chunks of this:
> > 
> > 32-bit length
> > 32-bit "snapshot status"
> > if bit 0 in the latter field is set, that means the block is allocated
> >   on the original device
> > if bit 1 is set, that means the block is allocated on the first-level
> >   snapshot
> > if bit 2 is set, that means the block is allocated on the second-level
> >   snapshot
> 
> The idea of allocation is orthogonal from the idea of reads as zeroes.
> That is, a client may usefully guarantee that something reads as zeroes,
> whether or not it is allocated (but knowing whether it is a hole or
> allocated will determine whether future writes to that area will cause
> file system fragmentation or be at risk of ENOSPC on thin-provisioning).
>  If we want to expose the notion of depth (and I'm not sure about that
> yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
> 'allocated at depth "bit-1"' (and bit 31 as 'allocated at depth 30 or
> greater).
> 
> I don't know if the idea of depth of allocation is useful enough to
> expose in this manner; qemu could certainly advertise depth if the
> protocol calls it out, but I'm still not sure whether knowing depth
> helps any algorithms.
> 
> > 
> > If all flags are cleared, that means the block is not allocated (i.e.,
> > is a hole) and MUST read as zeroes.
> 
> That's too strong.  NBD_CMD_TRIM says that we can create holes whose
> data does not necessarily read as zeroes (and SCSI definitely has
> semantics like this - not all devices guarantee zero reads when you
> UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
> with the faster unmapping operation at the expense of bad reads, or
> slower explicit writes).  Hence my complaint that we have to treat
> 'reads as zero' as an orthogonal bit to 'allocated at depth X'.

Right.

It would be possible to instead mark bit 0 as NBD_ALLOC_LEVEL_DATA (i.e.,
"not just zeroes" -- since I believe a flag value of "one" to indicate
"zeroes" is bound to cause confusion), bit 1 as NBD_ALLOC_LEVEL_CURRENT
(i.e., "allocated in the current snapshot level"), and bits 2 through 31
as NBD_ALLOC_LEVEL_2 through NBD_ALLOC_LEVEL_31, with
implementation-defined meanings, as above.

> > If a flag is set at a particular level X, that means the device is dirty
> > at the Xth-level snapshot.
> > 
> > If at least one flag is set for a region, that means the data may read
> > as "not zero".
> > 
> > The protocol does not define what it means to have multiple levels of
> > snapshots, other than:
> > 
> > - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
> >   Xth level flag if the Yth level flag is not also cleared at the same
> >   time, for any Y > X
> > - Any write (as above) MAY set or clear multiple levels of flags at the
> >   same time, as long as the above holds

Having thought about this on my way to work, I'm now no longer convinced
this is a good idea. I could imagine several scenarios where violating
these constraints might be useful, and none where not violating them
would buy us anything.

Instead, I suggest the following semantics:

- NBD_CMD_WRITE to a region which has NBD_ALLOC_LEVEL_DATA cleared MUST
  set that bit for that region
- NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM to a region which has
  NBD_ALLOC_LEVEL_DATA set MAY clear that bit for that region
- NBD_CMD_TRIM to a region which has one of the NBD_ALLOC_LEVEL_X flags
  set MAY just clear those flags (and no other). In that case, the
  contents of that region (when read) SHOULD be assumed to have changed,
  but is not defined, and is unlikely to read as zeroes if other
  NBD_ALLOC_LEVEL_* flags are still set.
- NBD_CMD_WRITE to a region which has NBD_ALLOC_LEVEL_CURRENT set SHOULD
  NOT fail with ENOSPC, although it MAY still fail with other errors.
- NBD_CMD_LEVEL_READ from a region which has NBD_ALLOC_LEVEL_DATA clared
  MUST read as zeroes

This may all be a bit overengineered, but it has the benefit of
being fairly generic. It also means that if the flags field is
all-zeroes, the region is "uninteresting".

Thoughts?

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-04 23:03   ` Eric Blake
  2016-04-05 13:41     ` Paolo Bonzini
@ 2016-04-06  5:57     ` Denis V. Lunev
  2016-04-06 14:08       ` Eric Blake
  1 sibling, 1 reply; 59+ messages in thread
From: Denis V. Lunev @ 2016-04-06  5:57 UTC (permalink / raw)
  To: Eric Blake, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Paolo Bonzini

On 04/05/2016 02:03 AM, Eric Blake wrote:
> On 04/04/2016 04:40 PM, Wouter Verhelst wrote:
>> Hi,
>>
>> Need to look into this in some detail, for which I don't have the time
>> (or the non-tiredness ;-) right now, but these two caught my eye:
>>
>>> +    The payload is structured as a list of one or more descriptors,
>>> +    each with this layout:
>>> +
>>> +        * 32 bits, length (unsigned, MUST NOT be zero)
>>> +        * 32 bits, status flags
>>> +
>>> +    The definition of the status flags is determined based on the
>>> +    flags present in the original request.
>> Might be a good idea to specify what a client can do with flags it
>> doesn't know about; ignore them, probably?
> Sounds correct - so a server can subdivide into more descriptors with
> additional status bits, and clients just have to ignore those extra bits
> and coalesce things itself when dealing with the bits it cares about.
>
>>   
>> [...]
>>> +The extension adds the following new command flag:
>>> +
>>> +- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
>>> +  SHOULD be set to 1 if the client wants to request dirtiness status
>>> +  rather than provisioning status.
>> Why only one flag here? I could imagine a client might want to query for
>> both states at the same time. Obviously that means a client needs to
>> query for *at least* one of the statuses, otherwise the server should
>> reply with EINVAL.
>>
>> Though I'm undecided on whether a bit being set to 0 should mean "give
>> me that information" or whether 1 should.
> Hmm. Based on that reading, maybe we want TWO flags:
>
> NBD_CMD_FLAG_STATUS_ALLOCATED
> NBD_CMD_FLAG_STATUS_DIRTY
>
> and require that the client MUST set at least one flag (setting no flags
> at all is boring), but similarly allow that the server MAY reject
> attempts to set multiple flags with EINVAL (if there is no efficient way
> to provide the information for all flags on a single pass), in which
> case clients have to fall back to querying one flag at a time.
>
> But while Alex and Denis were arguing that no one would ever query both
> things at once (and therefore, it might be better to make
> NBD_STATUS_HOLE and NBD_STATUS_CLEAN both be bit 0), your approach of
> having two separate request flags and allowing both at once would mean
> we do need to keep the status information separate (NBD_STATUS_HOLE is
> bit 0, NBD_STATUS_CLEAN is bit 2).
>
> I also see that I failed to reserve a bit for NBD_CMD_FLAG_STATUS_DIRTY.
>
> Looks like there will still be some more conversation and at least a v3
> needed, but I'll wait a couple days for that to happen so that more
> reviewers can chime in, without being too tired during the review process.
>
that looks correct to me. I agree that we should set only one flag
and reject the request with two of them.
Actually this is like "bitmap number", but we have limitation with 32
numbers only.

We could specify that the server MUST reply with "all 1" for unknown
flag. This would provide nice forward compatibility.

Den

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-06  5:57     ` Denis V. Lunev
@ 2016-04-06 14:08       ` Eric Blake
  0 siblings, 0 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-06 14:08 UTC (permalink / raw)
  To: Denis V. Lunev, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

On 04/05/2016 11:57 PM, Denis V. Lunev wrote:

>> Looks like there will still be some more conversation and at least a v3
>> needed, but I'll wait a couple days for that to happen so that more
>> reviewers can chime in, without being too tired during the review
>> process.
>>
> that looks correct to me. I agree that we should set only one flag
> and reject the request with two of them.
> Actually this is like "bitmap number", but we have limitation with 32
> numbers only.
> 
> We could specify that the server MUST reply with "all 1" for unknown
> flag. This would provide nice forward compatibility.

My v2 approach was to define the status so that "all 0" was the safe
default (hence, naming the flag "NBD_STATUS_CLEAN" and set to 1 only
when no longer dirty, not "NBD_STATUS_DIRTY" where 1 by default is safer).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 13:43   ` Paolo Bonzini
@ 2016-04-07 10:38     ` Vladimir Sementsov-Ogievskiy
  2016-04-07 16:10       ` Eric Blake
  2016-04-07 15:35     ` Pavel Borzenkov
  1 sibling, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2016-04-07 10:38 UTC (permalink / raw)
  To: Paolo Bonzini, Kevin Wolf, Eric Blake
  Cc: nbd-general, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Denis V. Lunev, Wouter Verhelst

On 05.04.2016 16:43, Paolo Bonzini wrote:
>
> On 05/04/2016 06:05, Kevin Wolf wrote:
>> The options I can think of is adding a request field "max number of
>> descriptors" or a flag "only single descriptor" (with the assumption
>> that clients always want one or unlimited), but maybe you have a better
>> idea.
> I think a limit is better.  Even if the client is ultimately going to
> process the whole file, it may take a very long time and space to
> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> a time, I think it's simpler to put a limit of 1024 descriptors or so.
>
> Paolo
>

I vote for the limit too. More over, I think, there should be two sides 
limit:

1. The client can specify the limit, so server should not return more 
extents than requested. Of course, server should chose sequential 
extents from the beginning of requested range.
2. Server side limit: if client asked too many extents or not specified 
a limit at all, server should not return all extents, but only 1024 (for 
ex.) from the beginning of the range.
2.1 And/or, why not allow the server use the power of structured reply 
and send several reply chunks? Why did you forbid this? (if I correctly 
understand "This chunk type MUST appear at most once in a structured 
reply.")

-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 13:43   ` Paolo Bonzini
  2016-04-07 10:38     ` Vladimir Sementsov-Ogievskiy
@ 2016-04-07 15:35     ` Pavel Borzenkov
  2016-04-07 15:43       ` Paolo Bonzini
  1 sibling, 1 reply; 59+ messages in thread
From: Pavel Borzenkov @ 2016-04-07 15:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, nbd-general, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev, Wouter Verhelst

On Tue, Apr 05, 2016 at 03:43:08PM +0200, Paolo Bonzini wrote:
> 
> 
> On 05/04/2016 06:05, Kevin Wolf wrote:
> > The options I can think of is adding a request field "max number of
> > descriptors" or a flag "only single descriptor" (with the assumption
> > that clients always want one or unlimited), but maybe you have a better
> > idea.
> 
> I think a limit is better.  Even if the client is ultimately going to
> process the whole file, it may take a very long time and space to
> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> a time, I think it's simpler to put a limit of 1024 descriptors or so.

Agree. I'm not sure that the value of the limit should be hard-coded
into the protocol, though. Why don't just allow a server to send less
data than requested (similar to what SCSI "GET LBA STATUS" allows) and
allow the server to choose the limit suitable for it (without directly
exposing it to the clients)?

> Paolo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-07 15:35     ` Pavel Borzenkov
@ 2016-04-07 15:43       ` Paolo Bonzini
  0 siblings, 0 replies; 59+ messages in thread
From: Paolo Bonzini @ 2016-04-07 15:43 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: Kevin Wolf, nbd-general, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev, Wouter Verhelst



On 07/04/2016 17:35, Pavel Borzenkov wrote:
> > On 05/04/2016 06:05, Kevin Wolf wrote:
> > > The options I can think of is adding a request field "max number of
> > > descriptors" or a flag "only single descriptor" (with the assumption
> > > that clients always want one or unlimited), but maybe you have a better
> > > idea.
> > 
> > I think a limit is better.  Even if the client is ultimately going to
> > process the whole file, it may take a very long time and space to
> > retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> > a time, I think it's simpler to put a limit of 1024 descriptors or so.
> 
> Agree. I'm not sure that the value of the limit should be hard-coded
> into the protocol, though. Why don't just allow a server to send less
> data than requested (similar to what SCSI "GET LBA STATUS" allows) and
> allow the server to choose the limit suitable for it (without directly
> exposing it to the clients)?

Yes, definitely.  The number was just an example.  Even 1 would be fine.

Paolo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-07 10:38     ` Vladimir Sementsov-Ogievskiy
@ 2016-04-07 16:10       ` Eric Blake
  2016-04-07 16:21         ` [Qemu-devel] [Nbd] " Alex Bligh
                           ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-07 16:10 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, Kevin Wolf
  Cc: nbd-general, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Denis V. Lunev, Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 4629 bytes --]

On 04/07/2016 04:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> On 05.04.2016 16:43, Paolo Bonzini wrote:
>>
>> On 05/04/2016 06:05, Kevin Wolf wrote:
>>> The options I can think of is adding a request field "max number of
>>> descriptors" or a flag "only single descriptor" (with the assumption
>>> that clients always want one or unlimited), but maybe you have a better
>>> idea.
>> I think a limit is better.  Even if the client is ultimately going to
>> process the whole file, it may take a very long time and space to
>> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
>> a time, I think it's simpler to put a limit of 1024 descriptors or so.
>>
>> Paolo
>>
> 
> I vote for the limit too. More over, I think, there should be two sides
> limit:
> 
> 1. The client can specify the limit, so server should not return more
> extents than requested. Of course, server should chose sequential
> extents from the beginning of requested range.

For the client to request a limit would entail that we enhance the
protocol to allow structured requests (where a wire-sniffer would know
how many bytes to read for the client's additional data, even if it does
not understand the extension's semantics).  Might not be a bad idea to
have this in the long run, but so far I've been reluctant to bite the
bullet.

> 2. Server side limit: if client asked too many extents or not specified
> a limit at all, server should not return all extents, but only 1024 (for
> ex.) from the beginning of the range.

Okay, I'm fairly convinced now that letting the server limit the reply
is a good thing, and that one doesn't require a structured request from
the client.  Since we just recently documented that strings should be no
more than 4096 bytes, and my v2 proposal used 8 bytes per descriptor,
maybe a good way to enforce a similar limit would be:

The server MAY choose to send fewer descriptors than what would describe
the full extent of the client's request, but MUST send at least one
descriptor unless an error is reported.  The server MUST NOT send more
than 512 descriptors, even if that does not completely describe the
client's requested length.

That way, a client in general should never expect more than ~4096 bytes
+ overhead on any server reply except a reply to NBD_CMD_READ, and can
therefore utilize stack allocation for all other replies (if we do this,
maybe we should make a hard rule that all future protocol extensions,
other than NBD_CMD_READ, will guarantee that a reply has a bounded size)

I also think it may be okay to let the server reply with MORE data than
the client requested, but only as long as it does not result in any
extra descriptors (that is, only the last descriptor can result in a
length beyond the client's request).  For example, if the client asks
for block status of 1M of the file, but the server can conveniently
learn via lseek(SEEK_HOLE) or other means that there are 2M of data
before status changes, then there's no reason to force the server to
throw away the information about the 1M beyond the client's read, and
the client might even be able to be more efficient in later requests.

> 2.1 And/or, why not allow the server use the power of structured reply
> and send several reply chunks? Why did you forbid this? (if I correctly
> understand "This chunk type MUST appear at most once in a structured
> reply.")

If we allow more than one chunk, then either every chunk has to include
an offset (more traffic over the wire), or the chunks have to be sent in
a particular order (we aren't gaining any benefits that NBD_CMD_READ
gains by allowing out-of-order transmission).  It's also more work for
the client to reconstruct if it has to reassemble; with NBD_CMD_READ,
the payload is dominated by the data being read, and you can pwrite()
the data into its final location as the client; but with
NBD_CMD_BLOCK_STATUS, the payload is dominated by the metadata and we
want to keep it minimal; and there is no convenient command for the
client to reassemble the information if received out of order.

Allowing for a short reply seems to be worth doing, but allowing for
multiple reply chunks seems not worth the risk.

I'm also starting to think that it is worth FIRST documenting an
extension for advertising block sizes, so that we can then couch
BLOCK_STATUS in those terms (a server MUST NOT subdivide status into
finer granularity than the advertised block sizes).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-07 16:10       ` Eric Blake
@ 2016-04-07 16:21         ` Alex Bligh
  2016-04-08 11:35         ` [Qemu-devel] " Kevin Wolf
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Alex Bligh @ 2016-04-07 16:21 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, nbd-general, Vladimir Sementsov-Ogievskiy,
	Stefan stefanha@redhat. com, qemu-devel, Pavel Borzenkov,
	Alex Bligh, Denis V. Lunev, Wouter Verhelst, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 347 bytes --]


On 7 Apr 2016, at 17:10, Eric Blake <eblake@redhat.com> wrote:

> I'm also starting to think that it is worth FIRST documenting an
> extension for advertising block sizes, so that we can then couch
> BLOCK_STATUS in those terms (a server MUST NOT subdivide status into
> finer granularity than the advertised block sizes).

+1

--
Alex Bligh





[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-07 16:10       ` Eric Blake
  2016-04-07 16:21         ` [Qemu-devel] [Nbd] " Alex Bligh
@ 2016-04-08 11:35         ` Kevin Wolf
  2016-04-09  9:08         ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-04-13 12:38         ` [Qemu-devel] " Pavel Borzenkov
  3 siblings, 0 replies; 59+ messages in thread
From: Kevin Wolf @ 2016-04-08 11:35 UTC (permalink / raw)
  To: Eric Blake
  Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, nbd-general,
	qemu-devel, Pavel Borzenkov, Stefan Hajnoczi, Wouter Verhelst,
	Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 4382 bytes --]

Am 07.04.2016 um 18:10 hat Eric Blake geschrieben:
> On 04/07/2016 04:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> > On 05.04.2016 16:43, Paolo Bonzini wrote:
> >>
> >> On 05/04/2016 06:05, Kevin Wolf wrote:
> >>> The options I can think of is adding a request field "max number of
> >>> descriptors" or a flag "only single descriptor" (with the assumption
> >>> that clients always want one or unlimited), but maybe you have a better
> >>> idea.
> >> I think a limit is better.  Even if the client is ultimately going to
> >> process the whole file, it may take a very long time and space to
> >> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> >> a time, I think it's simpler to put a limit of 1024 descriptors or so.
> >>
> >> Paolo
> >>
> > 
> > I vote for the limit too. More over, I think, there should be two sides
> > limit:
> > 
> > 1. The client can specify the limit, so server should not return more
> > extents than requested. Of course, server should chose sequential
> > extents from the beginning of requested range.
> 
> For the client to request a limit would entail that we enhance the
> protocol to allow structured requests (where a wire-sniffer would know
> how many bytes to read for the client's additional data, even if it does
> not understand the extension's semantics).  Might not be a bad idea to
> have this in the long run, but so far I've been reluctant to bite the
> bullet.

I still think that client specified limits are the more important part
because only the client knows what information it needs. Server limits
can protect the server against excessive requests from clients, but only
the client can protect against wasting resources by doing more than is
necessary.

> > 2. Server side limit: if client asked too many extents or not specified
> > a limit at all, server should not return all extents, but only 1024 (for
> > ex.) from the beginning of the range.
> 
> Okay, I'm fairly convinced now that letting the server limit the reply
> is a good thing, and that one doesn't require a structured request from
> the client.  Since we just recently documented that strings should be no
> more than 4096 bytes, and my v2 proposal used 8 bytes per descriptor,
> maybe a good way to enforce a similar limit would be:
> 
> The server MAY choose to send fewer descriptors than what would describe
> the full extent of the client's request, but MUST send at least one
> descriptor unless an error is reported.  The server MUST NOT send more
> than 512 descriptors, even if that does not completely describe the
> client's requested length.

Why have these 512 descriptors in the spec? Sounds to me as if it should
be a server configuration option.

> That way, a client in general should never expect more than ~4096 bytes
> + overhead on any server reply except a reply to NBD_CMD_READ, and can
> therefore utilize stack allocation for all other replies (if we do this,
> maybe we should make a hard rule that all future protocol extensions,
> other than NBD_CMD_READ, will guarantee that a reply has a bounded size)

Or have a client specified limit for all potentially unbounded replies.

> I also think it may be okay to let the server reply with MORE data than
> the client requested, but only as long as it does not result in any
> extra descriptors (that is, only the last descriptor can result in a
> length beyond the client's request).  For example, if the client asks
> for block status of 1M of the file, but the server can conveniently
> learn via lseek(SEEK_HOLE) or other means that there are 2M of data
> before status changes, then there's no reason to force the server to
> throw away the information about the 1M beyond the client's read, and
> the client might even be able to be more efficient in later requests.

If we say SHOULD instead of MAY, this might actually be a way around the
urgent practical problem that qemu needs a way to ask the question
"what's the status of this block and how many contiguous blocks have the
same status? (I don't care about the following extents.)" This is
actually the only way that qemu is currently planning to use the
command.

So if qemu can ask for "status at offset x, length 1 byte" and expect to
get the right reply to the above question, that might be workable.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-07 16:10       ` Eric Blake
  2016-04-07 16:21         ` [Qemu-devel] [Nbd] " Alex Bligh
  2016-04-08 11:35         ` [Qemu-devel] " Kevin Wolf
@ 2016-04-09  9:08         ` Wouter Verhelst
  2016-04-13 12:38         ` [Qemu-devel] " Pavel Borzenkov
  3 siblings, 0 replies; 59+ messages in thread
From: Wouter Verhelst @ 2016-04-09  9:08 UTC (permalink / raw)
  To: Eric Blake
  Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, Kevin Wolf,
	nbd-general, qemu-devel, Pavel Borzenkov, Stefan Hajnoczi,
	Denis V. Lunev

On Thu, Apr 07, 2016 at 10:10:58AM -0600, Eric Blake wrote:
> On 04/07/2016 04:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> > On 05.04.2016 16:43, Paolo Bonzini wrote:
> >>
> >> On 05/04/2016 06:05, Kevin Wolf wrote:
> >>> The options I can think of is adding a request field "max number of
> >>> descriptors" or a flag "only single descriptor" (with the assumption
> >>> that clients always want one or unlimited), but maybe you have a better
> >>> idea.
> >> I think a limit is better.  Even if the client is ultimately going to
> >> process the whole file, it may take a very long time and space to
> >> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> >> a time, I think it's simpler to put a limit of 1024 descriptors or so.
> >>
> >> Paolo
> >>
> > 
> > I vote for the limit too. More over, I think, there should be two sides
> > limit:
> > 
> > 1. The client can specify the limit, so server should not return more
> > extents than requested. Of course, server should chose sequential
> > extents from the beginning of requested range.
> 
> For the client to request a limit would entail that we enhance the
> protocol to allow structured requests (where a wire-sniffer would know
> how many bytes to read for the client's additional data, even if it does
> not understand the extension's semantics).  Might not be a bad idea to
> have this in the long run, but so far I've been reluctant to bite the
> bullet.
> 
> > 2. Server side limit: if client asked too many extents or not specified
> > a limit at all, server should not return all extents, but only 1024 (for
> > ex.) from the beginning of the range.
> 
> Okay, I'm fairly convinced now that letting the server limit the reply
> is a good thing, and that one doesn't require a structured request from
> the client.  Since we just recently documented that strings should be no
> more than 4096 bytes, and my v2 proposal used 8 bytes per descriptor,
> maybe a good way to enforce a similar limit would be:
> 
> The server MAY choose to send fewer descriptors than what would describe
> the full extent of the client's request, but MUST send at least one
> descriptor unless an error is reported.  The server MUST NOT send more
> than 512 descriptors, even if that does not completely describe the
> client's requested length.
> 
> That way, a client in general should never expect more than ~4096 bytes
> + overhead on any server reply except a reply to NBD_CMD_READ, and can
> therefore utilize stack allocation for all other replies (if we do this,
> maybe we should make a hard rule that all future protocol extensions,
> other than NBD_CMD_READ, will guarantee that a reply has a bounded size)

The downside is more requests. A client-requested limit on the number of
replies sounds like a good idea, but a protocol-requested limited
doesn't, to me.

We could have a flag that turns the "length" field into "length in
number of described extents" rather than "length in number of bytes".
That way, the client could say "please give me metadata on X extents
starting at offset Y", rather than "please give me data on the extents
starting at offset X for Y bytes of length"

> I also think it may be okay to let the server reply with MORE data than
> the client requested, but only as long as it does not result in any
> extra descriptors (that is, only the last descriptor can result in a
> length beyond the client's request).  For example, if the client asks
> for block status of 1M of the file, but the server can conveniently
> learn via lseek(SEEK_HOLE) or other means that there are 2M of data
> before status changes, then there's no reason to force the server to
> throw away the information about the 1M beyond the client's read, and
> the client might even be able to be more efficient in later requests.

Yes, definitely. I had originally wanted to suggest that rather than my
above as a way to allow clients to do limited-extent requests (by just
declaring the above and stating that asking information about an offset
with zero-length is legal), but I think my other suggestion might be
more interesting.

> > 2.1 And/or, why not allow the server use the power of structured reply
> > and send several reply chunks? Why did you forbid this? (if I correctly
> > understand "This chunk type MUST appear at most once in a structured
> > reply.")
> 
> If we allow more than one chunk, then either every chunk has to include
> an offset (more traffic over the wire),

We can state that every chunk (as in, description of multiple extents
sent in one structured reply) has an offset, and that all extents
described in that chunk MUST be sequential.

> or the chunks have to be sent in
> a particular order (we aren't gaining any benefits that NBD_CMD_READ
> gains by allowing out-of-order transmission).  It's also more work for
> the client to reconstruct if it has to reassemble; with NBD_CMD_READ,
> the payload is dominated by the data being read, and you can pwrite()
> the data into its final location as the client; but with
> NBD_CMD_BLOCK_STATUS, the payload is dominated by the metadata and we
> want to keep it minimal; and there is no convenient command for the
> client to reassemble the information if received out of order.

There is a DF flag, which could be allowed here as well.

> Allowing for a short reply seems to be worth doing, but allowing for
> multiple reply chunks seems not worth the risk.

Disagree.

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 13:50   ` Paolo Bonzini
@ 2016-04-11  5:58     ` Markus Pargmann
  0 siblings, 0 replies; 59+ messages in thread
From: Markus Pargmann @ 2016-04-11  5:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Eric Blake, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev, Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 2445 bytes --]

On Tuesday 05 April 2016 15:50:16 Paolo Bonzini wrote:
> 
> On 05/04/2016 11:24, Markus Pargmann wrote:
> > Also it is uncertain if these status bits may change over time through
> > reorganization of backend storage, for example holes may be removed in
> > the backend and so on. Is it safe to cache this stuff?
> 
> If there's no concurrent access, it is.  Even if it is out of date, it
> is still a valid representation.  For example, suppose a file has a
> hole.  The client knows that *its own client* (whoever's using /dev/nbd0
> for example) should write to it before relying on the contents of the
> area.  This remains true if a hole becomes a non-hole.
> 
> If there's concurrent access, all bets are off.  Suppose a zero area
> loses the zero flag.  Client A knows that the area remains zero unless
> someone has written to it.  If *another* client B has concurrently
> written something, there must be a communication mechanism by which B
> tells A to invalidate the cache, or A must not cache the
> information---and probably should not request it in the first place.
> 
> > Until now something like READ and WRITE where somehow atomic operations
> > in the protocol.
> 
> No, they weren't.  If you have overlapping I/O from multiple clients
> there's no way to know what data you will get.  You might even get old
> data and new data interspersed in a single read.  There's definitely no
> guarantee of atomicity in either POSIX or NBD.

At least from the protocol side they were atomic. You had one read/write that
does not rely on any other commands before or after.

I think this was mostly a misunderstanding of the "SHOULD" sentence. I was
assuming that should would more be like a "must" which would have introduced
strange requirements for the client, for example that it has to receive the
block status before making any reads. And this as well would have lead to
possibly complicated interaction between two clients receiving block status
before reading.

But there was a mail recently which referenced rfc2119 so everything is fine
with that.

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-05 20:50     ` Wouter Verhelst
@ 2016-04-11  6:07       ` Markus Pargmann
  0 siblings, 0 replies; 59+ messages in thread
From: Markus Pargmann @ 2016-04-11  6:07 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Eric Blake, nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3601 bytes --]

On Tuesday 05 April 2016 22:50:51 Wouter Verhelst wrote:
> On Tue, Apr 05, 2016 at 08:14:01AM -0600, Eric Blake wrote:
> > On 04/05/2016 03:24 AM, Markus Pargmann wrote:
> > 
> > >> +        requested.
> > >> +
> > >> +    The client SHOULD NOT read from an area that has both
> > >> +    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> > > 
> > > Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
> > > situation and would simply read directly. To fulfill this statement we
> > > would need to receive the block status before every read operation.
> > 
> > Because we already state that for NBD_CMD_TRIM, the client SHOULD NOT
> > read an area where a trim was requested without an intervening write,
> 
> Actually, we state that a client SHOULD NOT *make any assumption* about
> the state (emphasis added). There's a difference.
> 
> If the client wants to write, there is no problem (but he'll probably
> get garbage).
> 
> [...]
> > > Also something that is kind of missing from the document so far is
> > > concurrency with other NBD clients. Certainly most users do not use NBD
> > > for concurrent access to the backend storage. But for example the
> > > sentence above ignores the fact that another client may work on the
> > > backend and that the state may change after some time so that it may
> > > still be necessary to read from an area with NBD_STATE_HOLE and
> > > NBD_STATE_ZERO set.
> > 
> > That's missing from NBD in general, and I don't think this is the patch
> > to add it.  We already have concurrency with self issues, because NBD
> > server can handle requests out of order (within the bounds of FLUSH and
> > FUA modifiers).
> 
> I don't think NBD *needs* to add much in the way of concurrency, other
> than write barriers etc; but having an explicit message "some other
> process just modified offset X length Y" seems out of scope for NBD.

Yes it certainly is out of scope to add explicit messages for concurrency. I
only thought about some general information for the reader to have a small
paragraph in the protocol documentation that states that concurrent access can
occur at all times and that no assumptions can be made about the state of the
server and its backend storage at any times.

> 
> > > Also it is uncertain if these status bits may change over time through
> > > reorganization of backend storage, for example holes may be removed in
> > > the backend and so on. Is it safe to cache this stuff?
> > 
> > If the client is the only thing modifying the drive, maybe we want to
> > make that additional constraint on the server.  But how best to word it,
> > or is it too tight of a specification?
> 
> I think it's perfectly fine for the protocol to assume in general that
> the client is the only writer, and that no other writers are
> concurrently accessing the same device. Anything else would make the
> protocol much more complex; and one of the features of NBD is, I think,
> the fact that the protocol is not *too* complex.

Yes this definitely is a feature. I just think that concurrent access was
always possible and we should not by accident remove that possibility. But as
far as I can see that's not an issue here.

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-07 16:10       ` Eric Blake
                           ` (2 preceding siblings ...)
  2016-04-09  9:08         ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-04-13 12:38         ` Pavel Borzenkov
  2016-04-13 14:40           ` Eric Blake
  3 siblings, 1 reply; 59+ messages in thread
From: Pavel Borzenkov @ 2016-04-13 12:38 UTC (permalink / raw)
  To: Eric Blake
  Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, Kevin Wolf,
	nbd-general, qemu-devel, Stefan Hajnoczi, Wouter Verhelst,
	Denis V. Lunev

Hi Eric,

On Thu, Apr 07, 2016 at 10:10:58AM -0600, Eric Blake wrote:
> On 04/07/2016 04:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> > On 05.04.2016 16:43, Paolo Bonzini wrote:
> >>
> >> On 05/04/2016 06:05, Kevin Wolf wrote:
> >>> The options I can think of is adding a request field "max number of
> >>> descriptors" or a flag "only single descriptor" (with the assumption
> >>> that clients always want one or unlimited), but maybe you have a better
> >>> idea.
> >> I think a limit is better.  Even if the client is ultimately going to
> >> process the whole file, it may take a very long time and space to
> >> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> >> a time, I think it's simpler to put a limit of 1024 descriptors or so.
> >>
> >> Paolo
> >>
> > 
> > I vote for the limit too. More over, I think, there should be two sides
> > limit:
> > 
> > 1. The client can specify the limit, so server should not return more
> > extents than requested. Of course, server should chose sequential
> > extents from the beginning of requested range.
> 
> For the client to request a limit would entail that we enhance the
> protocol to allow structured requests (where a wire-sniffer would know
> how many bytes to read for the client's additional data, even if it does
> not understand the extension's semantics).  Might not be a bad idea to
> have this in the long run, but so far I've been reluctant to bite the
> bullet.
> 
> > 2. Server side limit: if client asked too many extents or not specified
> > a limit at all, server should not return all extents, but only 1024 (for
> > ex.) from the beginning of the range.
> 
> Okay, I'm fairly convinced now that letting the server limit the reply
> is a good thing, and that one doesn't require a structured request from
> the client.  Since we just recently documented that strings should be no
> more than 4096 bytes, and my v2 proposal used 8 bytes per descriptor,
> maybe a good way to enforce a similar limit would be:
> 
> The server MAY choose to send fewer descriptors than what would describe
> the full extent of the client's request, but MUST send at least one
> descriptor unless an error is reported.  The server MUST NOT send more
> than 512 descriptors, even if that does not completely describe the
> client's requested length.
> 
> That way, a client in general should never expect more than ~4096 bytes
> + overhead on any server reply except a reply to NBD_CMD_READ, and can
> therefore utilize stack allocation for all other replies (if we do this,
> maybe we should make a hard rule that all future protocol extensions,
> other than NBD_CMD_READ, will guarantee that a reply has a bounded size)
> 
> I also think it may be okay to let the server reply with MORE data than
> the client requested, but only as long as it does not result in any
> extra descriptors (that is, only the last descriptor can result in a
> length beyond the client's request).  For example, if the client asks
> for block status of 1M of the file, but the server can conveniently
> learn via lseek(SEEK_HOLE) or other means that there are 2M of data
> before status changes, then there's no reason to force the server to
> throw away the information about the 1M beyond the client's read, and
> the client might even be able to be more efficient in later requests.
> 
> > 2.1 And/or, why not allow the server use the power of structured reply
> > and send several reply chunks? Why did you forbid this? (if I correctly
> > understand "This chunk type MUST appear at most once in a structured
> > reply.")
> 
> If we allow more than one chunk, then either every chunk has to include
> an offset (more traffic over the wire), or the chunks have to be sent in
> a particular order (we aren't gaining any benefits that NBD_CMD_READ
> gains by allowing out-of-order transmission).  It's also more work for
> the client to reconstruct if it has to reassemble; with NBD_CMD_READ,
> the payload is dominated by the data being read, and you can pwrite()
> the data into its final location as the client; but with
> NBD_CMD_BLOCK_STATUS, the payload is dominated by the metadata and we
> want to keep it minimal; and there is no convenient command for the
> client to reassemble the information if received out of order.
> 
> Allowing for a short reply seems to be worth doing, but allowing for
> multiple reply chunks seems not worth the risk.
> 
> I'm also starting to think that it is worth FIRST documenting an
> extension for advertising block sizes, so that we can then couch
> BLOCK_STATUS in those terms (a server MUST NOT subdivide status into
> finer granularity than the advertised block sizes).

Why do you need to operate with blocks instead of list of extents?
What benefits will this approach provide for a client or a server?

Are you still working on the spec? I can update the patch with
information about server-side limit/beyond request's length replies and
post v3, so that things keep moving forward.

-- 
Pavel

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
  2016-04-13 12:38         ` [Qemu-devel] " Pavel Borzenkov
@ 2016-04-13 14:40           ` Eric Blake
  0 siblings, 0 replies; 59+ messages in thread
From: Eric Blake @ 2016-04-13 14:40 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, Kevin Wolf,
	nbd-general, qemu-devel, Stefan Hajnoczi, Wouter Verhelst,
	Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1268 bytes --]

On 04/13/2016 06:38 AM, Pavel Borzenkov wrote:
>> I'm also starting to think that it is worth FIRST documenting an
>> extension for advertising block sizes, so that we can then couch
>> BLOCK_STATUS in those terms (a server MUST NOT subdivide status into
>> finer granularity than the advertised block sizes).
> 
> Why do you need to operate with blocks instead of list of extents?

It's still a list of extents, just that having a definition of block
sizes means that we can then require that the list of extents will not
be subdivided smaller than a particular block size (so the client
doesn't have to worry about a server overwhelming it with extents
covering 1 byte each).

> What benefits will this approach provide for a client or a server?
> 
> Are you still working on the spec? I can update the patch with
> information about server-side limit/beyond request's length replies and
> post v3, so that things keep moving forward.

You're welcome to post a v3, if you don't want to wait for me to get
around to it.  There's a lot going on in the spec right now, and I'm
hoping to help flush some of the pending patches first.


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-04-13 14:40 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
2016-04-04 18:06 ` [Qemu-devel] [Nbd] " Alex Bligh
2016-04-04 19:34   ` Eric Blake
2016-04-04 19:54     ` Denis V. Lunev
2016-04-04 20:03       ` Alex Bligh
2016-04-04 20:08         ` Denis V. Lunev
2016-04-04 20:34           ` Eric Blake
2016-04-04 21:06             ` Denis V. Lunev
2016-04-04 21:12             ` Alex Bligh
2016-04-05 14:15         ` Paolo Bonzini
2016-04-05 15:01           ` Alex Bligh
2016-04-05 15:23             ` Paolo Bonzini
2016-04-05 15:27               ` Alex Bligh
2016-04-05 15:31                 ` Paolo Bonzini
2016-04-04 23:08       ` Wouter Verhelst
2016-04-04 23:32         ` Eric Blake
2016-04-05  7:16           ` Wouter Verhelst
2016-04-05 21:44           ` Wouter Verhelst
2016-04-05  7:13         ` Alex Bligh
2016-04-04 19:58     ` Alex Bligh
2016-04-04 20:04       ` Denis V. Lunev
2016-04-04 20:08         ` Alex Bligh
2016-04-04 20:13           ` Denis V. Lunev
2016-04-04 20:15             ` Alex Bligh
2016-04-04 20:27               ` Denis V. Lunev
2016-04-04 20:45                 ` Eric Blake
2016-04-04 21:04                   ` Denis V. Lunev
2016-04-04 21:12                     ` Alex Bligh
2016-04-04 21:17                     ` Eric Blake
2016-04-04 21:27                       ` Denis V. Lunev
2016-04-04 20:26           ` Eric Blake
2016-04-04 21:07             ` Alex Bligh
2016-04-04 21:25               ` Eric Blake
2016-04-04 22:06                 ` Alex Bligh
2016-04-04 20:22       ` Eric Blake
2016-04-05 13:38     ` Paolo Bonzini
2016-04-04 22:40 ` Wouter Verhelst
2016-04-04 23:03   ` Eric Blake
2016-04-05 13:41     ` Paolo Bonzini
2016-04-06  5:57     ` Denis V. Lunev
2016-04-06 14:08       ` Eric Blake
2016-04-05  4:05 ` [Qemu-devel] " Kevin Wolf
2016-04-05 13:43   ` Paolo Bonzini
2016-04-07 10:38     ` Vladimir Sementsov-Ogievskiy
2016-04-07 16:10       ` Eric Blake
2016-04-07 16:21         ` [Qemu-devel] [Nbd] " Alex Bligh
2016-04-08 11:35         ` [Qemu-devel] " Kevin Wolf
2016-04-09  9:08         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-04-13 12:38         ` [Qemu-devel] " Pavel Borzenkov
2016-04-13 14:40           ` Eric Blake
2016-04-07 15:35     ` Pavel Borzenkov
2016-04-07 15:43       ` Paolo Bonzini
2016-04-05  8:51 ` Stefan Hajnoczi
2016-04-05  9:24 ` [Qemu-devel] [Nbd] " Markus Pargmann
2016-04-05 13:50   ` Paolo Bonzini
2016-04-11  5:58     ` Markus Pargmann
2016-04-05 14:14   ` Eric Blake
2016-04-05 20:50     ` Wouter Verhelst
2016-04-11  6:07       ` Markus Pargmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.