qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/2] NBD protocol extensions: WRITE_ZEROES and GET_LBA_STATUS
@ 2016-03-23 14:16 Denis V. Lunev
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
  0 siblings, 2 replies; 54+ messages in thread
From: Denis V. Lunev @ 2016-03-23 14:16 UTC (permalink / raw)
  To: nbd-general, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Pavel Borzenkov, Stefan Hajnoczi, den,
	Wouter Verhelst, Roman Kagan

This patch set adds two new experimental extensions to the NBD protocol:

- WRITE_ZEROES, which describes a command for efficient zeroing of block
  device (or a part of it) without unneeded data transfer;
- GET_LBA_STATUS, which describes a command for quering block allocation
  state and block dirtiness status.

These commands provides minimal set of operations with block device sparness
and are very useful for mirroring and backing up of QEMU block device.

Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>

Pavel Borzenkov (2):
  NBD proto: add WRITE_ZEROES extension
  NBD proto: add GET_LBA_STATUS extension

 doc/proto.md | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-23 14:16 [Qemu-devel] [PATCH 0/2] NBD protocol extensions: WRITE_ZEROES and GET_LBA_STATUS Denis V. Lunev
@ 2016-03-23 14:16 ` Denis V. Lunev
  2016-03-23 15:14   ` Eric Blake
  2016-03-23 17:21   ` Wouter Verhelst
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
  1 sibling, 2 replies; 54+ messages in thread
From: Denis V. Lunev @ 2016-03-23 14:16 UTC (permalink / raw)
  To: nbd-general, qemu-devel
  Cc: Kevin Wolf, Pavel Borzenkov, Stefan Hajnoczi, Paolo Bonzini,
	Wouter Verhelst, den

From: Pavel Borzenkov <pborzenkov@virtuozzo.com>

There exist some cases when a client knows that the data it is going to
write is all zeroes. Such cases include mirroring or backing up a device
implemented by a sparse file.

With current NBD command set, the client has to issue NBD_CMD_WRITE
command with zeroed payload and transfer these zero bytes through the
wire. The server has to write the data onto disk, effectively denying
the sparseness.

To remedy this, the patch adds WRITE_ZEROES extension with one new
NBD_CMD_WRITE_ZEROES command.

Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/doc/proto.md b/doc/proto.md
index 463ef8a..cda213c 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
   schedule I/O accesses as for a rotational medium
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands
+- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
+  supports `NBD_CMD_WRITE_ZEROES` commands
 
 ##### Client flags
 
@@ -471,6 +473,10 @@ The following request types exist:
     about the contents of the export affected by this command, until
     overwriting it again with `NBD_CMD_WRITE`.
 
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    Defined by the experimental `WRITE_ZEROES` extension; see below.
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -594,6 +600,44 @@ option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.
 
+### `WRITE_ZEROES` extension
+
+There exist some cases when a client knows that the data it is going to write
+is all zeroes. Such cases include mirroring or backing up a device implemented
+by a sparse file. With current NBD command set, the client has to issue
+`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
+through the wire. The server has to write the data onto disk, effectively
+denying the sparseness.
+
+To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
+one new command with two command flags.
+
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    A write request with no payload. Length and offset define the location
+    and amount of data to be zeroed.
+
+    The server MUST zero out the data on disk, and then send the reply
+    message. The server MAY send the reply message before the data has
+    reached permanent storage.
+
+    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
+    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
+    in the command flags field. If this flag was set, the server MUST NOT send
+    the reply until it has ensured that the newly-zeroed data has reached
+    permanent storage.
+
+    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
+    command flags field, the server MAY use trimming to zero out the area,
+    but it MUST ensure that the data reads back as zero.
+
+    If an error occurs, the server SHOULD set the appropriate error code
+    in the error field. The server MAY then close the connection.
+
+The server SHOULD return `ENOSPC` if it receives a write zeroes request
+including one or more sectors beyond the size of the device. It SHOULD
+return `EPERM` if it receives a write zeroes request on a read-only export.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 14:16 [Qemu-devel] [PATCH 0/2] NBD protocol extensions: WRITE_ZEROES and GET_LBA_STATUS Denis V. Lunev
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
@ 2016-03-23 14:16 ` Denis V. Lunev
  2016-03-23 16:27   ` Eric Blake
                     ` (4 more replies)
  1 sibling, 5 replies; 54+ messages in thread
From: Denis V. Lunev @ 2016-03-23 14:16 UTC (permalink / raw)
  To: nbd-general, qemu-devel
  Cc: Kevin Wolf, Pavel Borzenkov, Stefan Hajnoczi, Paolo Bonzini,
	Wouter Verhelst, den

From: Pavel Borzenkov <pborzenkov@virtuozzo.com>

With the availability of sparse storage formats, it is often needed to
query status of a particular LBA range and read only those blocks of
data that are actually present on the block device.

To provide such information, the patch adds GET_LBA_STATUS extension
with one new NBD_CMD_GET_LBA_STATUS command.

There exists a concept of data dirtiness, which is required during, for
example, incremental block device backup. To express this concept via
NBD protocol, this patch also adds additional mode of operation to
NBD_CMD_GET_LBA_STATUS command.

Since NBD protocol has no notion of block size, and to mimic SCSI "GET
LBA STATUS" command more closely, it has been chosen to return a list of
extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
bitmap.

Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/doc/proto.md b/doc/proto.md
index cda213c..fff515d 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
   `NBD_CMD_TRIM` commands
 - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
   supports `NBD_CMD_WRITE_ZEROES` commands
+- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
+  supports `NBD_CMD_GET_LBA_STATUS` commands
 
 ##### Client flags
 
@@ -477,6 +479,10 @@ The following request types exist:
 
     Defined by the experimental `WRITE_ZEROES` extension; see below.
 
+* `NBD_CMD_GET_LBA_STATUS` (7)
+
+    Defined by the experimental `GET_LBA_STATUS` extension; see below.
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
 including one or more sectors beyond the size of the device. It SHOULD
 return `EPERM` if it receives a write zeroes request on a read-only export.
 
+### `GET_LBA_STATUS` extension
+
+With the availability of sparse storage formats, it is often needed to query
+status of a particular LBA range and read only those blocks of data that are
+actually present on the block device.
+
+Some storage formats and operations over such formats express a concept of
+data dirtiness. Whether the operation is block device mirroring,
+incremental block device backup or any other operation with a concept of
+data dirtiness, they all share a need to provide a list of LBA ranges
+that this particular operation treats as dirty.
+
+To provide such class of information, `GET_LBA_STATUS` extension adds new
+`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
+their respective states.
+
+* `NBD_CMD_GET_LBA_STATUS` (7)
+
+    An LBA range status query request. Length and offset define the range
+    of interest. The server MUST reply with a reply header, followed
+    immediately by the following data:
+
+      - 32 bits, length of parameter data that follow (unsigned)
+      - zero or more LBA status descriptors, each having the following
+        structure:
+
+        * 64 bits, offset (unsigned)
+        * 32 bits, length (unsigned)
+        * 16 bits, status (unsigned)
+
+    unless an error condition has occurred.
+
+    If an error occurs, the server SHOULD set the appropriate error code
+    in the error field. The server MUST then either close the
+    connection, or send *length of parameter data* bytes of data
+    (which MAY be invalid).
+
+    The type of information required by the client is passed to server in the
+    command flags field. If the server does not implement requested type or
+    have no means to express it, it MUST NOT return an error, but instead MUST
+    return a single LBA status descriptor with *offset* and *length* equal to
+    the *offset* and *length* from request, and *status* set to `0`.
+
+    The following request types are currently defined for the command:
+
+    1. Block provisioning state
+
+    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
+    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
+    the provisioning state of the device. The following provisionnig states
+    are defined for the command:
+
+      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
+      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
+        and contains zeroes;
+      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
+        block device. A client MUST NOT make any assumptions about the
+        contents of the extent.
+
+    2. Block dirtiness state
+
+    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
+    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
+    the dirtiness status of the device. The following dirtiness states
+    are defined for the command:
+
+      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
+      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
+
+    Generic NBD client implementation without knowledge of a particular NBD
+    server operation MUST NOT make any assumption on the meaning of the
+    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
+
+The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request
+including one or more sectors beyond the size of the device.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
@ 2016-03-23 15:14   ` Eric Blake
  2016-03-23 17:40     ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-03-24  7:16     ` [Qemu-devel] " Pavel Borzenkov
  2016-03-23 17:21   ` Wouter Verhelst
  1 sibling, 2 replies; 54+ messages in thread
From: Eric Blake @ 2016-03-23 15:14 UTC (permalink / raw)
  To: Denis V. Lunev, nbd-general, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Pavel Borzenkov, Stefan Hajnoczi,
	Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 6168 bytes --]

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 

>  
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.

If this patch goes in to the NBD sources, the extension is not
experimental any more, right?  Shouldn't the patch be written under the
assumption that it will be the final form of the text, even though the
feature is experimental until then? [1]

> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +denying the sparseness.

s/denying/losing/ ?

> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds

s/is envisioned/exists/ - again, under the argument that once it is part
of this document, it is no longer experimental.

/me goes and reads the actual proto.md file, and light bulb turns on...

[1] Oh, you ARE adding this to the "Experimental extensions" section of
the document, so your wording IS correct.  I guess the idea is that we
write up the documentation in the experimental section, tweak qemu to
implement it both as NBD client and as NBD server (since qemu has code
that can serve in both positions), see how well it worked, and THEN do a
followup patch to proto.md to move the text into the non-experimental
section, along with any tweaks learned during the qemu patches.

> +one new command with two command flags.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> +    in the command flags field. If this flag was set, the server MUST NOT send
> +    the reply until it has ensured that the newly-zeroed data has reached
> +    permanent storage.

Do we want to add:

The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
the export flags did not include NBD_FLAG_SEND_FUA.

> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> +    command flags field, the server MAY use trimming to zero out the area,
> +    but it MUST ensure that the data reads back as zero.

Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
but never defined with a fixed value.  Your text above defines it to
'bit 0' in the command flags field - is that correct?  If so, should we
add a section to the document that lists the bit values of all supported
command flags?

Meanwhile, your proposed text hardcodes NBD_CMD_FLAG_MAY_TRIM to 'bit
1'; that might also be worth adding into the same new section of the
document documenting all supported command flags.

Do we want to require that the server has negotiated the
NBD_FLAG_SEND_TRIM export flag prior to allowing the
NBD_CMD_FLAG_MAY_TRIM flag in this command?

Possibly-related bug in the existing spec: Should the text of the
NBD_CMD_TRIM (4) command mention the desired server behavior if the
client sends NBD_CMD_TRIM even though NBD_FLAG_SEND_TRIM was not
negotiated in the export flags?  Similarly, should the text of the
NBD_CMD_FLUSH () command mention the desired server behavior if the
client sends NBD_CMD_FLUSH even though NBD_FLAG_SEND_FLUSH was not
negotiated in the export flags?  At least NBD_CMD_FLUSH recommended that
the client must not send the command if the feature was not negotiated.

> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MAY then close the connection.
> +
> +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> +including one or more sectors beyond the size of the device. It SHOULD
> +return `EPERM` if it receives a write zeroes request on a read-only export.

Should we add a paragraph stating that the client MUST NOT send
NBD_CMD_WRITE_ZEROES if NBD_FLAG_SEND_WRITE_ZEROES was not negotiated in
the export options?  Similarly, should we suggest that the server reply
with EINVAL if it knows about the command, but the client issues the
command in spite of not negotiating it?  Should we enhance the
documentation in the "Error values" heading to mention that EINVAL
should be used in general for any client command not expected by the server?

> +
>  ## About this file
>  
>  This file tries to document the NBD protocol as it is currently
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
@ 2016-03-23 16:27   ` Eric Blake
  2016-03-24 12:30     ` Pavel Borzenkov
  2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Eric Blake @ 2016-03-23 16:27 UTC (permalink / raw)
  To: Denis V. Lunev, nbd-general, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Pavel Borzenkov, Stefan Hajnoczi,
	Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 10400 bytes --]

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.

The acronym LBA is not used elsewhere in the NBD spec; should we spell
it out at least once?

> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 

>  
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:
> +
> +      - 32 bits, length of parameter data that follow (unsigned)

Is this length the number of descriptors, or the number of bytes
occupied by those descriptors?  It looks like bytes (that is, with the
current layout, this field should be a multiple of 14 unless an error is
returned and the data is bogus).

I guess 32 bits is sufficient: transmission commands are limited to
32-bit length, and we are unlikely to have more than one extent per 512
bytes of length, so even if we have a header for every single sector
(worst-case for alternating clean/dirty sectors), as long as the
smallest granularity of an extent is larger than the extent field, the
'length of parameter data' in bytes is still sufficient.

> +      - zero or more LBA status descriptors, each having the following

zero or more? [1]

> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)

An array of these status descriptors is packed on the wire, while the
typical C layout of an array of these structures will have padding to
reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
that clients and servers do not have to pack/unpack between 14 bytes on
the wire and 16 bytes in efficient array handling, at the expense of
more network traffic?

Conversely, it would be possible to send less data over the wire, as
long as we require that all LBA status descriptors cover consecutive
offsets.  That is, having the server reply with offsets is pointless,
since they can be reconstructed on the client by starting with the
offset in the client's request, then adding length from each status
field.  Is less network traffic desirable?

> +
> +    unless an error condition has occurred.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MUST then either close the
> +    connection, or send *length of parameter data* bytes of data
> +    (which MAY be invalid).
> +
> +    The type of information required by the client is passed to server in the
> +    command flags field. If the server does not implement requested type or
> +    have no means to express it, it MUST NOT return an error, but instead MUST
> +    return a single LBA status descriptor with *offset* and *length* equal to
> +    the *offset* and *length* from request, and *status* set to `0`.

[1] So in what situations will we ever return an array of zero status
fields? On an error?  Should we make it clear that the server MUST NOT
return 0 status fields except on an error?

Do we want to require that the server MUST reply with enough extents to
sum up to the length of the client's request, or are we permitting a
short reply?

> +
> +    The following request types are currently defined for the command:
> +
> +    1. Block provisioning state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return

Here, you spell it '0x0'; in the previous patch, you spelled the command
flag as 'bit 1' - does that mean that Block provisioning state is the
default when no command flags are sent?  What if we later add other
flags but still want block provisioning mode?  Wouldn't it be better to
state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is
clear, without regards to the state of the other flags, than allowing
this mode only when all 16 flags are zero?

For example, should we allow a flag that states that the client is
interested only in allocated/unallocated, and that the server may
coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED
for fewer extents reported and thus potentially less network traffic?

> +    the provisioning state of the device. The following provisionnig states

s/provisionnig/provisioning/

> +    are defined for the command:
> +
> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> +        and contains zeroes;
> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> +        block device. A client MUST NOT make any assumptions about the
> +        contents of the extent.

Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the
same time, or is that an error on the server?  What do we know about an
extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set?

/me re-reads

Oh, you are using this as the _entire_ 16-bit status value, rather than
as bits 0, 1, and 2 as flags.

But I think you have two binary flags (four possible states) here: it is
quite conceivable to have a server on top of a SCSI device, where we
know that the extent is unallocated in SCSI, but where the server will
guarantee that it reads as all zeroes (possibly because the server
bypasses SCSI on the NBD read commands when it knows SCSI is
unallocated).  That is, if we set this up as two flags:

0x1 - allocated
0x2 - reads as 0

then we can express four states:

0x0 - LBA extent not present, client MUST NOT make assumptions about
contents, and reads should not be attempted
0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
0x2 - LBA extent not present, but client can treat the extent as zeroes
and reads will succeed
0x3 - LBA extent present, client can treat the extent as zeroes and
reads will succeed

Actually, we should probably pick the bit values such that 0x0 means
allocated and readable, as the most common state, since we also require
that the command returns a single extent over the entire length with
status 0 if the server can't provide any further details.

I'm not familiar enough with the SCSI "GET LBA STATUS" command to know
if your command sanely matches to that one.

> +
> +    2. Block dirtiness state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return

This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous
patch.  Is that okay?  Or should we use a different bit value, on the
grounds that some future extension may want to use both flags
orthogonally within the same (possibly new) command?  Again, consistency
in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice.

> +    the dirtiness status of the device. The following dirtiness states
> +    are defined for the command:
> +
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.

Again, it looks like you are using these as two entire 16-bit status
values, rather than as two separate bits (1<<0 and 1<<1).  Another way
of expressing it is that a single boolean flag is defined, if clear, the
extent is dirty, if set, the extent is clean.

> +
> +    Generic NBD client implementation without knowledge of a particular NBD
> +    server operation MUST NOT make any assumption on the meaning of the
> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> +
> +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request
> +including one or more sectors beyond the size of the device.

As mentioned in the previous mail, should we also recommend an EINVAL if
NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the
client sends the command anyways; and/or a requirement that the client
must not issue the command in that case?

> +
>  ## About this file
>  
>  This file tries to document the NBD protocol as it is currently
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
  2016-03-23 15:14   ` Eric Blake
@ 2016-03-23 17:21   ` Wouter Verhelst
  2016-03-24  7:57     ` Pavel Borzenkov
  1 sibling, 1 reply; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-23 17:21 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi, Paolo Bonzini

Hi,

On Wed, Mar 23, 2016 at 05:16:01PM +0300, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 463ef8a..cda213c 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands
>  
>  ##### Client flags
>  
> @@ -471,6 +473,10 @@ The following request types exist:
>      about the contents of the export affected by this command, until
>      overwriting it again with `NBD_CMD_WRITE`.
>  
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> +
>  * Other requests
>  
>      Some third-party implementations may require additional protocol
> @@ -594,6 +600,44 @@ option reply type.
>        message if they do not also send it as a reply to the
>        `NBD_OPT_SELECT` message.
>  
> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +denying the sparseness.
> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> +one new command with two command flags.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> +    in the command flags field. If this flag was set, the server MUST NOT send
> +    the reply until it has ensured that the newly-zeroed data has reached
> +    permanent storage.
> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> +    command flags field, the server MAY use trimming to zero out the area,
> +    but it MUST ensure that the data reads back as zero.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MAY then close the connection.
> +
> +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> +including one or more sectors beyond the size of the device. It SHOULD
> +return `EPERM` if it receives a write zeroes request on a read-only export.
> +

So, the semantics of your proposed WRITE_ZEROES are exactly the same as
the WRITE command, except that no payload is sent?

In that case, I think it's slightly more sensible if we don't add a new
command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
command instead. After all, they're going to be (mostly) the same
anyway.

Did you propose a separate command for a specific reason that I'm
missing (or forgetting), or is that just an oversight?

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-23 15:14   ` Eric Blake
@ 2016-03-23 17:40     ` Wouter Verhelst
  2016-03-24  7:16     ` [Qemu-devel] " Pavel Borzenkov
  1 sibling, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-23 17:40 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

On Wed, Mar 23, 2016 at 09:14:10AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
[...]
> [1] Oh, you ARE adding this to the "Experimental extensions" section of
> the document, so your wording IS correct.  I guess the idea is that we
> write up the documentation in the experimental section, tweak qemu to
> implement it both as NBD client and as NBD server (since qemu has code
> that can serve in both positions), see how well it worked, and THEN do a
> followup patch to proto.md to move the text into the non-experimental
> section, along with any tweaks learned during the qemu patches.

That's the way I accept specification changes now, yes: define an
experimental spec first, wait for a first implementation (whether that's
qemu or the reference implementation is not as relevant), and only then
move it to the normative part of the spec.

> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A write request with no payload. Length and offset define the location
> > +    and amount of data to be zeroed.
> > +
> > +    The server MUST zero out the data on disk, and then send the reply
> > +    message. The server MAY send the reply message before the data has
> > +    reached permanent storage.
> > +
> > +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> > +    in the command flags field. If this flag was set, the server MUST NOT send
> > +    the reply until it has ensured that the newly-zeroed data has reached
> > +    permanent storage.
> 
> Do we want to add:
> 
> The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
> the export flags did not include NBD_FLAG_SEND_FUA.

Well, if the export flags didn't include NBD_FLAG_SEND_FUA, that means
the server may not implement that flag, and hence the server may not
know what to do with it.

It's probably a good idea to send a particular error (EINVAL sounds
good, and is what the current reference nbd-server implementation
sends[1]) for commands or flags the server implementation doesn't know
about, but the spec doesn't currently say that explicitly.

[1] actually, the current server sends EINVAL for commands the server
doesn't know about, but silently ignores flags the server doesn't know
about.

> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> > +    command flags field, the server MAY use trimming to zero out the area,
> > +    but it MUST ensure that the data reads back as zero.
> 
> Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
> but never defined with a fixed value.  Your text above defines it to
> 'bit 0' in the command flags field - is that correct?

(checks code) Yes, it is.

> If so, should we add a section to the document that lists the bit
> values of all supported command flags?

Currently that's only the FUA flag, but yes, I'll do so.

> Meanwhile, your proposed text hardcodes NBD_CMD_FLAG_MAY_TRIM to 'bit
> 1'; that might also be worth adding into the same new section of the
> document documenting all supported command flags.
> 
> Do we want to require that the server has negotiated the
> NBD_FLAG_SEND_TRIM export flag prior to allowing the
> NBD_CMD_FLAG_MAY_TRIM flag in this command?
> 
> Possibly-related bug in the existing spec: Should the text of the
> NBD_CMD_TRIM (4) command mention the desired server behavior if the
> client sends NBD_CMD_TRIM even though NBD_FLAG_SEND_TRIM was not
> negotiated in the export flags?

Not sure that's a good idea. The export flag is there to say "I support
it". If the server doesn't support a feature, the client shouldn't use
that feature. If the client *does* use it, it ends up in undefined
behaviour-land (because the server may be from whatever source).

> Similarly, should the text of the NBD_CMD_FLUSH () command mention the
> desired server behavior if the client sends NBD_CMD_FLUSH even though
> NBD_FLAG_SEND_FLUSH was not negotiated in the export flags?  At least
> NBD_CMD_FLUSH recommended that the client must not send the command if
> the feature was not negotiated.

I suppose it would make sense to make the section on NBD_CMD_TRIM have a
similar notice, yes.

> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MAY then close the connection.
> > +
> > +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> > +including one or more sectors beyond the size of the device. It SHOULD
> > +return `EPERM` if it receives a write zeroes request on a read-only export.
> 
> Should we add a paragraph stating that the client MUST NOT send
> NBD_CMD_WRITE_ZEROES if NBD_FLAG_SEND_WRITE_ZEROES was not negotiated in
> the export options?  Similarly, should we suggest that the server reply
> with EINVAL if it knows about the command, but the client issues the
> command in spite of not negotiating it?  Should we enhance the
> documentation in the "Error values" heading to mention that EINVAL
> should be used in general for any client command not expected by the server?

Yes, probably. As above, the current nbd-server implementation does so,
and I'll make that explicit.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
  2016-03-23 16:27   ` Eric Blake
@ 2016-03-23 17:58   ` Wouter Verhelst
  2016-03-23 18:14     ` Kevin Wolf
                       ` (2 more replies)
  2016-03-24 22:08   ` [Qemu-devel] " Eric Blake
                     ` (2 subsequent siblings)
  4 siblings, 3 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-23 17:58 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 6797 bytes --]

On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index cda213c..fff515d 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
>    `NBD_CMD_TRIM` commands
>  - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
>    supports `NBD_CMD_WRITE_ZEROES` commands
> +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> +  supports `NBD_CMD_GET_LBA_STATUS` commands
>  
>  ##### Client flags
>  
> @@ -477,6 +479,10 @@ The following request types exist:
>  
>      Defined by the experimental `WRITE_ZEROES` extension; see below.
>  
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    Defined by the experimental `GET_LBA_STATUS` extension; see below.
> +
>  * Other requests
>  
>      Some third-party implementations may require additional protocol
> @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
>  including one or more sectors beyond the size of the device. It SHOULD
>  return `EPERM` if it receives a write zeroes request on a read-only export.
>  
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:

As Eric noted, please expand LBA at least once.

> +      - 32 bits, length of parameter data that follow (unsigned)
> +      - zero or more LBA status descriptors, each having the following
> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)
> +
> +    unless an error condition has occurred.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MUST then either close the
> +    connection, or send *length of parameter data* bytes of data
> +    (which MAY be invalid).
> +
> +    The type of information required by the client is passed to server in the
> +    command flags field. If the server does not implement requested type or
> +    have no means to express it, it MUST NOT return an error, but instead MUST
> +    return a single LBA status descriptor with *offset* and *length* equal to
> +    the *offset* and *length* from request, and *status* set to `0`.
> +
> +    The following request types are currently defined for the command:
> +
> +    1. Block provisioning state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return

I prefer to have a non-zero flag value.

> +    the provisioning state of the device. The following provisionnig states
> +    are defined for the command:
> +
> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> +        and contains zeroes;

Presumably this should be "contains only zeroes"?

Also, this may end up being a fairly expensive call for the server to
process. Is it really useful?

> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> +        block device. A client MUST NOT make any assumptions about the
> +        contents of the extent.
> +
> +    2. Block dirtiness state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> +    the dirtiness status of the device. The following dirtiness states
> +    are defined for the command:
> +
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> +
> +    Generic NBD client implementation without knowledge of a particular NBD
> +    server operation MUST NOT make any assumption on the meaning of the
> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.

That makes it a useless call. A server can read /dev/random to decide
whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
this spec.

Either the spec should define what it means for a block to be in a dirty
state, or it should not talk about it.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-03-23 18:14     ` Kevin Wolf
  2016-03-24  8:25       ` Pavel Borzenkov
  2016-03-24  8:43     ` Pavel Borzenkov
  2016-03-24 11:55     ` Paolo Bonzini
  2 siblings, 1 reply; 54+ messages in thread
From: Kevin Wolf @ 2016-03-23 18:14 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Denis V. Lunev, qemu-devel, Stefan Hajnoczi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2416 bytes --]

Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> I prefer to have a non-zero flag value.
> 
> > +    the provisioning state of the device. The following provisionnig states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> 
> Presumably this should be "contains only zeroes"?
> 
> Also, this may end up being a fairly expensive call for the server to
> process. Is it really useful?

I think we need to make clear that this is meant as an optimisation and
it's always a valid option for a server to return NBD_STATE_ALLOCATED
even if the contents is zeroed.

It is definitely useful if the server has a means to efficiently find
out the allocation status (e.g. SEEK_HOLE). In that case the client may
be able to avoid reading the block and sending it over the network, or
when making a copy, it could use it to keep the target file sparse. If
the client can't take advantage, we didn't have much overhead, so it's
fine.

It's less clear in a case where the server needs to read in the block
and scan its contents. It could still be a net win if the next thing the
client does is retrieving the block: The server would still have the
cost of reading the block, but it wouldn't be transferred. But when the
client doesn't follow up with a read, it's quite a bit of overhead that
we had for no benefit. Returning NBD_STATE_ALLOCATED might be more
appropriate in this case than scanning for zeros.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-23 15:14   ` Eric Blake
  2016-03-23 17:40     ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-03-24  7:16     ` Pavel Borzenkov
  2016-03-24  7:36       ` [Qemu-devel] [Nbd] " Wouter Verhelst
  1 sibling, 1 reply; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24  7:16 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Wouter Verhelst, Denis V. Lunev

On Wed, Mar 23, 2016 at 09:14:10AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > There exist some cases when a client knows that the data it is going to
> > write is all zeroes. Such cases include mirroring or backing up a device
> > implemented by a sparse file.
> > 
> > With current NBD command set, the client has to issue NBD_CMD_WRITE
> > command with zeroed payload and transfer these zero bytes through the
> > wire. The server has to write the data onto disk, effectively denying
> > the sparseness.
> > 
> > To remedy this, the patch adds WRITE_ZEROES extension with one new
> > NBD_CMD_WRITE_ZEROES command.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> > 
> 
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> 
> If this patch goes in to the NBD sources, the extension is not
> experimental any more, right?  Shouldn't the patch be written under the
> assumption that it will be the final form of the text, even though the
> feature is experimental until then? [1]
> 
> > +### `WRITE_ZEROES` extension
> > +
> > +There exist some cases when a client knows that the data it is going to write
> > +is all zeroes. Such cases include mirroring or backing up a device implemented
> > +by a sparse file. With current NBD command set, the client has to issue
> > +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> > +through the wire. The server has to write the data onto disk, effectively
> > +denying the sparseness.
> 
> s/denying/losing/ ?
> 
> > +
> > +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> 
> s/is envisioned/exists/ - again, under the argument that once it is part
> of this document, it is no longer experimental.
> 
> /me goes and reads the actual proto.md file, and light bulb turns on...
> 
> [1] Oh, you ARE adding this to the "Experimental extensions" section of
> the document, so your wording IS correct.  I guess the idea is that we
> write up the documentation in the experimental section, tweak qemu to
> implement it both as NBD client and as NBD server (since qemu has code
> that can serve in both positions), see how well it worked, and THEN do a
> followup patch to proto.md to move the text into the non-experimental
> section, along with any tweaks learned during the qemu patches.

Yes, that is the plan.

> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A write request with no payload. Length and offset define the location
> > +    and amount of data to be zeroed.
> > +
> > +    The server MUST zero out the data on disk, and then send the reply
> > +    message. The server MAY send the reply message before the data has
> > +    reached permanent storage.
> > +
> > +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> > +    in the command flags field. If this flag was set, the server MUST NOT send
> > +    the reply until it has ensured that the newly-zeroed data has reached
> > +    permanent storage.
> 
> Do we want to add:
> 
> The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
> the export flags did not include NBD_FLAG_SEND_FUA.

Looks like a good idea. Needs to be added here and in NBD_CMD_WRITE
command description.

> 
> > +
> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> > +    command flags field, the server MAY use trimming to zero out the area,
> > +    but it MUST ensure that the data reads back as zero.
> 
> Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
> but never defined with a fixed value.  Your text above defines it to
> 'bit 0' in the command flags field - is that correct?  If so, should we
> add a section to the document that lists the bit values of all supported
> command flags?

Yes, this is correct (at least this is how it's used in nbd and
QEMU).

My understanding of the NBD specification is that the 'command flags'
field contains per-command flags, thus their values for each command may
coincide. So I think it's better to describe flags supported by each
command in respective command's section, instead of in one global
"Command flags" section.

> 
> Meanwhile, your proposed text hardcodes NBD_CMD_FLAG_MAY_TRIM to 'bit
> 1'; that might also be worth adding into the same new section of the
> document documenting all supported command flags.
> 
> Do we want to require that the server has negotiated the
> NBD_FLAG_SEND_TRIM export flag prior to allowing the
> NBD_CMD_FLAG_MAY_TRIM flag in this command?

I don't think there is a need for such a requirement. Since
NBD_CMD_FLAG_MAY_TRIM flag is merely a suggestion, it can be safely
ignored by the server if it doesn't want to (or can't) use TRIM.

> 
> Possibly-related bug in the existing spec: Should the text of the
> NBD_CMD_TRIM (4) command mention the desired server behavior if the
> client sends NBD_CMD_TRIM even though NBD_FLAG_SEND_TRIM was not
> negotiated in the export flags?  Similarly, should the text of the
> NBD_CMD_FLUSH () command mention the desired server behavior if the
> client sends NBD_CMD_FLUSH even though NBD_FLAG_SEND_FLUSH was not
> negotiated in the export flags?  At least NBD_CMD_FLUSH recommended that
> the client must not send the command if the feature was not negotiated.

Agree, better to mention this.

> 
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MAY then close the connection.
> > +
> > +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> > +including one or more sectors beyond the size of the device. It SHOULD
> > +return `EPERM` if it receives a write zeroes request on a read-only export.
> 
> Should we add a paragraph stating that the client MUST NOT send
> NBD_CMD_WRITE_ZEROES if NBD_FLAG_SEND_WRITE_ZEROES was not negotiated in
> the export options?  Similarly, should we suggest that the server reply
> with EINVAL if it knows about the command, but the client issues the
> command in spite of not negotiating it?  Should we enhance the
> documentation in the "Error values" heading to mention that EINVAL
> should be used in general for any client command not expected by the server?

Agree.

So if no one objects, I'll send a patch correcting current spec
ambiguities and than patches with new proposed commands.

> 
> > +
> >  ## About this file
> >  
> >  This file tries to document the NBD protocol as it is currently
> > 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-24  7:16     ` [Qemu-devel] " Pavel Borzenkov
@ 2016-03-24  7:36       ` Wouter Verhelst
  0 siblings, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24  7:36 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev, Paolo Bonzini

On Thu, Mar 24, 2016 at 10:16:19AM +0300, Pavel Borzenkov wrote:
> So if no one objects, I'll send a patch correcting current spec
> ambiguities and than patches with new proposed commands.

Yes please. Thank you.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-23 17:21   ` Wouter Verhelst
@ 2016-03-24  7:57     ` Pavel Borzenkov
  2016-03-24  8:26       ` Wouter Verhelst
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24  7:57 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

On Wed, Mar 23, 2016 at 06:21:16PM +0100, Wouter Verhelst wrote:
> Hi,
> 
> On Wed, Mar 23, 2016 at 05:16:01PM +0300, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > There exist some cases when a client knows that the data it is going to
> > write is all zeroes. Such cases include mirroring or backing up a device
> > implemented by a sparse file.
> > 
> > With current NBD command set, the client has to issue NBD_CMD_WRITE
> > command with zeroed payload and transfer these zero bytes through the
> > wire. The server has to write the data onto disk, effectively denying
> > the sparseness.
> > 
> > To remedy this, the patch adds WRITE_ZEROES extension with one new
> > NBD_CMD_WRITE_ZEROES command.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index 463ef8a..cda213c 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -241,6 +241,8 @@ immediately after the global flags field in oldstyle negotiation:
> >    schedule I/O accesses as for a rotational medium
> >  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
> >    `NBD_CMD_TRIM` commands
> > +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> > +  supports `NBD_CMD_WRITE_ZEROES` commands
> >  
> >  ##### Client flags
> >  
> > @@ -471,6 +473,10 @@ The following request types exist:
> >      about the contents of the export affected by this command, until
> >      overwriting it again with `NBD_CMD_WRITE`.
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> > +
> >  * Other requests
> >  
> >      Some third-party implementations may require additional protocol
> > @@ -594,6 +600,44 @@ option reply type.
> >        message if they do not also send it as a reply to the
> >        `NBD_OPT_SELECT` message.
> >  
> > +### `WRITE_ZEROES` extension
> > +
> > +There exist some cases when a client knows that the data it is going to write
> > +is all zeroes. Such cases include mirroring or backing up a device implemented
> > +by a sparse file. With current NBD command set, the client has to issue
> > +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> > +through the wire. The server has to write the data onto disk, effectively
> > +denying the sparseness.
> > +
> > +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +    A write request with no payload. Length and offset define the location
> > +    and amount of data to be zeroed.
> > +
> > +    The server MUST zero out the data on disk, and then send the reply
> > +    message. The server MAY send the reply message before the data has
> > +    reached permanent storage.
> > +
> > +    If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +    export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` (bit 0)
> > +    in the command flags field. If this flag was set, the server MUST NOT send
> > +    the reply until it has ensured that the newly-zeroed data has reached
> > +    permanent storage.
> > +
> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in the
> > +    command flags field, the server MAY use trimming to zero out the area,
> > +    but it MUST ensure that the data reads back as zero.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MAY then close the connection.
> > +
> > +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> > +including one or more sectors beyond the size of the device. It SHOULD
> > +return `EPERM` if it receives a write zeroes request on a read-only export.
> > +
> 
> So, the semantics of your proposed WRITE_ZEROES are exactly the same as
> the WRITE command, except that no payload is sent?
> 
> In that case, I think it's slightly more sensible if we don't add a new
> command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
> command instead. After all, they're going to be (mostly) the same
> anyway.
> 
> Did you propose a separate command for a specific reason that I'm
> missing (or forgetting), or is that just an oversight?

No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
spec and implementations nicely. So I'll rewrite the extension and add
the flag instead of the whole command.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12
> 
> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 18:14     ` Kevin Wolf
@ 2016-03-24  8:25       ` Pavel Borzenkov
  2016-03-24  8:41         ` Wouter Verhelst
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24  8:25 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: nbd-general, Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Wouter Verhelst

On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > +    The type of information required by the client is passed to server in the
> > > +    command flags field. If the server does not implement requested type or
> > > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > > +    return a single LBA status descriptor with *offset* and *length* equal to
> > > +    the *offset* and *length* from request, and *status* set to `0`.
> > > +
> > > +    The following request types are currently defined for the command:
> > > +
> > > +    1. Block provisioning state
> > > +
> > > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> > 
> > I prefer to have a non-zero flag value.
> > 
> > > +    the provisioning state of the device. The following provisionnig states
> > > +    are defined for the command:
> > > +
> > > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > > +        and contains zeroes;
> > 
> > Presumably this should be "contains only zeroes"?
> > 
> > Also, this may end up being a fairly expensive call for the server to
> > process. Is it really useful?
> 
> I think we need to make clear that this is meant as an optimisation and
> it's always a valid option for a server to return NBD_STATE_ALLOCATED
> even if the contents is zeroed.
> 
> It is definitely useful if the server has a means to efficiently find
> out the allocation status (e.g. SEEK_HOLE). In that case the client may
> be able to avoid reading the block and sending it over the network, or
> when making a copy, it could use it to keep the target file sparse. If
> the client can't take advantage, we didn't have much overhead, so it's
> fine.

Yes, that was the idea. I'll add a note that the server may return
NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
efficiently differentiate allocated blocks with zeroes from allocated
blocks with non-zeroed content.

> 
> It's less clear in a case where the server needs to read in the block
> and scan its contents. It could still be a net win if the next thing the
> client does is retrieving the block: The server would still have the
> cost of reading the block, but it wouldn't be transferred. But when the
> client doesn't follow up with a read, it's quite a bit of overhead that
> we had for no benefit. Returning NBD_STATE_ALLOCATED might be more
> appropriate in this case than scanning for zeros.
> 
> Kevin



> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140

> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-24  7:57     ` Pavel Borzenkov
@ 2016-03-24  8:26       ` Wouter Verhelst
  2016-03-24 11:35         ` Pavel Borzenkov
                           ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24  8:26 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

On Thu, Mar 24, 2016 at 10:57:06AM +0300, Pavel Borzenkov wrote:
> On Wed, Mar 23, 2016 at 06:21:16PM +0100, Wouter Verhelst wrote:
> > So, the semantics of your proposed WRITE_ZEROES are exactly the same as
> > the WRITE command, except that no payload is sent?
> > 
> > In that case, I think it's slightly more sensible if we don't add a new
> > command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
> > command instead. After all, they're going to be (mostly) the same
> > anyway.
> > 
> > Did you propose a separate command for a specific reason that I'm
> > missing (or forgetting), or is that just an oversight?
> 
> No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
> spec and implementations nicely. So I'll rewrite the extension and add
> the flag instead of the whole command.

Actually, having given this some more thought...

There is at least one server-side implementation of nbd (mine) which
silently ignores flags it doesn't know about. This isn't a problem for
non-critical flags, but it could be a problem for a flag like this. Of
course, a client shouldn't send a flag to a server which that server
hasn't heard of, but mistakes do happen.

Do we want to keep that in mind? If so, we might want to keep it as a
separate command after all.

OTOH, it could be said that silently ignoring unknown messages is a bug.
I should probably just fix my implementation instead.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24  8:25       ` Pavel Borzenkov
@ 2016-03-24  8:41         ` Wouter Verhelst
  2016-03-24 11:36           ` Pavel Borzenkov
  0 siblings, 1 reply; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24  8:41 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: Kevin Wolf, nbd-general, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev, Paolo Bonzini

On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote:
> On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > > +    the provisioning state of the device. The following provisionnig states
> > > > +    are defined for the command:
> > > > +
> > > > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > > > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > > > +        and contains zeroes;
> > > 
> > > Presumably this should be "contains only zeroes"?
> > > 
> > > Also, this may end up being a fairly expensive call for the server to
> > > process. Is it really useful?
> > 
> > I think we need to make clear that this is meant as an optimisation and
> > it's always a valid option for a server to return NBD_STATE_ALLOCATED
> > even if the contents is zeroed.
> > 
> > It is definitely useful if the server has a means to efficiently find
> > out the allocation status (e.g. SEEK_HOLE). In that case the client may
> > be able to avoid reading the block and sending it over the network, or
> > when making a copy, it could use it to keep the target file sparse. If
> > the client can't take advantage, we didn't have much overhead, so it's
> > fine.
> 
> Yes, that was the idea. I'll add a note that the server may return
> NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
> efficiently differentiate allocated blocks with zeroes from allocated
> blocks with non-zeroed content.

Okay, that alleviates my concerns.

In that case it might be useful if the server could say something along
the lines of "I know it's allocated, but I didn't check whether there's
anything non-zero in there"? The client can then decide to do nothing
with that information; but the more useful information is sent along,
the better...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-03-23 18:14     ` Kevin Wolf
@ 2016-03-24  8:43     ` Pavel Borzenkov
  2016-03-24  9:33       ` Wouter Verhelst
  2016-03-24 11:55     ` Paolo Bonzini
  2 siblings, 1 reply; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24  8:43 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote:
> On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index cda213c..fff515d 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
> >    `NBD_CMD_TRIM` commands
> >  - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> >    supports `NBD_CMD_WRITE_ZEROES` commands
> > +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> > +  supports `NBD_CMD_GET_LBA_STATUS` commands
> >  
> >  ##### Client flags
> >  
> > @@ -477,6 +479,10 @@ The following request types exist:
> >  
> >      Defined by the experimental `WRITE_ZEROES` extension; see below.
> >  
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    Defined by the experimental `GET_LBA_STATUS` extension; see below.
> > +
> >  * Other requests
> >  
> >      Some third-party implementations may require additional protocol
> > @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
> >  including one or more sectors beyond the size of the device. It SHOULD
> >  return `EPERM` if it receives a write zeroes request on a read-only export.
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to query
> > +status of a particular LBA range and read only those blocks of data that are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> 
> As Eric noted, please expand LBA at least once.
> 
> > +      - 32 bits, length of parameter data that follow (unsigned)
> > +      - zero or more LBA status descriptors, each having the following
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> > +
> > +    unless an error condition has occurred.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MUST then either close the
> > +    connection, or send *length of parameter data* bytes of data
> > +    (which MAY be invalid).
> > +
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> I prefer to have a non-zero flag value.
> 
> > +    the provisioning state of the device. The following provisionnig states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> 
> Presumably this should be "contains only zeroes"?
> 
> Also, this may end up being a fairly expensive call for the server to
> process. Is it really useful?
> 
> > +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> > +        block device. A client MUST NOT make any assumptions about the
> > +        contents of the extent.
> > +
> > +    2. Block dirtiness state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> > +    the dirtiness status of the device. The following dirtiness states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> > +
> > +    Generic NBD client implementation without knowledge of a particular NBD
> > +    server operation MUST NOT make any assumption on the meaning of the
> > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> 
> That makes it a useless call. A server can read /dev/random to decide
> whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> this spec.
> 
> Either the spec should define what it means for a block to be in a dirty
> state, or it should not talk about it.

What I was trying to say with this sentence is that without knowing what
is currently going on on the server side, the DIRTY state has no
meaning. If we are doing incremental backup DIRTY state might represent blocks
that have changed since last backup. For mirroring - blocks that are yet
to be migrated. And for the same block device different sets of DIRTY
ranges might be returned in response to this command. Basically, the
meaning of DIRTY depends on the context. 

Should I state in the spec, that the meaning of DIRTY state is
implementation-specific? I see that NBD_REP_SERVER already uses this
approach.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12



> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140

> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24  8:43     ` Pavel Borzenkov
@ 2016-03-24  9:33       ` Wouter Verhelst
  2016-03-24 10:32         ` Alex Bligh
  0 siblings, 1 reply; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24  9:33 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

On Thu, Mar 24, 2016 at 11:43:18AM +0300, Pavel Borzenkov wrote:
> On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote:
> > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > +    2. Block dirtiness state
> > > +
> > > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> > > +    the dirtiness status of the device. The following dirtiness states
> > > +    are defined for the command:
> > > +
> > > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> > > +
> > > +    Generic NBD client implementation without knowledge of a particular NBD
> > > +    server operation MUST NOT make any assumption on the meaning of the
> > > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > 
> > That makes it a useless call. A server can read /dev/random to decide
> > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> > this spec.
> > 
> > Either the spec should define what it means for a block to be in a dirty
> > state, or it should not talk about it.
> 
> What I was trying to say with this sentence is that without knowing what
> is currently going on on the server side, the DIRTY state has no
> meaning. If we are doing incremental backup DIRTY state might represent blocks
> that have changed since last backup. For mirroring - blocks that are yet
> to be migrated. And for the same block device different sets of DIRTY
> ranges might be returned in response to this command. Basically, the
> meaning of DIRTY depends on the context. 
> 
> Should I state in the spec, that the meaning of DIRTY state is
> implementation-specific? I see that NBD_REP_SERVER already uses this
> approach.

That's still meaningless without a way for the client to detect what the
server-side implementation actually is. The protocol doesn't have that.

This is okay for NBD_REP_SERVER, because the extra information there is
*also* defined to be "UTF-8 encoded data suitable for direct display to
a human being", which means it just allows the server to pass on some
extra info (e.g., qemu could use it to state whether it's in use by a
live VM, or what the backend storage pool is, etc), but the data there
is not going to be critical.

For this proposed message, however, that is not the case; the whole
point of this part of your proposal seems to be to tell the client
whether some part of the export is "dirty" or not. Now I'm not saying we
need to fully define what it means for a part of the backend to be
"dirty" or not.  It's okay to leave part of the meaning in the dark,
leaving that implementation-defined. But saying "must not make any
assumption" basically means "give me some random metadata, and I'll
throw it away then because there's nothing I can do with it".

Alternatively, we could define more states than just "dirty" and
"clean", and have those be more strictly defined than what your current
proposal says.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24  9:33       ` Wouter Verhelst
@ 2016-03-24 10:32         ` Alex Bligh
  2016-03-24 11:58           ` Paolo Bonzini
  0 siblings, 1 reply; 54+ messages in thread
From: Alex Bligh @ 2016-03-24 10:32 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Alex Bligh, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Denis V. Lunev, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 624 bytes --]


On 24 Mar 2016, at 09:33, Wouter Verhelst <w@uter.be> wrote:

> Now I'm not saying we
> need to fully define what it means for a part of the backend to be
> "dirty" or not.  It's okay to leave part of the meaning in the dark,
> leaving that implementation-defined.

Well, the 3 possible states are:

* unallocated
* zero
* non-zero

So the possible replies are a bitfield of those, with a '1' if it 'might'
be in that state, i.e.

111 = no idea
110 = might be zero or unallocated, but isn't zero
011 = I know it's allocated, but I don't know whether it is zero or not

And '000' is not permitted!

etc.


--
Alex Bligh





[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-24  8:26       ` Wouter Verhelst
@ 2016-03-24 11:35         ` Pavel Borzenkov
  2016-03-24 11:37         ` Paolo Bonzini
  2016-03-24 14:53         ` Eric Blake
  2 siblings, 0 replies; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24 11:35 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

On Thu, Mar 24, 2016 at 09:26:41AM +0100, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 10:57:06AM +0300, Pavel Borzenkov wrote:
> > On Wed, Mar 23, 2016 at 06:21:16PM +0100, Wouter Verhelst wrote:
> > > So, the semantics of your proposed WRITE_ZEROES are exactly the same as
> > > the WRITE command, except that no payload is sent?
> > > 
> > > In that case, I think it's slightly more sensible if we don't add a new
> > > command, but instead just have an NBD_CMD_FLAG_ZEROES added to the WRITE
> > > command instead. After all, they're going to be (mostly) the same
> > > anyway.
> > > 
> > > Did you propose a separate command for a specific reason that I'm
> > > missing (or forgetting), or is that just an oversight?
> > 
> > No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
> > spec and implementations nicely. So I'll rewrite the extension and add
> > the flag instead of the whole command.
> 
> Actually, having given this some more thought...
> 
> There is at least one server-side implementation of nbd (mine) which
> silently ignores flags it doesn't know about. This isn't a problem for
> non-critical flags, but it could be a problem for a flag like this. Of
> course, a client shouldn't send a flag to a server which that server
> hasn't heard of, but mistakes do happen.
> 
> Do we want to keep that in mind? If so, we might want to keep it as a
> separate command after all.
> 
> OTOH, it could be said that silently ignoring unknown messages is a bug.
> I should probably just fix my implementation instead.

Since we are going to fix such ambiguities in the spec, fixing incorrect
implementations looks like the right thing to do.

But old unfixed versions which ignore unknown flags (or broken
implementation which send ZEROES without negotiating it) will still
exist.

So it looks safer to do it as a separate command. In this case we will
at least receive an error code back.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24  8:41         ` Wouter Verhelst
@ 2016-03-24 11:36           ` Pavel Borzenkov
  2016-03-24 12:32             ` Wouter Verhelst
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24 11:36 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: Kevin Wolf, nbd-general, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev, Paolo Bonzini

On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote:
> > On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> > > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > > > +    the provisioning state of the device. The following provisionnig states
> > > > > +    are defined for the command:
> > > > > +
> > > > > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > > > > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > > > > +        and contains zeroes;
> > > > 
> > > > Presumably this should be "contains only zeroes"?
> > > > 
> > > > Also, this may end up being a fairly expensive call for the server to
> > > > process. Is it really useful?
> > > 
> > > I think we need to make clear that this is meant as an optimisation and
> > > it's always a valid option for a server to return NBD_STATE_ALLOCATED
> > > even if the contents is zeroed.
> > > 
> > > It is definitely useful if the server has a means to efficiently find
> > > out the allocation status (e.g. SEEK_HOLE). In that case the client may
> > > be able to avoid reading the block and sending it over the network, or
> > > when making a copy, it could use it to keep the target file sparse. If
> > > the client can't take advantage, we didn't have much overhead, so it's
> > > fine.
> > 
> > Yes, that was the idea. I'll add a note that the server may return
> > NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
> > efficiently differentiate allocated blocks with zeroes from allocated
> > blocks with non-zeroed content.
> 
> Okay, that alleviates my concerns.
> 
> In that case it might be useful if the server could say something along
> the lines of "I know it's allocated, but I didn't check whether there's
> anything non-zero in there"? The client can then decide to do nothing
> with that information; but the more useful information is sent along,
> the better...

Doesn't allocated state mean exactly this? E.g. it is allocated and I
have no idea what the content is.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-24  8:26       ` Wouter Verhelst
  2016-03-24 11:35         ` Pavel Borzenkov
@ 2016-03-24 11:37         ` Paolo Bonzini
  2016-03-24 12:31           ` Wouter Verhelst
  2016-03-24 14:53         ` Eric Blake
  2 siblings, 1 reply; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-24 11:37 UTC (permalink / raw)
  To: Wouter Verhelst, Pavel Borzenkov
  Cc: nbd-general, Denis V. Lunev, Stefan Hajnoczi, qemu-devel, Kevin Wolf



On 24/03/2016 09:26, Wouter Verhelst wrote:
>> > 
>> > No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
>> > spec and implementations nicely. So I'll rewrite the extension and add
>> > the flag instead of the whole command.
> Actually, having given this some more thought...
> 
> There is at least one server-side implementation of nbd (mine) which
> silently ignores flags it doesn't know about. This isn't a problem for
> non-critical flags, but it could be a problem for a flag like this. Of
> course, a client shouldn't send a flag to a server which that server
> hasn't heard of, but mistakes do happen.
> 
> Do we want to keep that in mind? If so, we might want to keep it as a
> separate command after all.
> 
> OTOH, it could be said that silently ignoring unknown messages is a bug.
> I should probably just fix my implementation instead.

Even if it is a bug, it does suggest that the payload format should not
be changed by flags.  For example ignoring flags is a bug for an NBD
server, but not for a Wireshark protocol dissector.

Paolo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-03-23 18:14     ` Kevin Wolf
  2016-03-24  8:43     ` Pavel Borzenkov
@ 2016-03-24 11:55     ` Paolo Bonzini
  2016-03-24 12:43       ` Wouter Verhelst
  2016-03-24 15:25       ` Eric Blake
  2 siblings, 2 replies; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-24 11:55 UTC (permalink / raw)
  To: Wouter Verhelst, Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi



On 23/03/2016 18:58, Wouter Verhelst wrote:
>> +To provide such class of information, `GET_LBA_STATUS` extension adds new
>> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
>> +their respective states.
>> +
>> +* `NBD_CMD_GET_LBA_STATUS` (7)
>> +
>> +    An LBA range status query request. Length and offset define the range
>> +    of interest. The server MUST reply with a reply header, followed
>> +    immediately by the following data:
> 
> As Eric noted, please expand LBA at least once.

Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

>> +      - 32 bits, length of parameter data that follow (unsigned)
>> +      - zero or more LBA status descriptors, each having the following
>> +        structure:
>> +
>> +        * 64 bits, offset (unsigned)
>> +        * 32 bits, length (unsigned)
>> +        * 16 bits, status (unsigned)
>> +
>> +    unless an error condition has occurred.
>> +

Can we just return one descriptor?  That would simplify the protocol a bit.

>> +    the provisioning state of the device. The following provisionnig states
>> +    are defined for the command:
>> +
>> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
>> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
>> +        and contains zeroes;
> 
> Presumably this should be "contains only zeroes"?

Yes, or "reads as zeroes".

> Also, this may end up being a fairly expensive call for the server to
> process. Is it really useful?

It's always okay for the server to omit NBD_STATE_ZERO, but it can be
useful if the state is known for some reason.  For example, file system
holes are always zero, but unallocated blocks on a block device are not
always zero.

However, let's make these bits, so that

NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
NBD_STATE_ZERO (0x2), LBA extent will read as zeroes

and you can have NBD_STATE_ALLOCATED|NBD_STATE_ZERO.  File systems do
have the concept of unwritten extents which would be represented like
that.  The API to access the information (the FIEMAP ioctl) is broken,
but perhaps in the future a non-broken API could be added---for example
SEEK_ZERO and SEEK_NONZERO values for lseek's "whence" argument.

>> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
>> +        block device. A client MUST NOT make any assumptions about the
>> +        contents of the extent.
>> +
>> +    2. Block dirtiness state
>> +
>> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
>> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
>> +    the dirtiness status of the device. The following dirtiness states
>> +    are defined for the command:
>> +
>> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
>> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
>> +
>> +    Generic NBD client implementation without knowledge of a particular NBD
>> +    server operation MUST NOT make any assumption on the meaning of the
>> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> 
> That makes it a useless call. A server can read /dev/random to decide
> whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> this spec.
> 
> Either the spec should define what it means for a block to be in a dirty
> state, or it should not talk about it.

Here is my attempt:

    This command is meant to operate in tandem with other (non-NBD)
    channels to the server.  Generally, a "dirty" block is a block that
    has been written to by someone, but the exact meaning of "has been
    written" is left to the implementation.  For example, a virtual
    machine monitor could provide a (non-NBD) command to start tracking
    blocks written by the virtual machine.  A backup client then can
    connect to an NBD server provided by the virtual machine monitor
    and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual
    machine has changed.

    An implementation that doesn't track the "dirtiness" state of blocks
    MUST either fail this command with EINVAL, or mark all blocks as
    dirty in the descriptor that it returns.

Paolo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 10:32         ` Alex Bligh
@ 2016-03-24 11:58           ` Paolo Bonzini
  2016-03-24 12:17             ` Alex Bligh
  0 siblings, 1 reply; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-24 11:58 UTC (permalink / raw)
  To: Alex Bligh, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Denis V. Lunev



On 24/03/2016 11:32, Alex Bligh wrote:
>> > Now I'm not saying we
>> > need to fully define what it means for a part of the backend to be
>> > "dirty" or not.  It's okay to leave part of the meaning in the dark,
>> > leaving that implementation-defined.
> Well, the 3 possible states are:
> 
> * unallocated
> * zero
> * non-zero
> 
> So the possible replies are a bitfield of those, with a '1' if it 'might'
> be in that state, i.e.
> 
> 111 = no idea
> 110 = might be zero or unallocated, but isn't zero
> 011 = I know it's allocated, but I don't know whether it is zero or not

How do you represent "definitely unallocated?"

Paolo

> And '000' is not permitted!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 11:58           ` Paolo Bonzini
@ 2016-03-24 12:17             ` Alex Bligh
  2016-03-24 12:32               ` Paolo Bonzini
  0 siblings, 1 reply; 54+ messages in thread
From: Alex Bligh @ 2016-03-24 12:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Denis V. Lunev, Wouter Verhelst


On 24 Mar 2016, at 11:58, Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> On 24/03/2016 11:32, Alex Bligh wrote:
>>>> Now I'm not saying we
>>>> need to fully define what it means for a part of the backend to be
>>>> "dirty" or not.  It's okay to leave part of the meaning in the dark,
>>>> leaving that implementation-defined.
>> Well, the 3 possible states are:
>> 
>> * unallocated
>> * zero
>> * non-zero
>> 
>> So the possible replies are a bitfield of those, with a '1' if it 'might'
>> be in that state, i.e.
>> 
>> 111 = no idea
>> 110 = might be zero or unallocated, but isn't zero
>> 011 = I know it's allocated, but I don't know whether it is zero or not
> 
> How do you represent "definitely unallocated?"

100 is definitely allocated. The first '1' says it 'might' be in allocated state,
but as we know it's NOT in any of the other states (next two zeroes), by a
process of elimination, it's definitely unallocated. Similarly 010 and 001
are the two other 'definite' states.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 16:27   ` Eric Blake
@ 2016-03-24 12:30     ` Pavel Borzenkov
  2016-03-24 15:04       ` Eric Blake
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24 12:30 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Wouter Verhelst, Denis V. Lunev

On Wed, Mar 23, 2016 at 10:27:00AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> 
> The acronym LBA is not used elsewhere in the NBD spec; should we spell
> it out at least once?
> 
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> 
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to query
> > +status of a particular LBA range and read only those blocks of data that are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> > +
> > +      - 32 bits, length of parameter data that follow (unsigned)
> 
> Is this length the number of descriptors, or the number of bytes
> occupied by those descriptors?  It looks like bytes (that is, with the
> current layout, this field should be a multiple of 14 unless an error is
> returned and the data is bogus).
> 
> I guess 32 bits is sufficient: transmission commands are limited to
> 32-bit length, and we are unlikely to have more than one extent per 512
> bytes of length, so even if we have a header for every single sector
> (worst-case for alternating clean/dirty sectors), as long as the
> smallest granularity of an extent is larger than the extent field, the
> 'length of parameter data' in bytes is still sufficient.
> 
> > +      - zero or more LBA status descriptors, each having the following
> 
> zero or more? [1]
> 
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> 
> An array of these status descriptors is packed on the wire, while the
> typical C layout of an array of these structures will have padding to
> reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
> that clients and servers do not have to pack/unpack between 14 bytes on
> the wire and 16 bytes in efficient array handling, at the expense of
> more network traffic?
> 
> Conversely, it would be possible to send less data over the wire, as
> long as we require that all LBA status descriptors cover consecutive
> offsets.  That is, having the server reply with offsets is pointless,
> since they can be reconstructed on the client by starting with the
> offset in the client's request, then adding length from each status
> field.  Is less network traffic desirable?

I think it's better to explicitly send the start of each LBA extent.
So I'll go with changing 'status' field to 32 bits to avoid
packing/unpacking issues.

> 
> > +
> > +    unless an error condition has occurred.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MUST then either close the
> > +    connection, or send *length of parameter data* bytes of data
> > +    (which MAY be invalid).
> > +
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> 
> [1] So in what situations will we ever return an array of zero status
> fields? On an error?  Should we make it clear that the server MUST NOT
> return 0 status fields except on an error?
> 
> Do we want to require that the server MUST reply with enough extents to
> sum up to the length of the client's request, or are we permitting a
> short reply?

While the "GET LBA STATUS" command in SCSI allows partial reply, I
believe it'd better to keep things simple and require that the server
must either return a list of extents that covers the whole requested
range, or an error.

> 
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> Here, you spell it '0x0'; in the previous patch, you spelled the command
> flag as 'bit 1' - does that mean that Block provisioning state is the
> default when no command flags are sent?  What if we later add other
> flags but still want block provisioning mode?  Wouldn't it be better to
> state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is
> clear, without regards to the state of the other flags, than allowing
> this mode only when all 16 flags are zero?
> 
> For example, should we allow a flag that states that the client is
> interested only in allocated/unallocated, and that the server may
> coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED
> for fewer extents reported and thus potentially less network traffic?

Actually, for this command I treat 'command flags' field not as a set of
flags, but rather as a plain number representing required mode of
operation. Probably, not a good idea as it doesn't match the rest of the
commands.

I went this way because I didn't want to allow clients to request
several modes simultaneously (e.g. provisioning + dirtiness) in the same
request. This makes server side implementation harder.

I think I'll just switch to bits to match the rest of the commands and
will add a note, that server should return EINVAL in case several modes
are requested simultaneously.

> 
> > +    the provisioning state of the device. The following provisionnig states
> 
> s/provisionnig/provisioning/
> 
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> > +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> > +        block device. A client MUST NOT make any assumptions about the
> > +        contents of the extent.
> 
> Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the
> same time, or is that an error on the server?  What do we know about an
> extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set?
> 
> /me re-reads
> 
> Oh, you are using this as the _entire_ 16-bit status value, rather than
> as bits 0, 1, and 2 as flags.

Yes

> 
> But I think you have two binary flags (four possible states) here: it is
> quite conceivable to have a server on top of a SCSI device, where we
> know that the extent is unallocated in SCSI, but where the server will
> guarantee that it reads as all zeroes (possibly because the server
> bypasses SCSI on the NBD read commands when it knows SCSI is
> unallocated).  That is, if we set this up as two flags:
> 
> 0x1 - allocated
> 0x2 - reads as 0
> 
> then we can express four states:
> 
> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> contents, and reads should not be attempted
> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> and reads will succeed
> 0x3 - LBA extent present, client can treat the extent as zeroes and
> reads will succeed

I'm not sure that clients need this level of details. From client's POV
0x2 and 0x3 are the same.

> 
> Actually, we should probably pick the bit values such that 0x0 means
> allocated and readable, as the most common state, since we also require
> that the command returns a single extent over the entire length with
> status 0 if the server can't provide any further details.
> 
> I'm not familiar enough with the SCSI "GET LBA STATUS" command to know
> if your command sanely matches to that one.

Extents returned by "GET LBA STATUS" in SCSI can have three possible
state: MAPPED/ANCHORED/DEALLOCATED. These are not bits and cannot be
combined together.

> 
> > +
> > +    2. Block dirtiness state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> 
> This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous
> patch.  Is that okay?  Or should we use a different bit value, on the
> grounds that some future extension may want to use both flags
> orthogonally within the same (possibly new) command?  Again, consistency
> in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice.

I couldn't think of any use case. But, just in case, I'll change the
value of the flags so they don't overlap.

-- 
Pavel

> 
> > +    the dirtiness status of the device. The following dirtiness states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> 
> Again, it looks like you are using these as two entire 16-bit status
> values, rather than as two separate bits (1<<0 and 1<<1).  Another way
> of expressing it is that a single boolean flag is defined, if clear, the
> extent is dirty, if set, the extent is clean.
> 
> > +
> > +    Generic NBD client implementation without knowledge of a particular NBD
> > +    server operation MUST NOT make any assumption on the meaning of the
> > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > +
> > +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request
> > +including one or more sectors beyond the size of the device.
> 
> As mentioned in the previous mail, should we also recommend an EINVAL if
> NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the
> client sends the command anyways; and/or a requirement that the client
> must not issue the command in that case?
> 
> > +
> >  ## About this file
> >  
> >  This file tries to document the NBD protocol as it is currently
> > 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-24 11:37         ` Paolo Bonzini
@ 2016-03-24 12:31           ` Wouter Verhelst
  0 siblings, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24 12:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan Hajnoczi, Denis V. Lunev

On Thu, Mar 24, 2016 at 12:37:33PM +0100, Paolo Bonzini wrote:
> 
> 
> On 24/03/2016 09:26, Wouter Verhelst wrote:
> >> > 
> >> > No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
> >> > spec and implementations nicely. So I'll rewrite the extension and add
> >> > the flag instead of the whole command.
> > Actually, having given this some more thought...
> > 
> > There is at least one server-side implementation of nbd (mine) which
> > silently ignores flags it doesn't know about. This isn't a problem for
> > non-critical flags, but it could be a problem for a flag like this. Of
> > course, a client shouldn't send a flag to a server which that server
> > hasn't heard of, but mistakes do happen.
> > 
> > Do we want to keep that in mind? If so, we might want to keep it as a
> > separate command after all.
> > 
> > OTOH, it could be said that silently ignoring unknown messages is a bug.
> > I should probably just fix my implementation instead.
> 
> Even if it is a bug, it does suggest that the payload format should not
> be changed by flags.  For example ignoring flags is a bug for an NBD
> server, but not for a Wireshark protocol dissector.

Agreed. Let's make this a different command then, instead.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 12:17             ` Alex Bligh
@ 2016-03-24 12:32               ` Paolo Bonzini
  2016-03-24 13:31                 ` Alex Bligh
  0 siblings, 1 reply; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-24 12:32 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Denis V. Lunev, Wouter Verhelst



On 24/03/2016 13:17, Alex Bligh wrote:
>>> >> * unallocated
>>> >> * zero
>>> >> * non-zero
>>> >> 
>>> >> So the possible replies are a bitfield of those, with a '1' if it 'might'
>>> >> be in that state, i.e.
>>> >> 
>>> >> 111 = no idea
>>> >> 110 = might be zero or unallocated, but isn't zero
>>> >> 011 = I know it's allocated, but I don't know whether it is zero or not
>> > 
>> > How do you represent "definitely unallocated?"
> 100 is definitely allocated. The first '1' says it 'might' be in allocated state,
> but as we know it's NOT in any of the other states (next two zeroes), by a
> process of elimination, it's definitely unallocated. Similarly 010 and 001
> are the two other 'definite' states.

An unallocated block can still be definitely zero.

Paolo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 11:36           ` Pavel Borzenkov
@ 2016-03-24 12:32             ` Wouter Verhelst
  0 siblings, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24 12:32 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: Kevin Wolf, nbd-general, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev, Paolo Bonzini

On Thu, Mar 24, 2016 at 02:36:52PM +0300, Pavel Borzenkov wrote:
> On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote:
> > Okay, that alleviates my concerns.
> > 
> > In that case it might be useful if the server could say something along
> > the lines of "I know it's allocated, but I didn't check whether there's
> > anything non-zero in there"? The client can then decide to do nothing
> > with that information; but the more useful information is sent along,
> > the better...
> 
> Doesn't allocated state mean exactly this? E.g. it is allocated and I
> have no idea what the content is.

I suppose you're right.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 11:55     ` Paolo Bonzini
@ 2016-03-24 12:43       ` Wouter Verhelst
  2016-03-24 15:25       ` Eric Blake
  1 sibling, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24 12:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Denis V. Lunev, Stefan Hajnoczi, qemu-devel, Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 4593 bytes --]

Hi Paolo,

On Thu, Mar 24, 2016 at 12:55:51PM +0100, Paolo Bonzini wrote:
> 
> 
> On 23/03/2016 18:58, Wouter Verhelst wrote:
> >> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> >> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> >> +their respective states.
> >> +
> >> +* `NBD_CMD_GET_LBA_STATUS` (7)
> >> +
> >> +    An LBA range status query request. Length and offset define the range
> >> +    of interest. The server MUST reply with a reply header, followed
> >> +    immediately by the following data:
> > 
> > As Eric noted, please expand LBA at least once.
> 
> Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

That works too :-)

[...]
> > Also, this may end up being a fairly expensive call for the server to
> > process. Is it really useful?
> 
> It's always okay for the server to omit NBD_STATE_ZERO, but it can be
> useful if the state is known for some reason.  For example, file system
> holes are always zero, but unallocated blocks on a block device are not
> always zero.
> 
> However, let's make these bits, so that
> 
> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> 
> and you can have NBD_STATE_ALLOCATED|NBD_STATE_ZERO.  File systems do
> have the concept of unwritten extents which would be represented like
> that.  The API to access the information (the FIEMAP ioctl) is broken,
> but perhaps in the future a non-broken API could be added---for example
> SEEK_ZERO and SEEK_NONZERO values for lseek's "whence" argument.

Okay, that works for me.

> >> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> >> +        block device. A client MUST NOT make any assumptions about the
> >> +        contents of the extent.
> >> +
> >> +    2. Block dirtiness state
> >> +
> >> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> >> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> >> +    the dirtiness status of the device. The following dirtiness states
> >> +    are defined for the command:
> >> +
> >> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> >> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> >> +
> >> +    Generic NBD client implementation without knowledge of a particular NBD
> >> +    server operation MUST NOT make any assumption on the meaning of the
> >> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > 
> > That makes it a useless call. A server can read /dev/random to decide
> > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> > this spec.
> > 
> > Either the spec should define what it means for a block to be in a dirty
> > state, or it should not talk about it.
> 
> Here is my attempt:
> 
>     This command is meant to operate in tandem with other (non-NBD)
>     channels to the server.  Generally, a "dirty" block is a block that
>     has been written to by someone, but the exact meaning of "has been
>     written" is left to the implementation.  For example, a virtual
>     machine monitor could provide a (non-NBD) command to start tracking
>     blocks written by the virtual machine.  A backup client then can
>     connect to an NBD server provided by the virtual machine monitor
>     and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual
                                      ^ to
>     machine has changed.
> 
>     An implementation that doesn't track the "dirtiness" state of blocks
>     MUST either fail this command with EINVAL, or mark all blocks as
>     dirty in the descriptor that it returns.

That seems saner, yes -- and I'm starting to understand what the
rationale is for this protocol message :-)

I suppose I could also implement that in nbd-server to send out
information about changed blocks if the copy-on-write option has been
switched on.

It might also be possible to add an in-protocol message to start
tracking changes (e.g., a "create checkpoint" message), but I'm not sure
if that's worth it (and it could massively complicate the NBD state
machine and protocol; not sure whether that's worth it)

At any rate, your suggestion does alleviate my concerns.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 12:32               ` Paolo Bonzini
@ 2016-03-24 13:31                 ` Alex Bligh
  2016-03-24 13:32                   ` Paolo Bonzini
  0 siblings, 1 reply; 54+ messages in thread
From: Alex Bligh @ 2016-03-24 13:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Kevin Wolf, Stefan stefanha@redhat. com, qemu-devel,
	Pavel Borzenkov, Alex Bligh, Denis V. Lunev, Wouter Verhelst


On 24 Mar 2016, at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 24/03/2016 13:17, Alex Bligh wrote:
>>>>>> * unallocated
>>>>>> * zero
>>>>>> * non-zero
>>>>>> 
>>>>>> So the possible replies are a bitfield of those, with a '1' if it 'might'
>>>>>> be in that state, i.e.
>>>>>> 
>>>>>> 111 = no idea
>>>>>> 110 = might be zero or unallocated, but isn't zero
>>>>>> 011 = I know it's allocated, but I don't know whether it is zero or not
>>>> 
>>>> How do you represent "definitely unallocated?"
>> 100 is definitely allocated. The first '1' says it 'might' be in allocated state,
>> but as we know it's NOT in any of the other states (next two zeroes), by a
>> process of elimination, it's definitely unallocated. Similarly 010 and 001
>> are the two other 'definite' states.
> 
> An unallocated block can still be definitely zero.

Sorry, I should have been clearer on the states:

bits
210

1-- Unallocated, and hence reads as zero
-1- Allocated, and reads as zero
--1 Allocated, and reads as non-zero

So 100 means 'definitely unallocated, will read as zero'.

If you are saying that there is also a state called 'Unallocated, but reads
as non-zero', that could be handled by adding a fourth bit. Same idea. One
would presume this would only ever be set in conjunction with bit 2.

My point in general was to represent all the possible states and let the client
determine what it wants to do with the information.

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 13:31                 ` Alex Bligh
@ 2016-03-24 13:32                   ` Paolo Bonzini
  0 siblings, 0 replies; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-24 13:32 UTC (permalink / raw)
  To: Alex Bligh
  Cc: nbd-general, Kevin Wolf, qemu-devel, Pavel Borzenkov,
	Stefan stefanha@redhat. com, Denis V. Lunev, Wouter Verhelst



On 24/03/2016 14:31, Alex Bligh wrote:
> Sorry, I should have been clearer on the states:
> 
> bits
> 210
> 
> 1-- Unallocated, and hence reads as zero
> -1- Allocated, and reads as zero
> --1 Allocated, and reads as non-zero
> 
> So 100 means 'definitely unallocated, will read as zero'.
> 
> If you are saying that there is also a state called 'Unallocated, but reads
> as non-zero',

Yes.

> that could be handled by adding a fourth bit. Same idea. One
> would presume this would only ever be set in conjunction with bit 2.
> My point in general was to represent all the possible states and let the client
> determine what it wants to do with the information.

This seems overengineered...

Paolo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension
  2016-03-24  8:26       ` Wouter Verhelst
  2016-03-24 11:35         ` Pavel Borzenkov
  2016-03-24 11:37         ` Paolo Bonzini
@ 2016-03-24 14:53         ` Eric Blake
  2 siblings, 0 replies; 54+ messages in thread
From: Eric Blake @ 2016-03-24 14:53 UTC (permalink / raw)
  To: Wouter Verhelst, Pavel Borzenkov
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]

On 03/24/2016 02:26 AM, Wouter Verhelst wrote:
>> No, there is no specific reason. Looks like NBD_CMD_FLAG_ZEROES fits the
>> spec and implementations nicely. So I'll rewrite the extension and add
>> the flag instead of the whole command.
> 
> Actually, having given this some more thought...
> 
> There is at least one server-side implementation of nbd (mine) which
> silently ignores flags it doesn't know about. This isn't a problem for
> non-critical flags, but it could be a problem for a flag like this. Of
> course, a client shouldn't send a flag to a server which that server
> hasn't heard of, but mistakes do happen.

This is the first time where the presence or absence of a flag would
determine whether the length of the payload should be 0 bytes or len
bytes.  If the server doesn't recognize the flag, then it will try to
read the next length bytes off the wire as the data to write, rather
than as the next command to process.

> 
> Do we want to keep that in mind? If so, we might want to keep it as a
> separate command after all.
> 
> OTOH, it could be said that silently ignoring unknown messages is a bug.
> I should probably just fix my implementation instead.

While I agree that you should fix your implementation, I am strongly in
favor of a new command, so that we can blindly state that the
NBD_CMD_WRITE always sends a payload of length bytes, independent of
flag value.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 12:30     ` Pavel Borzenkov
@ 2016-03-24 15:04       ` Eric Blake
  2016-03-24 16:36         ` Pavel Borzenkov
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Blake @ 2016-03-24 15:04 UTC (permalink / raw)
  To: Pavel Borzenkov
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Wouter Verhelst, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 3724 bytes --]

On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
>> Conversely, it would be possible to send less data over the wire, as
>> long as we require that all LBA status descriptors cover consecutive
>> offsets.  That is, having the server reply with offsets is pointless,
>> since they can be reconstructed on the client by starting with the
>> offset in the client's request, then adding length from each status
>> field.  Is less network traffic desirable?
> 
> I think it's better to explicitly send the start of each LBA extent.

Why? Is the redundancy for something that the client can reconstruct
worth the extra safety at the cost of more traffic?

> So I'll go with changing 'status' field to 32 bits to avoid
> packing/unpacking issues.

You may want to do that even if you eliminate the offset field, so that
you have 8 bytes per struct (instead of 16).

>>
>> Do we want to require that the server MUST reply with enough extents to
>> sum up to the length of the client's request, or are we permitting a
>> short reply?
> 
> While the "GET LBA STATUS" command in SCSI allows partial reply, I
> believe it'd better to keep things simple and require that the server
> must either return a list of extents that covers the whole requested
> range, or an error.

Make sure you specify whether ranges are allowed to overlap, or must be
distinct (I prefer the latter), and whether they must appear in sorted
order (which I also prefer) - once you mandate ordered and
non-overlapping coverage, client-side validation that the server's
answer makes sense is easier (remember, we want the client to detect
man-in-the-middle corruption of the server's reply).


> Actually, for this command I treat 'command flags' field not as a set of
> flags, but rather as a plain number representing required mode of
> operation. Probably, not a good idea as it doesn't match the rest of the
> commands.
> 
> I went this way because I didn't want to allow clients to request
> several modes simultaneously (e.g. provisioning + dirtiness) in the same
> request. This makes server side implementation harder.
> 
> I think I'll just switch to bits to match the rest of the commands and
> will add a note, that server should return EINVAL in case several modes
> are requested simultaneously.

But you don't need two bits.  Just a single bit will do (off for
provisioning mode, on for dirty mode).  So there are no conflicting
modes that can be simultaneously requested, at least in the current
definition of a single valid flag bit.  (Then again, I did make a
suggestion about additional bits, useful only during provisioning mode,
that might be used to state that the client is okay if the server
coalesces extents that differ only in allocation or only in zeroed
content - if we add that bit or two bits, it would be an error to use it
while requesting dirty mode).


>> then we can express four states:
>>
>> 0x0 - LBA extent not present, client MUST NOT make assumptions about
>> contents, and reads should not be attempted
>> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
>> 0x2 - LBA extent not present, but client can treat the extent as zeroes
>> and reads will succeed
>> 0x3 - LBA extent present, client can treat the extent as zeroes and
>> reads will succeed
> 
> I'm not sure that clients need this level of details. From client's POV
> 0x2 and 0x3 are the same.

No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3
differ on whether the client will punch a hole vs. explicitly allocate
zeroes.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 11:55     ` Paolo Bonzini
  2016-03-24 12:43       ` Wouter Verhelst
@ 2016-03-24 15:25       ` Eric Blake
  2016-03-24 15:33         ` Paolo Bonzini
  1 sibling, 1 reply; 54+ messages in thread
From: Eric Blake @ 2016-03-24 15:25 UTC (permalink / raw)
  To: Paolo Bonzini, Wouter Verhelst, Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 4032 bytes --]

On 03/24/2016 05:55 AM, Paolo Bonzini wrote:
>> As Eric noted, please expand LBA at least once.
> 
> Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

Yes, avoiding the term LBA and using BLOCK everywhere also nicely solves
the problem of introducing yet more terminology.

> 
>>> +      - 32 bits, length of parameter data that follow (unsigned)
>>> +      - zero or more LBA status descriptors, each having the following
>>> +        structure:
>>> +
>>> +        * 64 bits, offset (unsigned)
>>> +        * 32 bits, length (unsigned)
>>> +        * 16 bits, status (unsigned)
>>> +
>>> +    unless an error condition has occurred.
>>> +
> 
> Can we just return one descriptor?  That would simplify the protocol a bit.

As in, the return is exactly one descriptor, consisting of:

* 32 bits, length (unsigned): must be > 0, <= the client's length
* 16 bits, status (unsigned): status of that block

Of course, it means more traffic. The nice part about returning an array
of descriptors is that I can learn the status of 1G of the file, even if
the file alternates every 512 bytes between extent status, in just one
client call. But returning only a single descriptor at a time means I'd
have to make 2M client calls to learn the same pattern of allocation.
Fortunately, in the common case, allocation patterns tend to not be that
disjoint.

On the other hand, returning only one descriptor at a time (for possibly
less length than the client requested) may be easier when using
lseek(SEEK_DATA/HOLE) as the mechanism for determining the bounds of
each extent, since the server only has to search once per command,
instead of dynamically construct the entire reply.

I don't have any strong opinions on which would be better, but it is
definitely food for thought.

> 
> However, let's make these bits, so that
> 
> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes

Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
allocated, 1 means not present), so that an overall status of 0 is a
safe default?  (That is, it should always be safe to state a sector is
allocated when it is not, and always safe to state a sector is not known
to read as zeroes even if that happens to be its contents - all that we
lose by reporting this safe default state is that the client will be
unable to optimize for skipping holes).

>> Either the spec should define what it means for a block to be in a dirty
>> state, or it should not talk about it.
> 
> Here is my attempt:
> 
>     This command is meant to operate in tandem with other (non-NBD)
>     channels to the server.  Generally, a "dirty" block is a block that
>     has been written to by someone, but the exact meaning of "has been
>     written" is left to the implementation.  For example, a virtual
>     machine monitor could provide a (non-NBD) command to start tracking
>     blocks written by the virtual machine.  A backup client then can
>     connect to an NBD server provided by the virtual machine monitor
>     and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual

s/only/to only/

>     machine has changed.

s/changed/changed since it started tracking/

> 
>     An implementation that doesn't track the "dirtiness" state of blocks
>     MUST either fail this command with EINVAL, or mark all blocks as
>     dirty in the descriptor that it returns.

Is it feasible to return zero/allocated/dirty status all at the same
time, or do we want to strictly require two different modes of
operation?  That is, if we are returning zero and allocated as two bits,
can we also return a third bit for dirty/clean?  Should we flip the
sense of the bit, where 0 means dirty and 1 means clean, again so that a
server can always return a status of 0 as the safe default?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 15:25       ` Eric Blake
@ 2016-03-24 15:33         ` Paolo Bonzini
  2016-03-24 15:53           ` Wouter Verhelst
  0 siblings, 1 reply; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-24 15:33 UTC (permalink / raw)
  To: Eric Blake, Wouter Verhelst, Denis V. Lunev
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 1465 bytes --]

On 24/03/2016 16:25, Eric Blake wrote:
>> However, let's make these bits, so that
>>
>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> 
> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> allocated, 1 means not present), so that an overall status of 0 is a
> safe default?

Double negations are evil (and don't work the same in all languages), so
I think it's a worse option.

>>     An implementation that doesn't track the "dirtiness" state of blocks
>>     MUST either fail this command with EINVAL, or mark all blocks as
>>     dirty in the descriptor that it returns.
> 
> Is it feasible to return zero/allocated/dirty status all at the same
> time, or do we want to strictly require two different modes of
> operation?

I think we should differentiate them, because it makes sense to support
only one.

In particular, while it is more or less obvious that (in my proposal
above) a trivial implementation must return NBD_STATE_ALLOCATED, it is
quite weird to require a trivial implementation to return
NBD_STATE_ALLOCATED|NBD_STATE_DIRTY.

Paolo

> That is, if we are returning zero and allocated as two bits,
> can we also return a third bit for dirty/clean?  Should we flip the
> sense of the bit, where 0 means dirty and 1 means clean, again so that a
> server can always return a status of 0 as the safe default?
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 15:33         ` Paolo Bonzini
@ 2016-03-24 15:53           ` Wouter Verhelst
  2016-03-24 16:04             ` Eric Blake
  0 siblings, 1 reply; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24 15:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1112 bytes --]

On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> On 24/03/2016 16:25, Eric Blake wrote:
> >> However, let's make these bits, so that
> >>
> >> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> >> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > 
> > Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > allocated, 1 means not present), so that an overall status of 0 is a
> > safe default?
> 
> Double negations are evil (and don't work the same in all languages), so
> I think it's a worse option.

I agree that a bit which says "unallocated" is confusing in that manner,
but that just means we need a better name (one that doesn't contain
"un-" or "not")

I like the idea of having zero be the "sensible" default, although I'm
quite unable to come up with a better name myself.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 15:53           ` Wouter Verhelst
@ 2016-03-24 16:04             ` Eric Blake
  2016-03-24 16:07               ` Kevin Wolf
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Blake @ 2016-03-24 16:04 UTC (permalink / raw)
  To: Wouter Verhelst, Paolo Bonzini
  Cc: nbd-general, Denis V. Lunev, Stefan Hajnoczi, qemu-devel, Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 1204 bytes --]

On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
>> On 24/03/2016 16:25, Eric Blake wrote:
>>>> However, let's make these bits, so that
>>>>
>>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
>>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
>>>
>>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
>>> allocated, 1 means not present), so that an overall status of 0 is a
>>> safe default?
>>
>> Double negations are evil (and don't work the same in all languages), so
>> I think it's a worse option.
> 
> I agree that a bit which says "unallocated" is confusing in that manner,
> but that just means we need a better name (one that doesn't contain
> "un-" or "not")
> 
> I like the idea of having zero be the "sensible" default, although I'm
> quite unable to come up with a better name myself.

NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
matches well that we have NBD_CMD_TRIM for requesting the creation of
such a state.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 16:04             ` Eric Blake
@ 2016-03-24 16:07               ` Kevin Wolf
  2016-03-24 16:47                 ` Wouter Verhelst
  0 siblings, 1 reply; 54+ messages in thread
From: Kevin Wolf @ 2016-03-24 16:07 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Wouter Verhelst, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1233 bytes --]

Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> >> On 24/03/2016 16:25, Eric Blake wrote:
> >>>> However, let's make these bits, so that
> >>>>
> >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> >>>
> >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> >>> allocated, 1 means not present), so that an overall status of 0 is a
> >>> safe default?
> >>
> >> Double negations are evil (and don't work the same in all languages), so
> >> I think it's a worse option.
> > 
> > I agree that a bit which says "unallocated" is confusing in that manner,
> > but that just means we need a better name (one that doesn't contain
> > "un-" or "not")
> > 
> > I like the idea of having zero be the "sensible" default, although I'm
> > quite unable to come up with a better name myself.
> 
> NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> matches well that we have NBD_CMD_TRIM for requesting the creation of
> such a state.

How about NBD_STATE_HOLE?

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 15:04       ` Eric Blake
@ 2016-03-24 16:36         ` Pavel Borzenkov
  0 siblings, 0 replies; 54+ messages in thread
From: Pavel Borzenkov @ 2016-03-24 16:36 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Wouter Verhelst, Denis V. Lunev

On Thu, Mar 24, 2016 at 09:04:07AM -0600, Eric Blake wrote:
> On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
> >> Conversely, it would be possible to send less data over the wire, as
> >> long as we require that all LBA status descriptors cover consecutive
> >> offsets.  That is, having the server reply with offsets is pointless,
> >> since they can be reconstructed on the client by starting with the
> >> offset in the client's request, then adding length from each status
> >> field.  Is less network traffic desirable?
> > 
> > I think it's better to explicitly send the start of each LBA extent.
> 
> Why? Is the redundancy for something that the client can reconstruct
> worth the extra safety at the cost of more traffic?

On second thought, not sending offsets look better. Firstly, they are
really not required for client to process the reply. Secondly, it also
implies that the regions are distinct and in sorted order and makes it
impossible to do otherwise.

-- 
Pavel

> 
> > So I'll go with changing 'status' field to 32 bits to avoid
> > packing/unpacking issues.
> 
> You may want to do that even if you eliminate the offset field, so that
> you have 8 bytes per struct (instead of 16).
> 
> >>
> >> Do we want to require that the server MUST reply with enough extents to
> >> sum up to the length of the client's request, or are we permitting a
> >> short reply?
> > 
> > While the "GET LBA STATUS" command in SCSI allows partial reply, I
> > believe it'd better to keep things simple and require that the server
> > must either return a list of extents that covers the whole requested
> > range, or an error.
> 
> Make sure you specify whether ranges are allowed to overlap, or must be
> distinct (I prefer the latter), and whether they must appear in sorted
> order (which I also prefer) - once you mandate ordered and
> non-overlapping coverage, client-side validation that the server's
> answer makes sense is easier (remember, we want the client to detect
> man-in-the-middle corruption of the server's reply).
> 
> 
> > Actually, for this command I treat 'command flags' field not as a set of
> > flags, but rather as a plain number representing required mode of
> > operation. Probably, not a good idea as it doesn't match the rest of the
> > commands.
> > 
> > I went this way because I didn't want to allow clients to request
> > several modes simultaneously (e.g. provisioning + dirtiness) in the same
> > request. This makes server side implementation harder.
> > 
> > I think I'll just switch to bits to match the rest of the commands and
> > will add a note, that server should return EINVAL in case several modes
> > are requested simultaneously.
> 
> But you don't need two bits.  Just a single bit will do (off for
> provisioning mode, on for dirty mode).  So there are no conflicting
> modes that can be simultaneously requested, at least in the current
> definition of a single valid flag bit.  (Then again, I did make a
> suggestion about additional bits, useful only during provisioning mode,
> that might be used to state that the client is okay if the server
> coalesces extents that differ only in allocation or only in zeroed
> content - if we add that bit or two bits, it would be an error to use it
> while requesting dirty mode).
> 
> 
> >> then we can express four states:
> >>
> >> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> >> contents, and reads should not be attempted
> >> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> >> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> >> and reads will succeed
> >> 0x3 - LBA extent present, client can treat the extent as zeroes and
> >> reads will succeed
> > 
> > I'm not sure that clients need this level of details. From client's POV
> > 0x2 and 0x3 are the same.
> 
> No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3
> differ on whether the client will punch a hole vs. explicitly allocate
> zeroes.
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 16:07               ` Kevin Wolf
@ 2016-03-24 16:47                 ` Wouter Verhelst
  2016-03-29  9:38                   ` Kevin Wolf
  0 siblings, 1 reply; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-24 16:47 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: nbd-general, qemu-devel, Stefan Hajnoczi, Denis V. Lunev, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1965 bytes --]

On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > >> On 24/03/2016 16:25, Eric Blake wrote:
> > >>>> However, let's make these bits, so that
> > >>>>
> > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > >>>
> > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > >>> safe default?
> > >>
> > >> Double negations are evil (and don't work the same in all languages), so
> > >> I think it's a worse option.
> > > 
> > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > but that just means we need a better name (one that doesn't contain
> > > "un-" or "not")
> > > 
> > > I like the idea of having zero be the "sensible" default, although I'm
> > > quite unable to come up with a better name myself.
> > 
> > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > such a state.
> 
> How about NBD_STATE_HOLE?

Both will work, although I like NBD_STATE_TRIM slightly better because
it indeed nicely references NBD_CMD_TRIM.

However, I also think it should then be made clear that issuing
NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
that region if the backend storage format dosn't support that, to avoid
confusion later on.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
  2016-03-23 16:27   ` Eric Blake
  2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-03-24 22:08   ` Eric Blake
  2016-03-25  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-04-04 16:40   ` [Qemu-devel] " Eric Blake
  2016-04-04 20:16   ` Denis V. Lunev
  4 siblings, 1 reply; 54+ messages in thread
From: Eric Blake @ 2016-03-24 22:08 UTC (permalink / raw)
  To: Denis V. Lunev, nbd-general, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Pavel Borzenkov, Stefan Hajnoczi,
	Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 4961 bytes --]

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 

> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:
> +
> +      - 32 bits, length of parameter data that follow (unsigned)
> +      - zero or more LBA status descriptors, each having the following
> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)
> +
> +    unless an error condition has occurred.

To date, only the NBD_CMD_READ command caused the server to send data
after the handle in its reply.  This would be the second command with a
data field in the response, but it is returning a variable amount of
data, not directly tied to the client's length - but at least you did
make it structured so that the client knows how much to read.  However,
your patch is incomplete; you'll need to edit the "Transmission" section
of the document to mention the rules on the server sending data, as the
existing text would now be wrong:

> The server replies with:
> 
> S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> S: 32 bits, error
> S: 64 bits, handle
> S: (length bytes of data if the request is of type NBD_CMD_READ)

You may also want to add a rule that for all future extensions, any
command that requires data in the server response (other than the server
response to NBD_CMD_READ) must include a 32-bit length as the first
field of its data payload, so that the server reply is always structured.

Hmm, I still think it would be hard to write a wireshark dissector for
server replies, since there is no indication whether a given reply will
include data or not.  The _client_ knows (well, any well-written client
that uses a different value for 'handle' for every command sent to the
server), because it can read the returned 'handle' field to know what
command it matches to and therefore what data to expect; but a
third-party observer seeing _just_ the server messages has no idea which
server responses should have payload.  Scanning the stream and assuming
that a new reply starts each time NBD_REPLY_MAGIC is encountered is a
close approximation, but hits false positives if the data being returned
for NBD_CMD_READ contains the same magic number as part of its contents.
 Obviously, back-compat says we can't change the response to
NBD_CMD_READ, but that means that a wireshark dissector has to either
maintain context, or hunt through returned data looking for potential
magic numbers and possibly hitting false positives.

That said, maybe we want to allow global flag negotiation prior to
transmission to add a new flag on both server and client side - the
server would advertise that it is capable of a new reply mode, and the
client then has to reply that it wants to use the new reply mode; in
that mode, all server replies starting with magic 0x67446698
(NBD_REPLY_MAGIC) will never have a data payload, and all commands that
cause a reply with payload (both NBD_CMD_READ and the new
NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
will reply with a NEW magic number:

S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
S: 32 bits, error
S: 64 bits, handle
S: 32 bits, length
S: length bytes of data

so that the server's data stream is fully structured without having to
maintain context of the client's requests.  That is, a dissector can now
merely scan for both magic numbers; and on a stream between a client and
server that have negotiated the new mode, the old magic number will
never have a payload, and the new magic number will always be
accompanied with a payload that describes how much data to read to the
boundary of the next reply.

For that matter, right now, NBD_CMD_READ is required to either end the
session or return length bytes of data even when error is non-zero (but
where those bytes MAY be invalid); but by adding a negotiated flag for
structured length replies, it would be possible to allow for short reads
and/or returning an error with 0 bytes of payload but still keeping the
connection to the client open, without having to send wasted bytes over
the wire.

I could write up a negotiation of global flags for structured reply
lengths as an extension proposal, if you think it is worth it.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 22:08   ` [Qemu-devel] " Eric Blake
@ 2016-03-25  8:49     ` Wouter Verhelst
  2016-03-25  9:01       ` Alex Bligh
                         ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-25  8:49 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Markus Pargmann, Denis V. Lunev

On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> 
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> > +
> > +      - 32 bits, length of parameter data that follow (unsigned)
> > +      - zero or more LBA status descriptors, each having the following
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> > +
> > +    unless an error condition has occurred.
> 
> To date, only the NBD_CMD_READ command caused the server to send data
> after the handle in its reply.  This would be the second command with a
> data field in the response, but it is returning a variable amount of
> data, not directly tied to the client's length - but at least you did
> make it structured so that the client knows how much to read.  However,
> your patch is incomplete; you'll need to edit the "Transmission" section
> of the document to mention the rules on the server sending data, as the
> existing text would now be wrong:
> 
> > The server replies with:
> > 
> > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: (length bytes of data if the request is of type NBD_CMD_READ)
> 
> You may also want to add a rule that for all future extensions, any
> command that requires data in the server response (other than the server
> response to NBD_CMD_READ) must include a 32-bit length as the first
> field of its data payload, so that the server reply is always structured.

Right.

> Hmm, I still think it would be hard to write a wireshark dissector for
> server replies, since there is no indication whether a given reply will
> include data or not.

Well, a wireshark dissector already exists. However, it is very old; it
doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
deal with negotiation at all. It was written when newstyle negotiation
didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
according to the guy who wrote it (don't remember who that was).

> The _client_ knows (well, any well-written client
> that uses a different value for 'handle' for every command sent to the
> server), because it can read the returned 'handle' field to know what
> command it matches to and therefore what data to expect; but a
> third-party observer seeing _just_ the server messages has no idea which
> server responses should have payload.

It can if it knows enough about the protocol, but granted, that makes it
harder for us to extend the protocol without breaking the dissector.

> Scanning the stream and assuming that a new reply starts each time
> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> false positives if the data being returned for NBD_CMD_READ contains
> the same magic number as part of its contents.

Indeed.

> Obviously, back-compat says we can't change the response to
> NBD_CMD_READ, but that means that a wireshark dissector has to either
> maintain context, or hunt through returned data looking for potential
> magic numbers and possibly hitting false positives.
>
> That said, maybe we want to allow global flag negotiation prior to
> transmission to add a new flag on both server and client side - the
> server would advertise that it is capable of a new reply mode, and the
> client then has to reply that it wants to use the new reply mode; in

Global flag negotiation is already possible. There is a client flags
field, which is sent before option haggling, that could be used for
negotiation of such backwards-incompatible features.

> that mode, all server replies starting with magic 0x67446698
> (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> cause a reply with payload (both NBD_CMD_READ and the new
> NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> will reply with a NEW magic number:
> 
> S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
> S: 32 bits, error
> S: 64 bits, handle
> S: 32 bits, length
> S: length bytes of data

I like this. However, before committing to it, I would like to see
Markus' feedback on that (explicit Cc added, even though he's on the
list, too).

We'd also need a way to communicate the ability to speak this protocol
from the kernel to userspace before telling the server that the client
understands something which maybe its kernel doesn't.

Markus?

> so that the server's data stream is fully structured without having to
> maintain context of the client's requests.  That is, a dissector can now
> merely scan for both magic numbers; and on a stream between a client and
> server that have negotiated the new mode, the old magic number will
> never have a payload, and the new magic number will always be
> accompanied with a payload that describes how much data to read to the
> boundary of the next reply.
> 
> For that matter, right now, NBD_CMD_READ is required to either end the
> session or return length bytes of data even when error is non-zero (but
> where those bytes MAY be invalid); but by adding a negotiated flag for
> structured length replies, it would be possible to allow for short reads
> and/or returning an error with 0 bytes of payload but still keeping the
> connection to the client open, without having to send wasted bytes over
> the wire.

Yes. This has been discussed on the nbd-general list in the past. There
is also the (significant) problem of the server having maybe already
sent out the header before discovering there is an error, at which point
it can *only* drop the connection.

> I could write up a negotiation of global flags for structured reply
> lengths as an extension proposal, if you think it is worth it.

I think it is worth it...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-25  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
@ 2016-03-25  9:01       ` Alex Bligh
  2016-03-28 15:58       ` Eric Blake
  2016-04-04 10:18       ` Markus Pargmann
  2 siblings, 0 replies; 54+ messages in thread
From: Alex Bligh @ 2016-03-25  9:01 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, Alex Bligh, qemu-devel,
	Stefan stefanha@redhat. com, Denis V. Lunev, Paolo Bonzini


On 25 Mar 2016, at 08:49, Wouter Verhelst <w@uter.be> wrote:

> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.

Indeed.

I think where we got to last time this was discussed was that the
server could have the option of returning 'chunks' of reply, each
chunk either being a {length, data} tuple, or an error. The total
data transmitted would add up to the full length if there is no
error.

>> I could write up a negotiation of global flags for structured reply
>> lengths as an extension proposal, if you think it is worth it.
> 
> I think it is worth it...

+1 - for NBD_CMD_READ too

-- 
Alex Bligh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-25  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-03-25  9:01       ` Alex Bligh
@ 2016-03-28 15:58       ` Eric Blake
  2016-04-04 10:32         ` Markus Pargmann
  2016-04-04 10:18       ` Markus Pargmann
  2 siblings, 1 reply; 54+ messages in thread
From: Eric Blake @ 2016-03-28 15:58 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Markus Pargmann, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 4659 bytes --]

On 03/25/2016 02:49 AM, Wouter Verhelst wrote:

>> You may also want to add a rule that for all future extensions, any
>> command that requires data in the server response (other than the server
>> response to NBD_CMD_READ) must include a 32-bit length as the first
>> field of its data payload, so that the server reply is always structured.
> 
> Right.
> 
>> Hmm, I still think it would be hard to write a wireshark dissector for
>> server replies, since there is no indication whether a given reply will
>> include data or not.
> 
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).

And I've never written a wireshark dissector myself, to know how easy or
hard it would be to extend that work.

> 
>> The _client_ knows (well, any well-written client
>> that uses a different value for 'handle' for every command sent to the
>> server), because it can read the returned 'handle' field to know what
>> command it matches to and therefore what data to expect; but a
>> third-party observer seeing _just_ the server messages has no idea which
>> server responses should have payload.
> 
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
> 
>> Scanning the stream and assuming that a new reply starts each time
>> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
>> false positives if the data being returned for NBD_CMD_READ contains
>> the same magic number as part of its contents.
> 
> Indeed.

One benefit of TCP: we can rely on packet boundaries (whether or not
fragmentation is also happening); I'm not sure if UDP shares the same
benefits (then again, I'm not even sure if UDP is usable for the NBD
protocol, since we definitely rely on in-order delivery of packets: a
read command can definitely return more bytes than even a jumbo frame
can contain, even though we do allow out-of-order delivery of replies).
 So if the server always sends each reply as the start of its own packet
(rather than coalescing multiple replies into a single network packet),
and the dissector only looks for magic numbers at the start of a packet,
then any server packet not starting with the magic number must be data
payload, and you could potentially even avoid the false positives by
choosing to break data replies into packets by adjusting packet lengths
by one byte any time the next data chunk would otherwise happen to start
with the same two bytes as the magic number.  But I haven't actually
tested any of this, to know if packet coalescing goes on, and I
certainly don't think it is worth complicating the server to avoid false
positive magic number detection by splitting read payloads across packet
boundaries, just for the benefit of wire-sniffing.

> I like this. However, before committing to it, I would like to see
> Markus' feedback on that (explicit Cc added, even though he's on the
> list, too).
> 
> We'd also need a way to communicate the ability to speak this protocol
> from the kernel to userspace before telling the server that the client
> understands something which maybe its kernel doesn't.

proto.md already documents that ioctl()s exist for the user space to
inform the kernel about options sent by the server prior to kicking off
transmission phase, and that the NBD protocol itself does not go into
full detail about available ioctl()s (the kernel/userspace coordination
does not affect the protocol).  It seems fairly straightforward for the
kernel to supply another ioctl() that userspace can use to learn whether
it is allowed or forbidden from advertising structured reply status
during the handshake phase (including the case where the ioctl() is not
present being treated as the client must not enable structured replies).

> 
> Markus?

Markus is on vacation for a bit, so we'll just have to wait for a reply.

> 
>> I could write up a negotiation of global flags for structured reply
>> lengths as an extension proposal, if you think it is worth it.
> 
> I think it is worth it...

Okay, I'll give it a shot, in a separate thread.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-24 16:47                 ` Wouter Verhelst
@ 2016-03-29  9:38                   ` Kevin Wolf
  2016-03-29  9:53                     ` Wouter Verhelst
  2016-03-29 10:25                     ` Paolo Bonzini
  0 siblings, 2 replies; 54+ messages in thread
From: Kevin Wolf @ 2016-03-29  9:38 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, qemu-devel, Stefan Hajnoczi, Denis V. Lunev, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2185 bytes --]

Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben:
> On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > > >> On 24/03/2016 16:25, Eric Blake wrote:
> > > >>>> However, let's make these bits, so that
> > > >>>>
> > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > > >>>
> > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > > >>> safe default?
> > > >>
> > > >> Double negations are evil (and don't work the same in all languages), so
> > > >> I think it's a worse option.
> > > > 
> > > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > > but that just means we need a better name (one that doesn't contain
> > > > "un-" or "not")
> > > > 
> > > > I like the idea of having zero be the "sensible" default, although I'm
> > > > quite unable to come up with a better name myself.
> > > 
> > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > > such a state.
> > 
> > How about NBD_STATE_HOLE?
> 
> Both will work, although I like NBD_STATE_TRIM slightly better because
> it indeed nicely references NBD_CMD_TRIM.

I just thought that "trim" sounds more like an action than a status, and
while the reason for a hole to exist can be a previous TRIM command,
another option is that it's an area in an image that just has never been
written to. In that case "trim" would be a misnomer.

> However, I also think it should then be made clear that issuing
> NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
> that region if the backend storage format dosn't support that, to avoid
> confusion later on.

Good point. That might be another reason for not calling the status
"trim".

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-29  9:38                   ` Kevin Wolf
@ 2016-03-29  9:53                     ` Wouter Verhelst
  2016-03-29 10:25                     ` Paolo Bonzini
  1 sibling, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-03-29  9:53 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: nbd-general, qemu-devel, Stefan Hajnoczi, Paolo Bonzini, Denis V. Lunev

On Tue, Mar 29, 2016 at 11:38:35AM +0200, Kevin Wolf wrote:
> Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben:
> > On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> > > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > > > >> On 24/03/2016 16:25, Eric Blake wrote:
> > > > >>>> However, let's make these bits, so that
> > > > >>>>
> > > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > > > >>>
> > > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > > > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > > > >>> safe default?
> > > > >>
> > > > >> Double negations are evil (and don't work the same in all languages), so
> > > > >> I think it's a worse option.
> > > > > 
> > > > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > > > but that just means we need a better name (one that doesn't contain
> > > > > "un-" or "not")
> > > > > 
> > > > > I like the idea of having zero be the "sensible" default, although I'm
> > > > > quite unable to come up with a better name myself.
> > > > 
> > > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > > > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > > > such a state.
> > > 
> > > How about NBD_STATE_HOLE?
> > 
> > Both will work, although I like NBD_STATE_TRIM slightly better because
> > it indeed nicely references NBD_CMD_TRIM.
> 
> I just thought that "trim" sounds more like an action than a status, and
> while the reason for a hole to exist can be a previous TRIM command,
> another option is that it's an area in an image that just has never been
> written to. In that case "trim" would be a misnomer.

Point. It could be "TRIMMED" instead, I suppose.

> > However, I also think it should then be made clear that issuing
> > NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
> > that region if the backend storage format dosn't support that, to avoid
> > confusion later on.
> 
> Good point. That might be another reason for not calling the status
> "trim".

Also a good point...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-29  9:38                   ` Kevin Wolf
  2016-03-29  9:53                     ` Wouter Verhelst
@ 2016-03-29 10:25                     ` Paolo Bonzini
  1 sibling, 0 replies; 54+ messages in thread
From: Paolo Bonzini @ 2016-03-29 10:25 UTC (permalink / raw)
  To: Kevin Wolf, Wouter Verhelst
  Cc: nbd-general, Denis V. Lunev, qemu-devel, Stefan Hajnoczi


[-- Attachment #1.1: Type: text/plain, Size: 616 bytes --]



On 29/03/2016 11:38, Kevin Wolf wrote:
> > > How about NBD_STATE_HOLE?
> > 
> > Both will work, although I like NBD_STATE_TRIM slightly better because
> > it indeed nicely references NBD_CMD_TRIM.
> 
> I just thought that "trim" sounds more like an action than a status, and
> while the reason for a hole to exist can be a previous TRIM command,
> another option is that it's an area in an image that just has never been
> written to. In that case "trim" would be a misnomer.

I agree with Kevin.  My preference is still on NBD_STATE_ALLOCATED, but
NBD_STATE_HOLE is a fine name as well.

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-25  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
  2016-03-25  9:01       ` Alex Bligh
  2016-03-28 15:58       ` Eric Blake
@ 2016-04-04 10:18       ` Markus Pargmann
  2016-04-04 16:54         ` Eric Blake
  2016-04-04 22:17         ` Wouter Verhelst
  2 siblings, 2 replies; 54+ messages in thread
From: Markus Pargmann @ 2016-04-04 10:18 UTC (permalink / raw)
  To: Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 7678 bytes --]

Hi,

back from my easter vacation. A bit surprised to find 200 mails in the
NBD mailing list ;).

On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> > On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > > 
> > > With the availability of sparse storage formats, it is often needed to
> > > query status of a particular LBA range and read only those blocks of
> > > data that are actually present on the block device.
> > > 
> > > To provide such information, the patch adds GET_LBA_STATUS extension
> > > with one new NBD_CMD_GET_LBA_STATUS command.
> > > 
> > 
> > > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > > +
> > > +    An LBA range status query request. Length and offset define the range
> > > +    of interest. The server MUST reply with a reply header, followed
> > > +    immediately by the following data:
> > > +
> > > +      - 32 bits, length of parameter data that follow (unsigned)
> > > +      - zero or more LBA status descriptors, each having the following
> > > +        structure:
> > > +
> > > +        * 64 bits, offset (unsigned)
> > > +        * 32 bits, length (unsigned)
> > > +        * 16 bits, status (unsigned)
> > > +
> > > +    unless an error condition has occurred.
> > 
> > To date, only the NBD_CMD_READ command caused the server to send data
> > after the handle in its reply.  This would be the second command with a
> > data field in the response, but it is returning a variable amount of
> > data, not directly tied to the client's length - but at least you did
> > make it structured so that the client knows how much to read.  However,
> > your patch is incomplete; you'll need to edit the "Transmission" section
> > of the document to mention the rules on the server sending data, as the
> > existing text would now be wrong:
> > 
> > > The server replies with:
> > > 
> > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > > S: 32 bits, error
> > > S: 64 bits, handle
> > > S: (length bytes of data if the request is of type NBD_CMD_READ)
> > 
> > You may also want to add a rule that for all future extensions, any
> > command that requires data in the server response (other than the server
> > response to NBD_CMD_READ) must include a 32-bit length as the first
> > field of its data payload, so that the server reply is always structured.
> 
> Right.
> 
> > Hmm, I still think it would be hard to write a wireshark dissector for
> > server replies, since there is no indication whether a given reply will
> > include data or not.
> 
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).
> 
> > The _client_ knows (well, any well-written client
> > that uses a different value for 'handle' for every command sent to the
> > server), because it can read the returned 'handle' field to know what
> > command it matches to and therefore what data to expect; but a
> > third-party observer seeing _just_ the server messages has no idea which
> > server responses should have payload.
> 
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
> 
> > Scanning the stream and assuming that a new reply starts each time
> > NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> > false positives if the data being returned for NBD_CMD_READ contains
> > the same magic number as part of its contents.
> 
> Indeed.
> 
> > Obviously, back-compat says we can't change the response to
> > NBD_CMD_READ, but that means that a wireshark dissector has to either
> > maintain context, or hunt through returned data looking for potential
> > magic numbers and possibly hitting false positives.
> >
> > That said, maybe we want to allow global flag negotiation prior to
> > transmission to add a new flag on both server and client side - the
> > server would advertise that it is capable of a new reply mode, and the
> > client then has to reply that it wants to use the new reply mode; in
> 
> Global flag negotiation is already possible. There is a client flags
> field, which is sent before option haggling, that could be used for
> negotiation of such backwards-incompatible features.
> 
> > that mode, all server replies starting with magic 0x67446698
> > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> > cause a reply with payload (both NBD_CMD_READ and the new
> > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> > will reply with a NEW magic number:
> > 
> > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: 32 bits, length
> > S: length bytes of data
> 
> I like this. However, before committing to it, I would like to see
> Markus' feedback on that (explicit Cc added, even though he's on the
> list, too).
> 
> We'd also need a way to communicate the ability to speak this protocol
> from the kernel to userspace before telling the server that the client
> understands something which maybe its kernel doesn't.
> 
> Markus?
> 
> > so that the server's data stream is fully structured without having to
> > maintain context of the client's requests.  That is, a dissector can now
> > merely scan for both magic numbers; and on a stream between a client and
> > server that have negotiated the new mode, the old magic number will
> > never have a payload, and the new magic number will always be
> > accompanied with a payload that describes how much data to read to the
> > boundary of the next reply.
> > 
> > For that matter, right now, NBD_CMD_READ is required to either end the
> > session or return length bytes of data even when error is non-zero (but
> > where those bytes MAY be invalid); but by adding a negotiated flag for
> > structured length replies, it would be possible to allow for short reads
> > and/or returning an error with 0 bytes of payload but still keeping the
> > connection to the client open, without having to send wasted bytes over
> > the wire.
> 
> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.

I am still not through all the new mails on the list, so there may be
some more discussions about this. But I will answer here simply.

I generally like the idea of having this new kind of reply. Is this
solving our issues where we want to "stream" data directly from a
filedescriptor into a tcp-stream?

Does it make sense to extend this for reads with an offset? This way we
could not only send in chunks but also order them randomly. Is there any
use-case where it does make sense to read data not sequentially?

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-28 15:58       ` Eric Blake
@ 2016-04-04 10:32         ` Markus Pargmann
  0 siblings, 0 replies; 54+ messages in thread
From: Markus Pargmann @ 2016-04-04 10:32 UTC (permalink / raw)
  To: Eric Blake
  Cc: nbd-general, Kevin Wolf, Denis V. Lunev, qemu-devel,
	Stefan Hajnoczi, Paolo Bonzini, Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 5410 bytes --]

Hi,

On Monday 28 March 2016 09:58:52 Eric Blake wrote:
> On 03/25/2016 02:49 AM, Wouter Verhelst wrote:
> 
> >> You may also want to add a rule that for all future extensions, any
> >> command that requires data in the server response (other than the server
> >> response to NBD_CMD_READ) must include a 32-bit length as the first
> >> field of its data payload, so that the server reply is always structured.
> > 
> > Right.
> > 
> >> Hmm, I still think it would be hard to write a wireshark dissector for
> >> server replies, since there is no indication whether a given reply will
> >> include data or not.
> > 
> > Well, a wireshark dissector already exists. However, it is very old; it
> > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> > deal with negotiation at all. It was written when newstyle negotiation
> > didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> > according to the guy who wrote it (don't remember who that was).
> 
> And I've never written a wireshark dissector myself, to know how easy or
> hard it would be to extend that work.
> 
> > 
> >> The _client_ knows (well, any well-written client
> >> that uses a different value for 'handle' for every command sent to the
> >> server), because it can read the returned 'handle' field to know what
> >> command it matches to and therefore what data to expect; but a
> >> third-party observer seeing _just_ the server messages has no idea which
> >> server responses should have payload.
> > 
> > It can if it knows enough about the protocol, but granted, that makes it
> > harder for us to extend the protocol without breaking the dissector.
> > 
> >> Scanning the stream and assuming that a new reply starts each time
> >> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> >> false positives if the data being returned for NBD_CMD_READ contains
> >> the same magic number as part of its contents.
> > 
> > Indeed.
> 
> One benefit of TCP: we can rely on packet boundaries (whether or not
> fragmentation is also happening); I'm not sure if UDP shares the same
> benefits (then again, I'm not even sure if UDP is usable for the NBD
> protocol, since we definitely rely on in-order delivery of packets: a
> read command can definitely return more bytes than even a jumbo frame
> can contain, even though we do allow out-of-order delivery of replies).
>  So if the server always sends each reply as the start of its own packet
> (rather than coalescing multiple replies into a single network packet),
> and the dissector only looks for magic numbers at the start of a packet,
> then any server packet not starting with the magic number must be data
> payload, and you could potentially even avoid the false positives by
> choosing to break data replies into packets by adjusting packet lengths
> by one byte any time the next data chunk would otherwise happen to start
> with the same two bytes as the magic number.  But I haven't actually
> tested any of this, to know if packet coalescing goes on, and I
> certainly don't think it is worth complicating the server to avoid false
> positive magic number detection by splitting read payloads across packet
> boundaries, just for the benefit of wire-sniffing.
> 
> > I like this. However, before committing to it, I would like to see
> > Markus' feedback on that (explicit Cc added, even though he's on the
> > list, too).
> > 
> > We'd also need a way to communicate the ability to speak this protocol
> > from the kernel to userspace before telling the server that the client
> > understands something which maybe its kernel doesn't.
> 
> proto.md already documents that ioctl()s exist for the user space to
> inform the kernel about options sent by the server prior to kicking off
> transmission phase, and that the NBD protocol itself does not go into
> full detail about available ioctl()s (the kernel/userspace coordination
> does not affect the protocol).  It seems fairly straightforward for the
> kernel to supply another ioctl() that userspace can use to learn whether
> it is allowed or forbidden from advertising structured reply status
> during the handshake phase (including the case where the ioctl() is not
> present being treated as the client must not enable structured replies).

Depending on the implementation we may not even need to communicate the
used NBD protocol from userspace to the kernel. We have a different
magic and the magic stays at the beginning of the message so we can
simply use the magic to decide what message we have. Of course that
would need another receive which may be too slow.

For the other direction, giving the userspace information about the
capabilities of the kernel implementation, it may be better to use a
sysfs entry? All current ioctls are used for the exact nbd device it is
used on. But implementation capabilities are a common property over all
nbd devices.

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
                     ` (2 preceding siblings ...)
  2016-03-24 22:08   ` [Qemu-devel] " Eric Blake
@ 2016-04-04 16:40   ` Eric Blake
  2016-04-04 20:16   ` Denis V. Lunev
  4 siblings, 0 replies; 54+ messages in thread
From: Eric Blake @ 2016-04-04 16:40 UTC (permalink / raw)
  To: Denis V. Lunev, nbd-general, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Pavel Borzenkov, Stefan Hajnoczi,
	Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 1698 bytes --]

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)

I've posted a v2 of this proposal under a new title, rebased on top of
the recent work to add structured replies, and trying to take into
account a number of the suggestions that occurred in this thread.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-04-04 10:18       ` Markus Pargmann
@ 2016-04-04 16:54         ` Eric Blake
  2016-04-04 22:17         ` Wouter Verhelst
  1 sibling, 0 replies; 54+ messages in thread
From: Eric Blake @ 2016-04-04 16:54 UTC (permalink / raw)
  To: Markus Pargmann, Wouter Verhelst
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

[-- Attachment #1: Type: text/plain, Size: 1942 bytes --]

On 04/04/2016 04:18 AM, Markus Pargmann wrote:
> Hi,
> 
> back from my easter vacation. A bit surprised to find 200 mails in the
> NBD mailing list ;).
> 

>> Yes. This has been discussed on the nbd-general list in the past. There
>> is also the (significant) problem of the server having maybe already
>> sent out the header before discovering there is an error, at which point
>> it can *only* drop the connection.
> 
> I am still not through all the new mails on the list, so there may be
> some more discussions about this. But I will answer here simply.
> 
> I generally like the idea of having this new kind of reply. Is this
> solving our issues where we want to "stream" data directly from a
> filedescriptor into a tcp-stream?

I'm a relative newcomer on the NBD list, so I'm not sure what the issue
was there, but yes, structured replies DOES help with read semantics
that encounter a partial error that can be detected only after you have
started the reply stream.

> 
> Does it make sense to extend this for reads with an offset? This way we
> could not only send in chunks but also order them randomly. Is there any
> use-case where it does make sense to read data not sequentially?

In fact, that's what already got committed while you were gone.  It may
help if you jump in and read the current state of proto.md, if you don't
want to plow through all the mail churn in the meantime for how we got
to the state committed.  And of course, if you have suggestions on how
to further improve it, the extension is still documented as
experimental, so we can further tweak it before releasing upstream NBD
or downstream QEMU implementations of the extension (I'm currently
coding up the work on a qemu implementation, and will report on anything
that I run into during that process).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
                     ` (3 preceding siblings ...)
  2016-04-04 16:40   ` [Qemu-devel] " Eric Blake
@ 2016-04-04 20:16   ` Denis V. Lunev
  2016-04-04 20:36     ` [Qemu-devel] [Nbd] " Eric Blake
  4 siblings, 1 reply; 54+ messages in thread
From: Denis V. Lunev @ 2016-04-04 20:16 UTC (permalink / raw)
  To: nbd-general, qemu-devel
  Cc: Kevin Wolf, Wouter Verhelst, Pavel Borzenkov, Stefan Hajnoczi,
	Paolo Bonzini

On 03/23/2016 05:16 PM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
>
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
>
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
>
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
>
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 82 insertions(+)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index cda213c..fff515d 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
>     `NBD_CMD_TRIM` commands
>   - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
>     supports `NBD_CMD_WRITE_ZEROES` commands
> +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> +  supports `NBD_CMD_GET_LBA_STATUS` commands
>   
>   ##### Client flags
>   
> @@ -477,6 +479,10 @@ The following request types exist:
>   
>       Defined by the experimental `WRITE_ZEROES` extension; see below.
>   
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    Defined by the experimental `GET_LBA_STATUS` extension; see below.
> +
>   * Other requests
>   
>       Some third-party implementations may require additional protocol
> @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
>   including one or more sectors beyond the size of the device. It SHOULD
>   return `EPERM` if it receives a write zeroes request on a read-only export.
>   
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:
> +
> +      - 32 bits, length of parameter data that follow (unsigned)
> +      - zero or more LBA status descriptors, each having the following
> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)
> +
> +    unless an error condition has occurred.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MUST then either close the
> +    connection, or send *length of parameter data* bytes of data
> +    (which MAY be invalid).
> +
> +    The type of information required by the client is passed to server in the
> +    command flags field. If the server does not implement requested type or
> +    have no means to express it, it MUST NOT return an error, but instead MUST
> +    return a single LBA status descriptor with *offset* and *length* equal to
> +    the *offset* and *length* from request, and *status* set to `0`.
> +
> +    The following request types are currently defined for the command:
> +
> +    1. Block provisioning state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> +    the provisioning state of the device. The following provisionnig states
> +    are defined for the command:
> +
> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> +        and contains zeroes;
> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> +        block device. A client MUST NOT make any assumptions about the
> +        contents of the extent.
> +
> +    2. Block dirtiness state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> +    the dirtiness status of the device. The following dirtiness states
> +    are defined for the command:
> +
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
we can add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) here as additional hint

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-04-04 20:16   ` Denis V. Lunev
@ 2016-04-04 20:36     ` Eric Blake
  0 siblings, 0 replies; 54+ messages in thread
From: Eric Blake @ 2016-04-04 20:36 UTC (permalink / raw)
  To: Denis V. Lunev, nbd-general, qemu-devel
  Cc: Kevin Wolf, Paolo Bonzini, Pavel Borzenkov, Hajnoczi, Wouter Verhelst

[-- Attachment #1: Type: text/plain, Size: 1442 bytes --]

On 04/04/2016 02:16 PM, Denis V. Lunev wrote:

>> +    The following request types are currently defined for the command:
>> +
>> +    1. Block provisioning state
>> +
>> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
>> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
>> +    the provisioning state of the device. The following provisionnig states
>> +    are defined for the command:
>> +
>> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
>> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
>> +        and contains zeroes;
>> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
>> +        block device. A client MUST NOT make any assumptions about the
>> +        contents of the extent.

> we can add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) here as additional hint

No, DEALLOCATED and HOLE are the same thing, and we really want the
status to be a bitwise-OR of flags (bit 0: is it allocated or
deallocated. bit 1: is it unknown content or all 0), rather than a set
of 3 states, since it really is possible to have all four combinations
of those two orthognal status information. That was one of the topics
already hashed out in the v1 conversation, and fixed in v2.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
  2016-04-04 10:18       ` Markus Pargmann
  2016-04-04 16:54         ` Eric Blake
@ 2016-04-04 22:17         ` Wouter Verhelst
  1 sibling, 0 replies; 54+ messages in thread
From: Wouter Verhelst @ 2016-04-04 22:17 UTC (permalink / raw)
  To: Markus Pargmann
  Cc: nbd-general, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

On Mon, Apr 04, 2016 at 12:18:37PM +0200, Markus Pargmann wrote:
> Hi,
> 
> back from my easter vacation. A bit surprised to find 200 mails in the
> NBD mailing list ;).

I'm sure you were :-)

[...]
> > Yes. This has been discussed on the nbd-general list in the past. There
> > is also the (significant) problem of the server having maybe already
> > sent out the header before discovering there is an error, at which point
> > it can *only* drop the connection.
> 
> I am still not through all the new mails on the list, so there may be
> some more discussions about this. But I will answer here simply.
> 
> I generally like the idea of having this new kind of reply. Is this
> solving our issues where we want to "stream" data directly from a
> filedescriptor into a tcp-stream?

The eventual suggestion does, yes (as I'm sure you've found out by now).

> Does it make sense to extend this for reads with an offset?

I'm not sure what you mean by that. "reads with an offset"? Don't all our reads
have an offset?

> This way we could not only send in chunks but also order them randomly. Is
> there any use-case where it does make sense to read data not sequentially?

If you just mean "read replies with offset", then yes :-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2016-04-04 22:17 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-23 14:16 [Qemu-devel] [PATCH 0/2] NBD protocol extensions: WRITE_ZEROES and GET_LBA_STATUS Denis V. Lunev
2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
2016-03-23 15:14   ` Eric Blake
2016-03-23 17:40     ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-24  7:16     ` [Qemu-devel] " Pavel Borzenkov
2016-03-24  7:36       ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-23 17:21   ` Wouter Verhelst
2016-03-24  7:57     ` Pavel Borzenkov
2016-03-24  8:26       ` Wouter Verhelst
2016-03-24 11:35         ` Pavel Borzenkov
2016-03-24 11:37         ` Paolo Bonzini
2016-03-24 12:31           ` Wouter Verhelst
2016-03-24 14:53         ` Eric Blake
2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
2016-03-23 16:27   ` Eric Blake
2016-03-24 12:30     ` Pavel Borzenkov
2016-03-24 15:04       ` Eric Blake
2016-03-24 16:36         ` Pavel Borzenkov
2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-23 18:14     ` Kevin Wolf
2016-03-24  8:25       ` Pavel Borzenkov
2016-03-24  8:41         ` Wouter Verhelst
2016-03-24 11:36           ` Pavel Borzenkov
2016-03-24 12:32             ` Wouter Verhelst
2016-03-24  8:43     ` Pavel Borzenkov
2016-03-24  9:33       ` Wouter Verhelst
2016-03-24 10:32         ` Alex Bligh
2016-03-24 11:58           ` Paolo Bonzini
2016-03-24 12:17             ` Alex Bligh
2016-03-24 12:32               ` Paolo Bonzini
2016-03-24 13:31                 ` Alex Bligh
2016-03-24 13:32                   ` Paolo Bonzini
2016-03-24 11:55     ` Paolo Bonzini
2016-03-24 12:43       ` Wouter Verhelst
2016-03-24 15:25       ` Eric Blake
2016-03-24 15:33         ` Paolo Bonzini
2016-03-24 15:53           ` Wouter Verhelst
2016-03-24 16:04             ` Eric Blake
2016-03-24 16:07               ` Kevin Wolf
2016-03-24 16:47                 ` Wouter Verhelst
2016-03-29  9:38                   ` Kevin Wolf
2016-03-29  9:53                     ` Wouter Verhelst
2016-03-29 10:25                     ` Paolo Bonzini
2016-03-24 22:08   ` [Qemu-devel] " Eric Blake
2016-03-25  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-25  9:01       ` Alex Bligh
2016-03-28 15:58       ` Eric Blake
2016-04-04 10:32         ` Markus Pargmann
2016-04-04 10:18       ` Markus Pargmann
2016-04-04 16:54         ` Eric Blake
2016-04-04 22:17         ` Wouter Verhelst
2016-04-04 16:40   ` [Qemu-devel] " Eric Blake
2016-04-04 20:16   ` Denis V. Lunev
2016-04-04 20:36     ` [Qemu-devel] [Nbd] " Eric Blake

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).