* [Qemu-devel] [PATCH] doc: Mention proper use of handle @ 2016-03-28 13:59 Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 2/1] doc: More details on flag negotiation Eric Blake ` (3 more replies) 0 siblings, 4 replies; 43+ messages in thread From: Eric Blake @ 2016-03-28 13:59 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel Although the proper use of the handle field during transmission phase was implied, it never hurts to make it more explicit that clients should alter the handle on each message, and the server repeat the handle unchanged, in order for the client to track when the server is sending replies out of order. Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/doc/proto.md b/doc/proto.md index 6d1cb34..d0102e0 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -200,7 +200,11 @@ S: 64 bits, handle S: (*length* bytes of data if the request is of type `NBD_CMD_READ`) Replies need not be sent in the same order as requests (i.e., requests -may be handled by the server asynchronously). +may be handled by the server asynchronously). Clients SHOULD send a +different value of handle for each request, and the server MUST use the +same value for handle as was sent by the client for each request that +the server is replying to, so that the client may correlate which +request is receiving a response. ## Values -- 2.5.5 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [Qemu-devel] [PATCH 2/1] doc: More details on flag negotiation 2016-03-28 13:59 [Qemu-devel] [PATCH] doc: Mention proper use of handle Eric Blake @ 2016-03-29 3:56 ` Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension Eric Blake ` (2 subsequent siblings) 3 siblings, 0 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 3:56 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel Add documentation that makes it clear that the server may add flags that the client does not recognize, but that the client may ignore those flags because the server will not change behavior without agreement; meanwhile, the client must not set flags the server does not recognize (since there is no second round of server reply, the only sane course of action is for the server to disconnect). Also, give a forward reference to the effect of negotiating NO_ZEROES on the server's reply to NBD_OPT_EXPORT_NAME, and call out the fact that none of the server's global flags should be used during oldstyle negotiation since a client has no chance to respond with the corresponding client flag. Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/doc/proto.md b/doc/proto.md index d0102e0..44579fc 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -103,8 +103,10 @@ C: 32 bits, flags This completes the initial phase of negotiation; the client and server now both know they understand the first version of the newstyle -handshake, with no options. What follows is a repeating group of -options. In non-fixed newstyle only one option can be set +handshake, with no options. The client SHOULD ignore any global flags +it does not recognize, while the server MUST close the connection if +it does not recognize the client's flags. What follows is a repeating +group of options. In non-fixed newstyle only one option can be set (`NBD_OPT_EXPORT_NAME`), and it is not optional. At this point, we move on to option haggling, during which point the @@ -126,7 +128,8 @@ about the used export: S: 64 bits, size of the export in bytes (unsigned) S: 16 bits, export flags -S: 124 bytes, zeroes (reserved) +S: 124 bytes, zeroes (reserved) (unless `NBD_FLAG_C_NO_ZEROES` was + negotiated by the client) If the server is unwilling to allow the export, it should close the connection. @@ -229,6 +232,10 @@ the first magic number. `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT send the 124 bytes of zero at the end of the negotiation. +The server MUST NOT set any other flags, and SHOULD NOT change behaviour +unless the client responds with a corresponding flag. The server MUST +NOT set any of these flags during oldstyle negotiation. + ##### Export flags This field of 16 bits is sent by the server after option haggling, or @@ -259,6 +266,10 @@ receiving the global flags from the server. set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124 bytes of zeroes at the end of the negotiation. +Clients SHOULD NOT set any other flags; the server MUST drop the +connection if the client sets an unknown flag, or a flag that does +not match something advertised by the server. + #### Option types These values are used in the "option" field during the option haggling -- 2.5.5 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-28 13:59 [Qemu-devel] [PATCH] doc: Mention proper use of handle Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 2/1] doc: More details on flag negotiation Eric Blake @ 2016-03-29 3:56 ` Eric Blake 2016-03-29 7:33 ` [Qemu-devel] [Nbd] " Wouter Verhelst ` (2 more replies) 2016-03-29 7:11 ` [Qemu-devel] [PATCH] doc: Mention proper use of handle Wouter Verhelst 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake 3 siblings, 3 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 3:56 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel The existing transmission phase protocol is difficult to sniff, because correct interpretation of the server stream requires context from the client stream (or risks false positives if data payloads happen to contain the protocol magic numbers). It also prohibits the ability to do short reads, or to return a read error without also sending length bytes of data. Remedy this by adding a new global/client flag negotiation, which affects the response of the NBD_CMD_READ command, and sets forth rules for how future command responses must behave when they carry a data payload. Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 123 insertions(+) diff --git a/doc/proto.md b/doc/proto.md index 44579fc..f687e3e 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -209,6 +209,10 @@ same value for handle as was sent by the client for each request that the server is replying to, so that the client may correlate which request is receiving a response. +Note that it is impossible to tell by reading just the server traffic +whether a data field will be present. To remedy this, the experimental +`Structured Reply` extension has been introduced; see below. + ## Values This section describes the value and meaning of constants (other than @@ -231,6 +235,8 @@ the first magic number. - bit 1, `NBD_FLAG_NO_ZEROES`; if set, and if the client replies with `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT send the 124 bytes of zero at the end of the negotiation. +- bit 2, `NBD_FLAG_STRUCTURED_REPLY`; defined by the experimental + `Structured Reply` extension; see below. The server MUST NOT set any other flags, and SHOULD NOT change behaviour unless the client responds with a corresponding flag. The server MUST @@ -265,6 +271,8 @@ receiving the global flags from the server. - bit 1, `NBD_FLAG_C_NO_ZEROES`; MUST NOT be set if the server did not set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124 bytes of zeroes at the end of the negotiation. +- bit 2, `NBD_FLAG_C_STRUCTURED_REPLY`; defined by the experimental + `Structured Reply` extension; see below. Clients SHOULD NOT set any other flags; the server MUST drop the connection if the client sets an unknown flag, or a flag that does @@ -435,6 +443,10 @@ The following request types exist: signalling no error), the server MUST immediately close the connection; it MUST NOT send any further data to the client. + The experimental `Structured Reply` extension changes the set of + valid replies, in part to allow recovery after a partial read; see + below. + * `NBD_CMD_WRITE` (1) A write request. Length and offset define the location and amount of @@ -609,6 +621,117 @@ option reply type. message if they do not also send it as a reply to the `NBD_OPT_SELECT` message. +### `Structured Reply` extension + +Some major downsides of the default reply to `NBD_CMD_READ` is that it +is not possible to support partial reads (the command must succeed or +fail as a whole, and len bytes of data must be sent even on an error, +unless the connection is closed); nor is it possible to decode the +server traffic without also knowing what pending read requests were +sent by the client. + +To remedy this, a `Structured Reply` extension is envisioned. This +extension adds a global flag, a client flag, a new reply type during +the transmission phase, and alters the set of valid replies to an +existing command. + +* `NBD_FLAG_STRUCTURED_REPLY` (bit 2) + + This is a global flag; if set, and if the client replies with + `NBD_FLAG_C_STRUCTURED_REPLY` in the client flags field, the server + MUST use structured replies to the `NBD_CMD_READ` command. + +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2) + + This is a client flag; MUST NOT be set if the server did not set + `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured + replies to the `NBD_CMD_READ` command. + +* Transmission phase + + The transmission phase includes a third message type: the structured + reply. If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the + normal server reply will never contain a data payload, and all + server replies that send data will be in the following form: + + S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) + S: 32 bits, error + S: 64 bits, handle + S: 32 bits, length of payload (unsigned) + S: *length* bytes of payload data + + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then the normal + server reply with a data payload will be used for `NBD_CMD_READ`; + but any other replies with a data payload will still use a + structured reply (that is, only `NBD_CMD_READ` is allowed to send + data in the non-structured form, and negotiating + `NBD_FLAG_C_STRUCTURED_REPLY` only affects replies to + `NBD_CMD_READ`). This implies that any new commands that require + data in the reply should be gated by their own new global and client + flag. A server MAY refuse to allow a client that negotiates a + command that requires a structured reply, but does not also + negotiate `NBD_FLAG_C_STRUCTURED_REPLY`. + + In the generic form, the length field of a structured response MAY + be zero if there is no data payload, and the error field may be set + regardless of the length field (although where possible, the server + SHOULD use a length of zero when setting the error field). However, + particular commands may document additional restrictions regarding + what forms a valid response (for example, a structured response to + `NBD_CMD_READ` MUST NOT set the error field, and MUST have a + non-zero length of at least 9). + +* `NBD_CMD_READ` + + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read + request MUST always be answered by a single response (with magic + 0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length + bytes of data according to the client's request, although those + bytes MAY be invalid if an error is returned, and the connection + MUST if an error occurs after a header claiming no error. + + Conversely, if the `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, the + response MUST be a sequence of zero or more structured replies (with + magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), followed by a + concluding normal response (0x67446698 `NBD_REPLY_MAGIC`), where the + final response MUST NOT have a data payload; all responses in the + sequence use the same handle from the client. The payload of each + intermediate structured reply, called a read chunk, MUST be of the + following form: + + S: 64 bits, offset (unsigned) + S: (*length* - 8) bytes of data + + Note that the amount of data returned in a read chunk is determined + by the length field of the structured reply, and not the original + length of the client's request. If the server sends a single read + chunk for a successful read, the server's length will be the + client's length plus 8, because the server must account for the + offset field in its reply. Similarly, a successful client request + for a read of 2^32-8 or more bytes MUST be split into at least two + read chunks, so that the length field does not suffer from overflow. + + The server MUST ensure that each read chunk lies within the original + offset and length of the original client request, MUST NOT send read + chunks that would cover the same offset more than once, and MUST + send at least one byte of data in addition to the offset field of + each read chunk. The server MAY send read chunks out of order, and + may interleave other responses between read replies. The server + MUST NOT set the error field of a read chunk; if an error occurs, it + MAY immediately end the sequence of structured response messages, + MUST send the error in the concluding normal response, and SHOULD + keep the connection open. The final non-structured response MUST + set an error unless the sum of data sent by all read chunks totals + the original client length request. + + The client SHOULD immediately close the connection if it detects + that the server has sent an offset more than once (whether or not + the overlapping data claimed to have the same contents), or if + receives the concluding normal reply without an error set but + without all bytes covered by read chunk(s). A future extension may + add a command flag that would allow the server to skip read chunks + for portions of the file that read as all zeroes. + ## About this file This file tries to document the NBD protocol as it is currently -- 2.5.5 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 3:56 ` [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension Eric Blake @ 2016-03-29 7:33 ` Wouter Verhelst 2016-03-29 8:24 ` Alex Bligh 2016-03-29 17:53 ` Wouter Verhelst 2 siblings, 0 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 7:33 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel Hi Eric, After applying some of the other outstanding patches, this one doesn't apply anymore. Can you rebase? On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote: > The existing transmission phase protocol is difficult to sniff, > because correct interpretation of the server stream requires > context from the client stream (or risks false positives if > data payloads happen to contain the protocol magic numbers). It > also prohibits the ability to do short reads, or to return a > read error without also sending length bytes of data. > > Remedy this by adding a new global/client flag negotiation, > which affects the response of the NBD_CMD_READ command, and sets > forth rules for how future command responses must behave when > they carry a data payload. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > doc/proto.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 123 insertions(+) > > diff --git a/doc/proto.md b/doc/proto.md > index 44579fc..f687e3e 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -209,6 +209,10 @@ same value for handle as was sent by the client for each request that > the server is replying to, so that the client may correlate which > request is receiving a response. > > +Note that it is impossible to tell by reading just the server traffic > +whether a data field will be present. To remedy this, the experimental > +`Structured Reply` extension has been introduced; see below. > + > ## Values > > This section describes the value and meaning of constants (other than > @@ -231,6 +235,8 @@ the first magic number. > - bit 1, `NBD_FLAG_NO_ZEROES`; if set, and if the client replies with > `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT > send the 124 bytes of zero at the end of the negotiation. > +- bit 2, `NBD_FLAG_STRUCTURED_REPLY`; defined by the experimental > + `Structured Reply` extension; see below. > > The server MUST NOT set any other flags, and SHOULD NOT change behaviour > unless the client responds with a corresponding flag. The server MUST > @@ -265,6 +271,8 @@ receiving the global flags from the server. > - bit 1, `NBD_FLAG_C_NO_ZEROES`; MUST NOT be set if the server did not > set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124 > bytes of zeroes at the end of the negotiation. > +- bit 2, `NBD_FLAG_C_STRUCTURED_REPLY`; defined by the experimental > + `Structured Reply` extension; see below. > > Clients SHOULD NOT set any other flags; the server MUST drop the > connection if the client sets an unknown flag, or a flag that does > @@ -435,6 +443,10 @@ The following request types exist: > signalling no error), the server MUST immediately close the > connection; it MUST NOT send any further data to the client. > > + The experimental `Structured Reply` extension changes the set of > + valid replies, in part to allow recovery after a partial read; see > + below. > + > * `NBD_CMD_WRITE` (1) > > A write request. Length and offset define the location and amount of > @@ -609,6 +621,117 @@ option reply type. > message if they do not also send it as a reply to the > `NBD_OPT_SELECT` message. > > +### `Structured Reply` extension > + > +Some major downsides of the default reply to `NBD_CMD_READ` is that it > +is not possible to support partial reads (the command must succeed or > +fail as a whole, and len bytes of data must be sent even on an error, > +unless the connection is closed); nor is it possible to decode the > +server traffic without also knowing what pending read requests were > +sent by the client. > + > +To remedy this, a `Structured Reply` extension is envisioned. This > +extension adds a global flag, a client flag, a new reply type during > +the transmission phase, and alters the set of valid replies to an > +existing command. > + > +* `NBD_FLAG_STRUCTURED_REPLY` (bit 2) > + > + This is a global flag; if set, and if the client replies with > + `NBD_FLAG_C_STRUCTURED_REPLY` in the client flags field, the server > + MUST use structured replies to the `NBD_CMD_READ` command. > + > +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2) > + > + This is a client flag; MUST NOT be set if the server did not set > + `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured > + replies to the `NBD_CMD_READ` command. > + > +* Transmission phase > + > + The transmission phase includes a third message type: the structured > + reply. If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the > + normal server reply will never contain a data payload, and all > + server replies that send data will be in the following form: > + > + S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) > + S: 32 bits, error > + S: 64 bits, handle > + S: 32 bits, length of payload (unsigned) > + S: *length* bytes of payload data > + > + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then the normal > + server reply with a data payload will be used for `NBD_CMD_READ`; > + but any other replies with a data payload will still use a > + structured reply (that is, only `NBD_CMD_READ` is allowed to send > + data in the non-structured form, and negotiating > + `NBD_FLAG_C_STRUCTURED_REPLY` only affects replies to > + `NBD_CMD_READ`). This implies that any new commands that require > + data in the reply should be gated by their own new global and client > + flag. A server MAY refuse to allow a client that negotiates a > + command that requires a structured reply, but does not also > + negotiate `NBD_FLAG_C_STRUCTURED_REPLY`. > + > + In the generic form, the length field of a structured response MAY > + be zero if there is no data payload, and the error field may be set > + regardless of the length field (although where possible, the server > + SHOULD use a length of zero when setting the error field). However, > + particular commands may document additional restrictions regarding > + what forms a valid response (for example, a structured response to > + `NBD_CMD_READ` MUST NOT set the error field, and MUST have a > + non-zero length of at least 9). > + > +* `NBD_CMD_READ` > + > + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read > + request MUST always be answered by a single response (with magic > + 0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length > + bytes of data according to the client's request, although those > + bytes MAY be invalid if an error is returned, and the connection > + MUST if an error occurs after a header claiming no error. > + > + Conversely, if the `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, the > + response MUST be a sequence of zero or more structured replies (with > + magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), followed by a > + concluding normal response (0x67446698 `NBD_REPLY_MAGIC`), where the > + final response MUST NOT have a data payload; all responses in the > + sequence use the same handle from the client. The payload of each > + intermediate structured reply, called a read chunk, MUST be of the > + following form: > + > + S: 64 bits, offset (unsigned) > + S: (*length* - 8) bytes of data > + > + Note that the amount of data returned in a read chunk is determined > + by the length field of the structured reply, and not the original > + length of the client's request. If the server sends a single read > + chunk for a successful read, the server's length will be the > + client's length plus 8, because the server must account for the > + offset field in its reply. Similarly, a successful client request > + for a read of 2^32-8 or more bytes MUST be split into at least two > + read chunks, so that the length field does not suffer from overflow. > + > + The server MUST ensure that each read chunk lies within the original > + offset and length of the original client request, MUST NOT send read > + chunks that would cover the same offset more than once, and MUST > + send at least one byte of data in addition to the offset field of > + each read chunk. The server MAY send read chunks out of order, and > + may interleave other responses between read replies. The server > + MUST NOT set the error field of a read chunk; if an error occurs, it > + MAY immediately end the sequence of structured response messages, > + MUST send the error in the concluding normal response, and SHOULD > + keep the connection open. The final non-structured response MUST > + set an error unless the sum of data sent by all read chunks totals > + the original client length request. > + > + The client SHOULD immediately close the connection if it detects > + that the server has sent an offset more than once (whether or not > + the overlapping data claimed to have the same contents), or if > + receives the concluding normal reply without an error set but > + without all bytes covered by read chunk(s). A future extension may > + add a command flag that would allow the server to skip read chunks > + for portions of the file that read as all zeroes. > + > ## About this file > > This file tries to document the NBD protocol as it is currently > -- > 2.5.5 > > > ------------------------------------------------------------------------------ > Transform Data into Opportunity. > Accelerate data analysis in your applications with > Intel Data Analytics Acceleration Library. > Click to learn more. > http://pubads.g.doubleclick.net/gampad/clk?id=278785471&iu=/4140 > _______________________________________________ > Nbd-general mailing list > Nbd-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nbd-general > -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 3:56 ` [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension Eric Blake 2016-03-29 7:33 ` [Qemu-devel] [Nbd] " Wouter Verhelst @ 2016-03-29 8:24 ` Alex Bligh 2016-03-29 14:21 ` Eric Blake 2016-03-29 17:53 ` Wouter Verhelst 2 siblings, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-29 8:24 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh On 29 Mar 2016, at 04:56, Eric Blake <eblake@redhat.com> wrote: > The existing transmission phase protocol is difficult to sniff, > because correct interpretation of the server stream requires > context from the client stream (or risks false positives if > data payloads happen to contain the protocol magic numbers). It > also prohibits the ability to do short reads, or to return a > read error without also sending length bytes of data. > > Remedy this by adding a new global/client flag negotiation, > which affects the response of the NBD_CMD_READ command, and sets > forth rules for how future command responses must behave when > they carry a data payload. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > doc/proto.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 123 insertions(+) > > diff --git a/doc/proto.md b/doc/proto.md > index 44579fc..f687e3e 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -209,6 +209,10 @@ same value for handle as was sent by the client for each request that > the server is replying to, so that the client may correlate which > request is receiving a response. > > +Note that it is impossible to tell by reading just the server traffic > +whether a data field will be present. To remedy this, the experimental > +`Structured Reply` extension has been introduced; see below. > + > ## Values > > This section describes the value and meaning of constants (other than > @@ -231,6 +235,8 @@ the first magic number. > - bit 1, `NBD_FLAG_NO_ZEROES`; if set, and if the client replies with > `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT > send the 124 bytes of zero at the end of the negotiation. > +- bit 2, `NBD_FLAG_STRUCTURED_REPLY`; defined by the experimental > + `Structured Reply` extension; see below. > > The server MUST NOT set any other flags, and SHOULD NOT change behaviour > unless the client responds with a corresponding flag. The server MUST > @@ -265,6 +271,8 @@ receiving the global flags from the server. > - bit 1, `NBD_FLAG_C_NO_ZEROES`; MUST NOT be set if the server did not > set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124 > bytes of zeroes at the end of the negotiation. > +- bit 2, `NBD_FLAG_C_STRUCTURED_REPLY`; defined by the experimental > + `Structured Reply` extension; see below. > > Clients SHOULD NOT set any other flags; the server MUST drop the > connection if the client sets an unknown flag, or a flag that does > @@ -435,6 +443,10 @@ The following request types exist: > signalling no error), the server MUST immediately close the > connection; it MUST NOT send any further data to the client. > > + The experimental `Structured Reply` extension changes the set of > + valid replies, in part to allow recovery after a partial read; see > + below. > + > * `NBD_CMD_WRITE` (1) > > A write request. Length and offset define the location and amount of > @@ -609,6 +621,117 @@ option reply type. > message if they do not also send it as a reply to the > `NBD_OPT_SELECT` message. > > +### `Structured Reply` extension > + > +Some major downsides of the default reply to `NBD_CMD_READ` is that it > +is not possible to support partial reads (the command must succeed or > +fail as a whole, and len bytes of data must be sent even on an error, > +unless the connection is closed); nor is it possible to decode the > +server traffic without also knowing what pending read requests were > +sent by the client. "Some major downsides is" does not agree grammatically, and this sentence is pretty convoluted. How about: Two of the major downsides of the default reply to `NBD_CMD_READ` (without structured replies) are as follows. Firstly, it is not possible to support partial reads (the command must succeed or fail as a whole, and len bytes of data must be sent even on an error unless the connection is closed). Secondly, it is not possible to decode the server traffic without also knowing what pending read requests were sent by the client. > +To remedy this, a `Structured Reply` extension is envisioned. This if it's actually part of the standard (i.e. if this diff is accepted) then its more than 'envisioned', it's 'specified', or 'available' or similar. > +extension adds a global flag, a client flag, a new reply type during > +the transmission phase, and alters the set of valid replies to an > +existing command. > + > +* `NBD_FLAG_STRUCTURED_REPLY` (bit 2) > + > + This is a global flag; if set, and if the client replies with > + `NBD_FLAG_C_STRUCTURED_REPLY` in the client flags field, the server > + MUST use structured replies to the `NBD_CMD_READ` command. > + > +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2) > + > + This is a client flag; MUST NOT be set if the server did not set *it* MUST not be set ... > + `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured > + replies to the `NBD_CMD_READ` command. > + > +* Transmission phase > + > + The transmission phase includes a third message type: the structured > + reply. If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the > + normal server reply will never contain a data payload, and all > + server replies that send data will be in the following form: > + > + S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) > + S: 32 bits, error > + S: 64 bits, handle > + S: 32 bits, length of payload (unsigned) > + S: *length* bytes of payload data Unless I'm missing something this doesn't entirely solve the problem. Imagine you are implementing NBD_CMD_READ (with structured reply) and are asked to read 4G of data. 1G in you find an error. You can't set the error ahead of time as you don't know (yet) there is an error. By the time you discover, you've already streamed 1G of data. What do you do now? And this section seems at odds with the section below (starting "Conversely" which describes an offset/length scheme not detailed above. I think you are saying that there can be one or more of these structured replies in reference to any request, in which case it should be made clearer, and not wait until NBD_CMD_READ. > + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then the normal > + server reply with a data payload will be used for `NBD_CMD_READ`; > + but any other replies with a data payload will still use a > + structured reply (that is, only `NBD_CMD_READ` is allowed to send > + data in the non-structured form, and negotiating > + `NBD_FLAG_C_STRUCTURED_REPLY` only affects replies to > + `NBD_CMD_READ`). This implies that any new commands that require > + data in the reply should be gated by their own new global and client > + flag. A server MAY refuse to allow a client that negotiates a > + command that requires a structured reply, but does not also > + negotiate `NBD_FLAG_C_STRUCTURED_REPLY`. > + > + In the generic form, the length field of a structured response MAY > + be zero if there is no data payload, and the error field may be set > + regardless of the length field (although where possible, the server > + SHOULD use a length of zero when setting the error field). However, > + particular commands may document additional restrictions regarding > + what forms a valid response (for example, a structured response to > + `NBD_CMD_READ` MUST NOT set the error field, and MUST have a > + non-zero length of at least 9). > + > +* `NBD_CMD_READ` > + > + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read > + request MUST always be answered by a single response (with magic > + 0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length > + bytes of data according to the client's request, although those > + bytes MAY be invalid if an error is returned, and the connection > + MUST if an error occurs after a header claiming no error. Word(s) apparently missing after "MUST" on last line. "MUST be closed by the server" I think. > + > + Conversely, if the `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, the > + response MUST be a sequence of zero or more structured replies (with > + magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), followed by a > + concluding normal response (0x67446698 `NBD_REPLY_MAGIC`), where the > + final response MUST NOT have a data payload; all responses in the > + sequence use the same handle from the client. The payload of each > + intermediate structured reply, called a read chunk, MUST be of the > + following form: > + > + S: 64 bits, offset (unsigned) > + S: (*length* - 8) bytes of data > + > + Note that the amount of data returned in a read chunk is determined > + by the length field of the structured reply, and not the original > + length of the client's request. If the server sends a single read > + chunk for a successful read, the server's length will be the > + client's length plus 8, because the server must account for the > + offset field in its reply. Similarly, a successful client request > + for a read of 2^32-8 or more bytes MUST be split into at least two > + read chunks, so that the length field does not suffer from overflow. OK, so the error is not sent with the chunk (fine), but at the end. > + > + The server MUST ensure that each read chunk lies within the original > + offset and length of the original client request, MUST NOT send read > + chunks that would cover the same offset more than once, and MUST > + send at least one byte of data in addition to the offset field of > + each read chunk. The server MAY send read chunks out of order, and > + may interleave other responses between read replies. I can see why it's attractive from the server's point of view to support out of order replies - for instance an NBD server with a back end like Ceph could use this to launch requests simultaneously to multiple backend stores. However, theoretically a server can now send single one byte packets back, which the client would have to reconstruct. Also, given new commands aren't available unless you support structured replies, you now have to support reassembly of replies (if you want to use new features) even if all your reads are (e.g.) 1k. Can I suggest that the client should be able to specify a minimum 'power of 2' chunk boundary, e.g. if the client says '1k', and its requests do not cross a 1k boundary, the server will guarantee to return them as a single chunk? > The server > + MUST NOT set the error field of a read chunk; if an error occurs, it > + MAY immediately end the sequence of structured response messages, > + MUST send the error in the concluding normal response, and SHOULD > + keep the connection open. The final non-structured response MUST > + set an error unless the sum of data sent by all read chunks totals > + the original client length request. add "and data for the entire range requested has been supplied." (I know this is technically implied by the fact data cannot be duplicated). > + > + The client SHOULD immediately close the connection if it detects > + that the server has sent an offset more than once (whether or not > + the overlapping data claimed to have the same contents), or if > + receives the concluding normal reply without an error set but > + without all bytes covered by read chunk(s). A future extension may > + add a command flag that would allow the server to skip read chunks > + for portions of the file that read as all zeroes. > + > ## About this file > > This file tries to document the NBD protocol as it is currently > -- > 2.5.5 > > > ------------------------------------------------------------------------------ > Transform Data into Opportunity. > Accelerate data analysis in your applications with > Intel Data Analytics Acceleration Library. > Click to learn more. > http://pubads.g.doubleclick.net/gampad/clk?id=278785471&iu=/4140 > _______________________________________________ > Nbd-general mailing list > Nbd-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nbd-general > -- Alex Bligh ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 8:24 ` Alex Bligh @ 2016-03-29 14:21 ` Eric Blake 2016-03-29 14:37 ` Alex Bligh 0 siblings, 1 reply; 43+ messages in thread From: Eric Blake @ 2016-03-29 14:21 UTC (permalink / raw) To: Alex Bligh; +Cc: nbd-general, Wouter Verhelst, qemu-devel [-- Attachment #1: Type: text/plain, Size: 7785 bytes --] On 03/29/2016 02:24 AM, Alex Bligh wrote: > > On 29 Mar 2016, at 04:56, Eric Blake <eblake@redhat.com> wrote: > >> The existing transmission phase protocol is difficult to sniff, >> because correct interpretation of the server stream requires >> context from the client stream (or risks false positives if >> data payloads happen to contain the protocol magic numbers). It >> also prohibits the ability to do short reads, or to return a >> read error without also sending length bytes of data. >> >> +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2) >> + >> + This is a client flag; MUST NOT be set if the server did not set > > *it* MUST not be set ... Copy-and-paste from above; I can correct the place I copied from. > >> + `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured >> + replies to the `NBD_CMD_READ` command. >> + >> +* Transmission phase >> + >> + The transmission phase includes a third message type: the structured >> + reply. If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the >> + normal server reply will never contain a data payload, and all >> + server replies that send data will be in the following form: >> + >> + S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) >> + S: 32 bits, error >> + S: 64 bits, handle >> + S: 32 bits, length of payload (unsigned) >> + S: *length* bytes of payload data > > Unless I'm missing something this doesn't entirely solve the problem. > Imagine you are implementing NBD_CMD_READ (with structured reply) > and are asked to read 4G of data. 1G in you find an error. You can't > set the error ahead of time as you don't know (yet) there is an > error. By the time you discover, you've already streamed 1G of data. > What do you do now? As the server, you can now either send 3G of (bogus) data followed by the concluding normal response with error set, or you can immediately send the normal response with error set and skip sending the remaining 3G of data. I guess what I need to add is that in transmission phase, most commands have exactly one response per request; but commands may document scenarios where there will be multiple responses to a single request. NBD_CMD_READ uses the multiple responses to make partial read and error handling possible. > > And this section seems at odds with the section below (starting > "Conversely" which describes an offset/length scheme not detailed > above. > > I think you are saying that there can be one or more of these > structured replies in reference to any request, in which > case it should be made clearer, and not wait until NBD_CMD_READ. Not for ANY request, but only for commands that document it. I envision that our proposal for NBD_CMD_GET_LBA_STATUS (or whatever we name it) will always use a single structured reply (if error is set, the payload is empty; otherwise, the payload is variable-sized but the length is part of the structured reply header). So far, only NBD_CMD_READ has a reason for multiple replies. >> +* `NBD_CMD_READ` >> + >> + If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read >> + request MUST always be answered by a single response (with magic >> + 0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length >> + bytes of data according to the client's request, although those >> + bytes MAY be invalid if an error is returned, and the connection >> + MUST if an error occurs after a header claiming no error. > > Word(s) apparently missing after "MUST" on last line. "MUST be > closed by the server" I think. Yep. Will fix in v2. >> + >> + The server MUST ensure that each read chunk lies within the original >> + offset and length of the original client request, MUST NOT send read >> + chunks that would cover the same offset more than once, and MUST >> + send at least one byte of data in addition to the offset field of >> + each read chunk. The server MAY send read chunks out of order, and >> + may interleave other responses between read replies. > > I can see why it's attractive from the server's point of view to > support out of order replies - for instance an NBD server with a back > end like Ceph could use this to launch requests simultaneously to > multiple backend stores. > > However, theoretically a server can now send single one byte packets > back, which the client would have to reconstruct. Yeah, but the reconstruction is easy; naively: while response_magic == structured: copy len-8 bytes of data from response to given offset response_magic == normal, read is complete Detecting overlap or incomplete reads would requires more complexity in the client, but I don't know that a client has to care (the protocol is specifically written that a client MAY detect bad servers, but not MUST; a client that assumes the server is well-behaved is still compliant). However, you DO have a point that the server SHOULD send data in reasonable-size chunks; and maybe I should propose a parallel extension where, when negotiated between client and server, the server will advertise minimum and preferred I/O sizes in response to the export name request (for example, a server backed by a real block device may have a minimum of 512 bytes or 4096 bytes, and a preferred size of 64k; while a server backed by a normal file system may have a minimum of 1 byte); then put in restrictions that a server SHOULD reject read/write requests where offset and length are not multiples of the minimum, and that the server SHOULD send read chunks aligned to the preferred size (with exceptions for the head and tail of a larger buffer that meets minimum alignment but not preferred alignment). > > Also, given new commands aren't available unless you support structured > replies, you now have to support reassembly of replies (if you want > to use new features) even if all your reads are (e.g.) 1k. Are you arguing that there should be a flag that controls whether reads must be in-order vs. reassembled? Reassembly must happen either way, the question is whether having a way to allow out-of-order for efficiency, vs. defaulting to in-order for easier computation, is worth it. > > Can I suggest that the client should be able to specify a minimum > 'power of 2' chunk boundary, e.g. if the client says '1k', and its > requests do not cross a 1k boundary, the server will guarantee to > return them as a single chunk? If we want to negotiate minimum and preferred transaction sizes, it should probably be done in a separate proposal. For this proposal, I think the best we can do is merely suggest that the server SHOULD keep read chunks suitably blocked (larger blocks lead to less overhead). > >> The server >> + MUST NOT set the error field of a read chunk; if an error occurs, it >> + MAY immediately end the sequence of structured response messages, >> + MUST send the error in the concluding normal response, and SHOULD >> + keep the connection open. The final non-structured response MUST >> + set an error unless the sum of data sent by all read chunks totals >> + the original client length request. > > add "and data for the entire range requested has been supplied." (I > know this is technically implied by the fact data cannot be duplicated). Sure. But keep in mind that if (when?) we add a flag for allowing the server to skip read chunks on holes, we'll have to tweak the wording to allow the server to send fewer chunks than the client's length, where the client must then assume zeroes for all chunks not received. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 14:21 ` Eric Blake @ 2016-03-29 14:37 ` Alex Bligh 2016-03-29 15:12 ` Eric Blake 0 siblings, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-29 14:37 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 3837 bytes --] Eric, > I guess what I need to add is that in transmission phase, most commands > have exactly one response per request; but commands may document > scenarios where there will be multiple responses to a single request. > NBD_CMD_READ uses the multiple responses to make partial read and error > handling possible Yes, this. > Yeah, but the reconstruction is easy; naively: > > while response_magic == structured: > copy len-8 bytes of data from response to given offset > response_magic == normal, read is complete It's easy if the result is written to memory. It's not easy if the purpose was (e.g.) to send it to a socket in a sendfile type way. It now requires the entire response be held in memory, which wasn't a requirement before. > Detecting overlap or incomplete reads would requires more complexity in > the client, but I don't know that a client has to care (the protocol is > specifically written that a client MAY detect bad servers, but not MUST; > a client that assumes the server is well-behaved is still compliant). Yep > However, you DO have a point that the server SHOULD send data in > reasonable-size chunks; and maybe I should propose a parallel extension > where, when negotiated between client and server, the server will > advertise minimum and preferred I/O sizes in response to the export name > request (for example, a server backed by a real block device may have a > minimum of 512 bytes or 4096 bytes, and a preferred size of 64k; while a > server backed by a normal file system may have a minimum of 1 byte); > then put in restrictions that a server SHOULD reject read/write requests > where offset and length are not multiples of the minimum, and that the > server SHOULD send read chunks aligned to the preferred size (with > exceptions for the head and tail of a larger buffer that meets minimum > alignment but not preferred alignment). What I'm really after is something that enables me to read 'nicely' in a manner where I won't get fragments. >> Also, given new commands aren't available unless you support structured >> replies, you now have to support reassembly of replies (if you want >> to use new features) even if all your reads are (e.g.) 1k. > > Are you arguing that there should be a flag that controls whether reads > must be in-order vs. reassembled? Reassembly must happen either way, > the question is whether having a way to allow out-of-order for > efficiency, vs. defaulting to in-order for easier computation, is worth it. No, that sounds overengineered. More a way of guaranteeing avoiding a fragmentation on 'simple' reads. Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it can always error the command. >>> The server >>> + MUST NOT set the error field of a read chunk; if an error occurs, it >>> + MAY immediately end the sequence of structured response messages, >>> + MUST send the error in the concluding normal response, and SHOULD >>> + keep the connection open. The final non-structured response MUST >>> + set an error unless the sum of data sent by all read chunks totals >>> + the original client length request. >> >> add "and data for the entire range requested has been supplied." (I >> know this is technically implied by the fact data cannot be duplicated). > > Sure. But keep in mind that if (when?) we add a flag for allowing the > server to skip read chunks on holes, we'll have to tweak the wording to > allow the server to send fewer chunks than the client's length, where > the client must then assume zeroes for all chunks not received. Or alternatively a chunk representing a hole. I wonder whether you might be better to extend the chunk structure by 4 bytes to allow for future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means the chunk is full of zeroes and has no payload). -- Alex Bligh [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 842 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 14:37 ` Alex Bligh @ 2016-03-29 15:12 ` Eric Blake 2016-03-29 16:37 ` Wouter Verhelst 2016-03-29 17:34 ` Alex Bligh 0 siblings, 2 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 15:12 UTC (permalink / raw) To: Alex Bligh; +Cc: nbd-general, Wouter Verhelst, qemu-devel [-- Attachment #1: Type: text/plain, Size: 2976 bytes --] On 03/29/2016 08:37 AM, Alex Bligh wrote: > Eric, > >> I guess what I need to add is that in transmission phase, most commands >> have exactly one response per request; but commands may document >> scenarios where there will be multiple responses to a single request. >> NBD_CMD_READ uses the multiple responses to make partial read and error >> handling possible > > Yes, this. > >> Are you arguing that there should be a flag that controls whether reads >> must be in-order vs. reassembled? Reassembly must happen either way, >> the question is whether having a way to allow out-of-order for >> efficiency, vs. defaulting to in-order for easier computation, is worth it. > > No, that sounds overengineered. > > More a way of guaranteeing avoiding a fragmentation on 'simple' reads. > Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it > can always error the command. Okay, that makes sense. Does reusing NBD_CMD_FLAG_FUA sound reasonable for this purpose, or should it be a new flag? I guess from the standpoint of client/server negotiation, we want to support this don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in export flags, so it sounds like a new flag is best. Next, should it be independently negotiated, or implied by negotiating NBD_FLAG_C_STRUCTURED_REPLIES? I'm leaning towards implied - it's all-or-none for use of the improved read structures. I'm also leaning towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm documenting that negotiating this particular global flag affects only the replies to NBD_CMD_READ (other commands may use structured replies, but those commands will be independently negotiated). >> >> Sure. But keep in mind that if (when?) we add a flag for allowing the >> server to skip read chunks on holes, we'll have to tweak the wording to >> allow the server to send fewer chunks than the client's length, where >> the client must then assume zeroes for all chunks not received. > > Or alternatively a chunk representing a hole. I wonder whether you > might be better to extend the chunk structure by 4 bytes to allow for > future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means > the chunk is full of zeroes and has no payload). Seems reasonable (then I need to reword everything from len-8 to instead be len-12 when dealing with data size within the len bytes of payload). Network traffic-wise, I think it's better to always send the chunk flags, than it would be to try and make the sending of the chunk flags dependent on the client's choice of command flags (that is, we already argued that wire format should never be changed based merely on command flags, as it makes the server stream harder to decipher in isolation). Thanks for the good feedback, by the way; it will make v2 better. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 15:12 ` Eric Blake @ 2016-03-29 16:37 ` Wouter Verhelst 2016-03-29 17:34 ` Alex Bligh 1 sibling, 0 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 16:37 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 5090 bytes --] On Tue, Mar 29, 2016 at 09:12:02AM -0600, Eric Blake wrote: > On 03/29/2016 08:37 AM, Alex Bligh wrote: > > Eric, > > > >> I guess what I need to add is that in transmission phase, most commands > >> have exactly one response per request; but commands may document > >> scenarios where there will be multiple responses to a single request. > >> NBD_CMD_READ uses the multiple responses to make partial read and error > >> handling possible > > > > Yes, this. > > > > >> Are you arguing that there should be a flag that controls whether reads > >> must be in-order vs. reassembled? Reassembly must happen either way, > >> the question is whether having a way to allow out-of-order for > >> efficiency, vs. defaulting to in-order for easier computation, is worth it. > > > > No, that sounds overengineered. > > > > More a way of guaranteeing avoiding a fragmentation on 'simple' reads. > > Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it > > can always error the command. > > Okay, that makes sense. Does reusing NBD_CMD_FLAG_FUA sound reasonable > for this purpose, or should it be a new flag? I guess from the > standpoint of client/server negotiation, we want to support this > don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in > export flags, so it sounds like a new flag is best. Next, should it be > independently negotiated, or implied by negotiating > NBD_FLAG_C_STRUCTURED_REPLIES? I'm leaning towards implied - it's I think it should be implied, yes. Having said that, There's only a limited number of flag bits available. We can obviously always add more flags by adding in a second flags field, but that introduces more backwards compatibility issues (would require another global flag to say "we support extended flags", which the client then has to ack, too, etc). As such, not using flag bits when we don't strictly need them is a feature. I'm not sure if this really needs to be negotiated using a flag bit. The NO_ZEROES thing was negotiated using a flag because NBD_OPT_EXPORT_NAME can't be upgraded, but that isn't the case here. This could easily be negotiated using some NBD_OPT thing: client->kernel: check whether structured replies are supported (if yes:) client->server: NBD_OPT_STRUCTURED_REPLIES server->client: NBD_REP_ACK (if supported, or NBD_REP_UNSUP if not) At this point, the server can send structured replies at leisure. We could also set a "support don't fragment" flag bit in the per-export flags field (which is a larger flags field than the global one), so that the client kernel knows it can request non-fragmented replies without requiring an extra kernel API (since per-export flags are passed to the kernel by way of the NBD_SET_FLAGS ioctl). (the spec can then possibly also say that the kernel MAY send structured replies without sending the "support don't fragment" bit, but that it then MUST NOT send fragmented replies -- although I'm not too sure whether that's a good idea) This would also get it more in line with the way the CMD_TRIM and CMD_FLUSH commands are negotiated (by way of a per-export flag). > all-or-none for use of the improved read structures. I'm also leaning > towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm That's probably a good idea too, yes (with obvious s/FLAG_C/OPT/ change as per above). > documenting that negotiating this particular global flag affects only > the replies to NBD_CMD_READ (other commands may use structured replies, > but those commands will be independently negotiated). Right. > >> Sure. But keep in mind that if (when?) we add a flag for allowing the > >> server to skip read chunks on holes, we'll have to tweak the wording to > >> allow the server to send fewer chunks than the client's length, where > >> the client must then assume zeroes for all chunks not received. > > > > Or alternatively a chunk representing a hole. I wonder whether you > > might be better to extend the chunk structure by 4 bytes to allow for > > future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means > > the chunk is full of zeroes and has no payload). > > Seems reasonable (then I need to reword everything from len-8 to instead > be len-12 when dealing with data size within the len bytes of payload). > Network traffic-wise, I think it's better to always send the chunk > flags, than it would be to try and make the sending of the chunk flags > dependent on the client's choice of command flags (that is, we already > argued that wire format should never be changed based merely on command > flags, as it makes the server stream harder to decipher in isolation). > > Thanks for the good feedback, by the way; it will make v2 better. Regards, -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 15:12 ` Eric Blake 2016-03-29 16:37 ` Wouter Verhelst @ 2016-03-29 17:34 ` Alex Bligh 2016-03-29 17:45 ` Eric Blake 1 sibling, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-29 17:34 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 3474 bytes --] On 29 Mar 2016, at 16:12, Eric Blake <eblake@redhat.com> wrote: >> >> More a way of guaranteeing avoiding a fragmentation on 'simple' reads. >> Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it >> can always error the command. > > Okay, that makes sense. Does reusing NBD_CMD_FLAG_FUA sound reasonable > for this purpose, or should it be a new flag? I guess from the > standpoint of client/server negotiation, we want to support this > don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in > export flags, so it sounds like a new flag is best. I think it should be separate from FUA, as there are possibly servers out there that support FUA but not this, but ... > Next, should it be > independently negotiated, or implied by negotiating > NBD_FLAG_C_STRUCTURED_REPLIES? I'm leaning towards implied - it's > all-or-none for use of the improved read structures. I would agree. I think if it supports the structured reply semantics, it should also support 'DF'. So if you know the server supports structured replies, you know you can set DF on them without any further requirements. > I'm also leaning > towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm > documenting that negotiating this particular global flag affects only > the replies to NBD_CMD_READ (other commands may use structured replies, > but those commands will be independently negotiated). I suspect the name is the thing that makes the least difference, and hence do not feel strongly at all, but: a) Why '_C_'? b) Throughout the current documentation you've called them 'structured replies', not 'structured reads' and said that in the future multiple commands might support them. So you should arguably call the flag '*_STRUCTURED_REPLY' or change the text. >>> Sure. But keep in mind that if (when?) we add a flag for allowing the >>> server to skip read chunks on holes, we'll have to tweak the wording to >>> allow the server to send fewer chunks than the client's length, where >>> the client must then assume zeroes for all chunks not received. >> >> Or alternatively a chunk representing a hole. I wonder whether you >> might be better to extend the chunk structure by 4 bytes to allow for >> future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means >> the chunk is full of zeroes and has no payload). > > Seems reasonable (then I need to reword everything from len-8 to instead > be len-12 when dealing with data size within the len bytes of payload). > Network traffic-wise, I think it's better to always send the chunk > flags, than it would be to try and make the sending of the chunk flags > dependent on the client's choice of command flags (that is, we already > argued that wire format should never be changed based merely on command > flags, as it makes the server stream harder to decipher in isolation). Absolutely. And that way if we have to add anything to the chunk (e.g. we run out of flags!), we can use one or more bits to indicate this. > Thanks for the good feedback, by the way; it will make v2 better. No problem. Some time ago I rewrote chunks of the nbd test suite and wrote the bit that tested parallel outstanding commands. At the back of my mind is whether I should extend the test suite to test this and how we could persuade a server to 'often fragment' so we can test reassembly (some form of debug setting on the server like 'max fragment size' or similar I suspect). -- Alex Bligh [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 842 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 17:34 ` Alex Bligh @ 2016-03-29 17:45 ` Eric Blake 2016-03-29 18:03 ` Wouter Verhelst 0 siblings, 1 reply; 43+ messages in thread From: Eric Blake @ 2016-03-29 17:45 UTC (permalink / raw) To: Alex Bligh; +Cc: nbd-general, Wouter Verhelst, qemu-devel [-- Attachment #1: Type: text/plain, Size: 3198 bytes --] On 03/29/2016 11:34 AM, Alex Bligh wrote: > > On 29 Mar 2016, at 16:12, Eric Blake <eblake@redhat.com> wrote: >>> >>> More a way of guaranteeing avoiding a fragmentation on 'simple' reads. >>> Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it >>> can always error the command. >> >> Okay, that makes sense. Does reusing NBD_CMD_FLAG_FUA sound reasonable >> for this purpose, or should it be a new flag? I guess from the >> standpoint of client/server negotiation, we want to support this >> don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in >> export flags, so it sounds like a new flag is best. > > I think it should be separate from FUA, as there are possibly > servers out there that support FUA but not this, but ... > >> Next, should it be >> independently negotiated, or implied by negotiating >> NBD_FLAG_C_STRUCTURED_REPLIES? I'm leaning towards implied - it's >> all-or-none for use of the improved read structures. > > I would agree. I think if it supports the structured reply semantics, > it should also support 'DF'. So if you know the server supports > structured replies, you know you can set DF on them without any > further requirements. Supporting DF merely transfers the burden of collection between server and client. I suspect that there are cases where the server does NOT want to support DF (because it would require the server to allocate memory to collect the data before sending a single structured read reply), just as you have stated that there are cases where the client has an additional burden if DF is not supported. So for v2, I'm going to explicitly document a DF export flag, and recommend (but not require) that the server support it. > >> I'm also leaning >> towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm >> documenting that negotiating this particular global flag affects only >> the replies to NBD_CMD_READ (other commands may use structured replies, >> but those commands will be independently negotiated). > > I suspect the name is the thing that makes the least difference, and > hence do not feel strongly at all, but: > > a) Why '_C_'? Matches existing convention on client flags; but Wouter's idea of using NBD_OPT_ instead of global/client flags is better, so the _C_ disappears in v2. > > b) Throughout the current documentation you've called them 'structured > replies', not 'structured reads' and said that in the future multiple > commands might support them. So you should arguably call the flag > '*_STRUCTURED_REPLY' or change the text. I'm changing the text, and favoring the name STRUCTURED_READ except in the description of the transmission phase, where Structured Reply is the header used for ANY form of reply with data (to make it more obvious that structured read is a subset of structured replies), while at the same time emphasizing that NBD_CMD_READ is the only command that can get away with data in a non-structured reply, and only when structured read was not negotiated. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 17:45 ` Eric Blake @ 2016-03-29 18:03 ` Wouter Verhelst 2016-03-29 18:07 ` Eric Blake 2016-03-29 18:09 ` Alex Bligh 0 siblings, 2 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 18:03 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel, Alex Bligh On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote: > On 03/29/2016 11:34 AM, Alex Bligh wrote: > > I would agree. I think if it supports the structured reply semantics, > > it should also support 'DF'. So if you know the server supports > > structured replies, you know you can set DF on them without any > > further requirements. > > Supporting DF merely transfers the burden of collection between server > and client. I suspect that there are cases where the server does NOT > want to support DF (because it would require the server to allocate > memory to collect the data before sending a single structured read > reply), There are other ways to handle that; e.g., the server could have a "request too large for non-fragmented read" error message. The spec should give a minimum size that the server MUST support (which should be reasonably large), and should state that a server MAY reply to any request with DF set for a block larger than that minimum, with that error. Otherwise the client could conceivably send out heaps of requests for (UINT32_MAX - 8) bytes with DF set and basically DoS the server. > just as you have stated that there are cases where the client > has an additional burden if DF is not supported. So for v2, I'm going > to explicitly document a DF export flag, and recommend (but not require) > that the server support it. I'd prefer not to see that. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:03 ` Wouter Verhelst @ 2016-03-29 18:07 ` Eric Blake 2016-03-29 18:19 ` Wouter Verhelst 2016-03-29 18:09 ` Alex Bligh 1 sibling, 1 reply; 43+ messages in thread From: Eric Blake @ 2016-03-29 18:07 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 1850 bytes --] On 03/29/2016 12:03 PM, Wouter Verhelst wrote: > On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote: >> On 03/29/2016 11:34 AM, Alex Bligh wrote: >>> I would agree. I think if it supports the structured reply semantics, >>> it should also support 'DF'. So if you know the server supports >>> structured replies, you know you can set DF on them without any >>> further requirements. >> >> Supporting DF merely transfers the burden of collection between server >> and client. I suspect that there are cases where the server does NOT >> want to support DF (because it would require the server to allocate >> memory to collect the data before sending a single structured read >> reply), > > There are other ways to handle that; e.g., the server could have a > "request too large for non-fragmented read" error message. The spec > should give a minimum size that the server MUST support (which should be > reasonably large), and should state that a server MAY reply to any > request with DF set for a block larger than that minimum, with that > error. How does 64k sound? > > Otherwise the client could conceivably send out heaps of requests for > (UINT32_MAX - 8) bytes with DF set and basically DoS the server. Point taken. > >> just as you have stated that there are cases where the client >> has an additional burden if DF is not supported. So for v2, I'm going >> to explicitly document a DF export flag, and recommend (but not require) >> that the server support it. > > I'd prefer not to see that. Okay, good thing I haven't sent v2 yet; I'm making more edits to drop the export flag, and require unconditional support for the command flag once structured reads are negotiated. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:07 ` Eric Blake @ 2016-03-29 18:19 ` Wouter Verhelst 2016-03-29 18:25 ` Eric Blake 0 siblings, 1 reply; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 18:19 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel On Tue, Mar 29, 2016 at 12:07:59PM -0600, Eric Blake wrote: > On 03/29/2016 12:03 PM, Wouter Verhelst wrote: > > On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote: > >> Supporting DF merely transfers the burden of collection between server > >> and client. I suspect that there are cases where the server does NOT > >> want to support DF (because it would require the server to allocate > >> memory to collect the data before sending a single structured read > >> reply), > > > > There are other ways to handle that; e.g., the server could have a > > "request too large for non-fragmented read" error message. The spec > > should give a minimum size that the server MUST support (which should be > > reasonably large), and should state that a server MAY reply to any > > request with DF set for a block larger than that minimum, with that > > error. > > How does 64k sound? Dunno. It might make sense for this number to be based upon some "standard" minimum request size in things like ATA or SCSI if such a number exists there, but I don't know enough about either standard to answer that question myself. If such a number doesn't exist (or nobody who knows speaks up soon enough), 64k is certainly good enough, I suppose. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:19 ` Wouter Verhelst @ 2016-03-29 18:25 ` Eric Blake 0 siblings, 0 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 18:25 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel [-- Attachment #1: Type: text/plain, Size: 1846 bytes --] On 03/29/2016 12:19 PM, Wouter Verhelst wrote: > On Tue, Mar 29, 2016 at 12:07:59PM -0600, Eric Blake wrote: >> On 03/29/2016 12:03 PM, Wouter Verhelst wrote: >>> On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote: >>>> Supporting DF merely transfers the burden of collection between server >>>> and client. I suspect that there are cases where the server does NOT >>>> want to support DF (because it would require the server to allocate >>>> memory to collect the data before sending a single structured read >>>> reply), >>> >>> There are other ways to handle that; e.g., the server could have a >>> "request too large for non-fragmented read" error message. The spec >>> should give a minimum size that the server MUST support (which should be >>> reasonably large), and should state that a server MAY reply to any >>> request with DF set for a block larger than that minimum, with that >>> error. >> >> How does 64k sound? > > Dunno. It might make sense for this number to be based upon some > "standard" minimum request size in things like ATA or SCSI if such a > number exists there, but I don't know enough about either standard to > answer that question myself. > > If such a number doesn't exist (or nobody who knows speaks up soon > enough), 64k is certainly good enough, I suppose. And as mentioned in another email, we may want to propose an independent extension that allows NBD_OPT_LIST and friends to start advertising the minimum and preferred sizes of operations on a given export, where the server can give hard errors if the client requests a read or write not aligned to the minimum, and where the server must not fail a DF set for anything smaller than preferred size. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:03 ` Wouter Verhelst 2016-03-29 18:07 ` Eric Blake @ 2016-03-29 18:09 ` Alex Bligh 1 sibling, 0 replies; 43+ messages in thread From: Alex Bligh @ 2016-03-29 18:09 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel, Alex Bligh On 29 Mar 2016, at 19:03, Wouter Verhelst <w@uter.be> wrote: > There are other ways to handle that; e.g., the server could have a > "request too large for non-fragmented read" error message. The spec > should give a minimum size that the server MUST support (which should be > reasonably large), and should state that a server MAY reply to any > request with DF set for a block larger than that minimum, with that > error. Yeah something like that. Or the server could simply publish this as part of the option negotiation. -- Alex Bligh ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 3:56 ` [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension Eric Blake 2016-03-29 7:33 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2016-03-29 8:24 ` Alex Bligh @ 2016-03-29 17:53 ` Wouter Verhelst 2016-03-29 18:23 ` Eric Blake 2 siblings, 1 reply; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 17:53 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel Hi Eric, Having read this in more detail now: On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote: > + The server MUST ensure that each read chunk lies within the original > + offset and length of the original client request, MUST NOT send read > + chunks that would cover the same offset more than once, and MUST > + send at least one byte of data in addition to the offset field of > + each read chunk. The server MAY send read chunks out of order, and > + may interleave other responses between read replies. The server > + MUST NOT set the error field of a read chunk; if an error occurs, it > + MAY immediately end the sequence of structured response messages, > + MUST send the error in the concluding normal response, and SHOULD > + keep the connection open. The final non-structured response MUST > + set an error unless the sum of data sent by all read chunks totals > + the original client length request. I'm thinking it would probably be a good idea to have the concluding response (if the error field is nonzero) have an offset too; the server could use that to specify where, exactly, the error occurred (so that a client which sent a very large read request doesn't have to go through a binary search or some such to figure out where the read error happened) i.e., C: read X bytes at offset Y S: (X bytes) S: (error, offset Z) client now has Z-1 bytes of valid data (with the rest being garbage, plus a read error) The alternative (in the above) would be that the client has 0 bytes of valid data, and would have to issue another read request to figure out which parts of the data are valid. > + The client SHOULD immediately close the connection if it detects > + that the server has sent an offset more than once (whether or not > + the overlapping data claimed to have the same contents), or if > + receives the concluding normal reply without an error set but > + without all bytes covered by read chunk(s). A future extension may I would reword this to... The client MAY immediately close the connection if it detects that [...]. The server MUST NOT send an offset more than once. > + add a command flag that would allow the server to skip read chunks > + for portions of the file that read as all zeroes. Not sure if that part is necessary or helpful, really. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 17:53 ` Wouter Verhelst @ 2016-03-29 18:23 ` Eric Blake 2016-03-29 18:51 ` Wouter Verhelst 0 siblings, 1 reply; 43+ messages in thread From: Eric Blake @ 2016-03-29 18:23 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel [-- Attachment #1: Type: text/plain, Size: 5511 bytes --] On 03/29/2016 11:53 AM, Wouter Verhelst wrote: > Hi Eric, > > Having read this in more detail now: > > On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote: >> + The server MUST ensure that each read chunk lies within the original >> + offset and length of the original client request, MUST NOT send read >> + chunks that would cover the same offset more than once, and MUST >> + send at least one byte of data in addition to the offset field of >> + each read chunk. The server MAY send read chunks out of order, and >> + may interleave other responses between read replies. The server >> + MUST NOT set the error field of a read chunk; if an error occurs, it >> + MAY immediately end the sequence of structured response messages, >> + MUST send the error in the concluding normal response, and SHOULD >> + keep the connection open. The final non-structured response MUST >> + set an error unless the sum of data sent by all read chunks totals >> + the original client length request. > > I'm thinking it would probably be a good idea to have the concluding > response (if the error field is nonzero) have an offset too; the server > could use that to specify where, exactly, the error occurred (so that a > client which sent a very large read request doesn't have to go through a > binary search or some such to figure out where the read error happened) > > i.e., > > C: read X bytes at offset Y > S: (X bytes) > S: (error, offset Z) Here, I'm assuming that you mean X > Z. Unfortunately, I chose the design of 0 or more structured replies followed by a normal reply, so that the normal reply is a reliable indicator that the read is complete (whether successful or not); and the whole goal of the extension is to avoid sending any data payload on a normal reply. I'm not sure how to send the offset in the normal reply without violating the premise that a normal reply has no payload. But what we could do is allow for the server to send a structured reply data chunk of zero bytes, with the offset in question, as the offset where an error occurred, prior to then sending the normal reply with the final error indicator. I guess that also means that if we don't have the DF command flag set, the server could then report multiple failed reads interspersed among larger successful clusters, when trying to recover as much of the failing disk as possible, if each failure is reported via a separate structured read of zero bytes. Hmm, that also means that we have to be careful on the wording - if we allow a structured reply with 0 data bytes to report an error, after already sending a larger reply with partially valid bytes, then that means that a client will receive more than one read chunk visiting the same offset, so we'd have to make the wording permit that. > client now has Z-1 bytes of valid data (with the rest being garbage, > plus a read error) > > The alternative (in the above) would be that the client has 0 bytes of > valid data, and would have to issue another read request to figure out > which parts of the data are valid. So if I'm understanding you, you are trying to state that the server may report the header for X bytes, then fail partway through those X bytes; it must still send X bytes, but can then report how many are valid (that is, a client must assume that 0 of the X bytes received are valid _unless_ the server also reported where it failed). But I was envisioning the opposite: the server must NOT send X bytes unless it knows they are valid; if it encounters a read error at Z, then it sends a structured read of Z-1 bytes before the final normal message that reports overall failure. The client then assumes that all X bytes received are valid. But I also documented that the client MAY, but not MUST, abort the read at the first error; so the idea of being able to report multiple errors and/or send headers prior to learning whether there are read errors means that your interpretation is probably safer than mine. I guess it will help to have actual v2 wording in front of us to further fine-tune the wording. > >> + The client SHOULD immediately close the connection if it detects >> + that the server has sent an offset more than once (whether or not >> + the overlapping data claimed to have the same contents), or if >> + receives the concluding normal reply without an error set but >> + without all bytes covered by read chunk(s). A future extension may > > I would reword this to... > > The client MAY immediately close the connection if it detects that > [...]. The server MUST NOT send an offset more than once. > >> + add a command flag that would allow the server to skip read chunks >> + for portions of the file that read as all zeroes. > > Not sure if that part is necessary or helpful, really. I envision such an extension in parallel to (or as part of) the proposed NBD_CMD_GET_LBA_STATUS (or whatever we name it) - it is slightly more efficient to skip reads of holes with a single read command flag than it is to first read status to determine where holes are and only then issue reads for the non-hole regions. But I can also buy your argument that such language belongs in the extension for sparse reads, and doesn't need to be present in the extension for structured reads. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:23 ` Eric Blake @ 2016-03-29 18:51 ` Wouter Verhelst 2016-03-29 19:06 ` Wouter Verhelst 2016-03-29 19:39 ` Alex Bligh 0 siblings, 2 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 18:51 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel On Tue, Mar 29, 2016 at 12:23:31PM -0600, Eric Blake wrote: > On 03/29/2016 11:53 AM, Wouter Verhelst wrote: > > Hi Eric, > > > > Having read this in more detail now: > > > > On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote: > >> + The server MUST ensure that each read chunk lies within the original > >> + offset and length of the original client request, MUST NOT send read > >> + chunks that would cover the same offset more than once, and MUST > >> + send at least one byte of data in addition to the offset field of > >> + each read chunk. The server MAY send read chunks out of order, and > >> + may interleave other responses between read replies. The server > >> + MUST NOT set the error field of a read chunk; if an error occurs, it > >> + MAY immediately end the sequence of structured response messages, > >> + MUST send the error in the concluding normal response, and SHOULD > >> + keep the connection open. The final non-structured response MUST > >> + set an error unless the sum of data sent by all read chunks totals > >> + the original client length request. > > > > I'm thinking it would probably be a good idea to have the concluding > > response (if the error field is nonzero) have an offset too; the server > > could use that to specify where, exactly, the error occurred (so that a > > client which sent a very large read request doesn't have to go through a > > binary search or some such to figure out where the read error happened) > > > > i.e., > > > > C: read X bytes at offset Y > > S: (X bytes) > > S: (error, offset Z) > > Here, I'm assuming that you mean X > Z. Yes, obviously. > Unfortunately, I chose the design of 0 or more structured replies > followed by a normal reply, so that the normal reply is a reliable > indicator that the read is complete (whether successful or not); and the > whole goal of the extension is to avoid sending any data payload on a > normal reply. I'm not sure how to send the offset in the normal reply > without violating the premise that a normal reply has no payload. Oh. I thought you meant for the concluding message to also be a structured reply with the length field be zero, but you mean for it to be a non-structured reply message? If so, you should clarify that a bit more (this wasn't clear to me)... [...] > But what we could do is allow for the server to send a structured reply > data chunk of zero bytes, with the offset in question, as the offset > where an error occurred, prior to then sending the normal reply with the > final error indicator. I guess that also means that if we don't have > the DF command flag set, the server could then report multiple failed > reads interspersed among larger successful clusters, when trying to > recover as much of the failing disk as possible, if each failure is > reported via a separate structured read of zero bytes. Hmm, that also > means that we have to be careful on the wording - if we allow a > structured reply with 0 data bytes to report an error, after already > sending a larger reply with partially valid bytes, then that means that > a client will receive more than one read chunk visiting the same offset, > so we'd have to make the wording permit that. > > > client now has Z-1 bytes of valid data (with the rest being garbage, > > plus a read error) > > > > The alternative (in the above) would be that the client has 0 bytes of > > valid data, and would have to issue another read request to figure out > > which parts of the data are valid. > > So if I'm understanding you, you are trying to state that the server may > report the header for X bytes, then fail partway through those X bytes; > it must still send X bytes, but can then report how many are valid (that > is, a client must assume that 0 of the X bytes received are valid > _unless_ the server also reported where it failed). Yes. > But I was envisioning the opposite: the server must NOT send X bytes > unless it knows they are valid; if it encounters a read error at Z, > then it sends a structured read of Z-1 bytes before the final normal > message that reports overall failure. The client then assumes that > all X bytes received are valid. The problem with that approach is that it makes it impossible for a server to use a sendfile()-like system call, where you don't know that there's a read error until start sending out data to the client (which implies that you must've already sent out the header). > But I also documented that the client MAY, but not MUST, abort the read > at the first error; so the idea of being able to report multiple errors > and/or send headers prior to learning whether there are read errors > means that your interpretation is probably safer than mine. I didn't mean to imply that. I do think that aborting the read at the first error is probably a good idea. If the error occurs because the disk is dying, having the server go ahead and try to read more data anyway is probably not a very good idea (unless instructed to do so by the user, i.e., client). > I guess it will help to have actual v2 wording in front of us to further > fine-tune the wording. Certainly :-) > >> + The client SHOULD immediately close the connection if it detects > >> + that the server has sent an offset more than once (whether or not > >> + the overlapping data claimed to have the same contents), or if > >> + receives the concluding normal reply without an error set but > >> + without all bytes covered by read chunk(s). A future extension may > > > > I would reword this to... > > > > The client MAY immediately close the connection if it detects that > > [...]. The server MUST NOT send an offset more than once. > > > >> + add a command flag that would allow the server to skip read chunks > >> + for portions of the file that read as all zeroes. > > > > Not sure if that part is necessary or helpful, really. > > I envision such an extension in parallel to (or as part of) the proposed > NBD_CMD_GET_LBA_STATUS (or whatever we name it) - it is slightly more > efficient to skip reads of holes with a single read command flag than it > is to first read status to determine where holes are and only then issue > reads for the non-hole regions. Sure. > But I can also buy your argument that such language belongs in the > extension for sparse reads, and doesn't need to be present in the > extension for structured reads. Right, that was my point, mainly. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:51 ` Wouter Verhelst @ 2016-03-29 19:06 ` Wouter Verhelst 2016-03-29 19:39 ` Alex Bligh 1 sibling, 0 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 19:06 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel On Tue, Mar 29, 2016 at 08:51:57PM +0200, Wouter Verhelst wrote: > On Tue, Mar 29, 2016 at 12:23:31PM -0600, Eric Blake wrote: > > Unfortunately, I chose the design of 0 or more structured replies > > followed by a normal reply, so that the normal reply is a reliable > > indicator that the read is complete (whether successful or not); and the > > whole goal of the extension is to avoid sending any data payload on a > > normal reply. I'm not sure how to send the offset in the normal reply > > without violating the premise that a normal reply has no payload. > > Oh. I thought you meant for the concluding message to also be a > structured reply with the length field be zero, but you mean for it to > be a non-structured reply message? If so, you should clarify that a bit > more (this wasn't clear to me)... Also, I'm not convinced that's a very good approach, since it also requires analyzers to have more context than just requiring a single final "empty" structured reply message. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 18:51 ` Wouter Verhelst 2016-03-29 19:06 ` Wouter Verhelst @ 2016-03-29 19:39 ` Alex Bligh 2016-03-29 20:00 ` Eric Blake 1 sibling, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-29 19:39 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel, Alex Bligh On 29 Mar 2016, at 19:51, Wouter Verhelst <w@uter.be> wrote: >> >> But I was envisioning the opposite: the server must NOT send X bytes >> unless it knows they are valid; if it encounters a read error at Z, >> then it sends a structured read of Z-1 bytes before the final normal >> message that reports overall failure. The client then assumes that >> all X bytes received are valid. > > The problem with that approach is that it makes it impossible for a > server to use a sendfile()-like system call, where you don't know that > there's a read error until start sending out data to the client (which > implies that you must've already sent out the header). I don't think sendfile semantics are ever compatible with reporting read errors *unless* you pad after the read. IIRC the way sendfile works is that you specify a pointer to an offset, and sendfile sends as much as it can read (up to the length specified) and updates the offset for the length read. Naturally at the start of the read section, you don't know when the error is going to occur, so you must say that the length of the data read is going to be the length of the actual chunk. sendfile then does its stuff, and fills up either the whole thing, or part of it. In the case that part of the data (only) is available, you can't report the error there and then, because the client is expecting chunk data, so you must either close the connection (potentially disruptive) or pad the data, and report the error at the end. Using Eric's current scheme, you have no way of knowing where the error occurred. Remember the chunks could be out of order, e.g. you get chunks 1,3,5,7,9 in, and then an error, so you have no idea where the error was. It could be in chunks 1,3,5,7,9 (and the server might have padded the rest of the chunk) or in an unread chunk (2,4,6,8,10). This seems undesirable. I think we are paying too much attention to trying to keep NBD_RESPONSE intact. The justification for this was (I think) that it made it easier for existing protocol analysers. It doesn't, really, as all the data is going to come BEFORE the NBD_RESPONSE (unlike in NBD_CMD_READ in other situations). I think we should therefore look at this the other way around. Here's a straw man proposal as an alternative for the reply bits. For a structured reply ALL we get is the chunks. The final chunk (possibly the only chunk) is marked specially. Each chunk looks something like: offset+ 0000 32 bit NBD_STRUCTURED_REPLY_MAGIC 0004 64 bit handle 000C 32 bit Flags 0010 32 bit Payload length We have a couple of flags defined: NBD_CHUNK_IS_DATA: the chunk is data, and the payload is a 64 bit offset plus the data read NBD_CHUNK_IS_HOLE: the chunk is zeroes, and the payload is a 64 bit offset (only) NBD_CHUNK_IS_END: (must be the final chunk). The payload is a 64 bit offset plus a 32 bit error code, or zero. If no error, the offset must be set to the total amount read. If there is an error, the offset MAY indicate the position of the error. If an error occurs, no more chunks should be sent. The advantages of this scheme are: 1. Only one packet type in the reply (chunks) 2. It's no more difficult to implement wireshark decoding of this (in addition to the normal NBD protocol) than the current proposal. I'd suggest in fact they could be easier. 3. Chunks that error part way through (sendfile type) must still be padded but now can indicate error location. 4. It would be possible to allow EVERY server reply to be a structured reply that simply set NBD_CHUNK_IS_END. That gives us a convenient route to servers which only implement structured replies. With DF, this would be little harder than implementing the current protocol. -- Alex Bligh ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 19:39 ` Alex Bligh @ 2016-03-29 20:00 ` Eric Blake 2016-03-29 20:18 ` Alex Bligh 2016-03-29 20:44 ` Alex Bligh 0 siblings, 2 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 20:00 UTC (permalink / raw) To: Alex Bligh, Wouter Verhelst; +Cc: nbd-general, qemu-devel [-- Attachment #1: Type: text/plain, Size: 3473 bytes --] On 03/29/2016 01:39 PM, Alex Bligh wrote: > I think we are paying too much attention to trying to keep NBD_RESPONSE > intact. The justification for this was (I think) that it made it easier > for existing protocol analysers. It doesn't, really, as all the data > is going to come BEFORE the NBD_RESPONSE (unlike in NBD_CMD_READ in > other situations). > > I think we should therefore look at this the other way around. Here's > a straw man proposal as an alternative for the reply bits. For > a structured reply ALL we get is the chunks. The final chunk > (possibly the only chunk) is marked specially. Each chunk looks something > like: > > offset+ > 0000 32 bit NBD_STRUCTURED_REPLY_MAGIC > 0004 64 bit handle > 000C 32 bit Flags > 0010 32 bit Payload length > > > We have a couple of flags defined: > > NBD_CHUNK_IS_DATA: the chunk is data, and the payload is a 64 bit offset > plus the data read > > NBD_CHUNK_IS_HOLE: the chunk is zeroes, and the payload is a 64 bit offset > (only) > > NBD_CHUNK_IS_END: (must be the final chunk). The payload is a 64 bit offset > plus a 32 bit error code, or zero. If no error, the offset must be set to > the total amount read. If there is an error, the offset MAY indicate the > position of the error. If an error occurs, no more chunks should be sent. I'm liking it - then we aren't sending a mandatory 0 error field on read chunks. Although I might use '32 bit Type' rather than '32 bit Flags', since you don't really want to allow multiple reply types at once; rather each reply type id is documented on its specific payload layout. Another argument in favor of this over my original proposal: my proposal is context-free for determining reply boundaries, but still requires context in deciphering between a reply to NBD_CMD_READ vs. a reply to NBD_CMD_GET_LBA_STATUS (that is, there was nothing in the reply that said what the payload represents, only how many bytes to skip). However, with a '32 bit Type', the wireshark detector can be taught every type of payload, and as long as every command with a structured reply uses 1 or more distinct types, the dissector can display more details about the payload when it recognizes the type, and still skip the payload on extensions it does not recognize. > > 4. It would be possible to allow EVERY server reply to be a structured > reply that simply set NBD_CHUNK_IS_END. That gives us a convenient > route to servers which only implement structured replies. With DF, > this would be little harder than implementing the current > protocol. For all remaining existing commands, that is just more overhead on the wire. The existing non-structured replies do not send any data; they are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one reply). But your proposal inflates that to a minimum of 20 bytes (if length is 0) or longer (if an error is set). I'm still strongly in favor of keeping the existing non-structured replies to commands that don't have to return data. I'm okay if a new command sometimes returns data and sometimes does not; in which case, that command can always return a single structured reply where the id of the reply says whether the payload will be length 0 or not, but only new commands should get that treatment. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 20:00 ` Eric Blake @ 2016-03-29 20:18 ` Alex Bligh 2016-03-29 20:44 ` Alex Bligh 1 sibling, 0 replies; 43+ messages in thread From: Alex Bligh @ 2016-03-29 20:18 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 245 bytes --] On 29 Mar 2016, at 21:00, Eric Blake <eblake@redhat.com> wrote: > I'm liking it - then we aren't sending a mandatory 0 error field on read > chunks. I'm writing it up as a strawman. I'll comment in a sec in further detail. -- Alex Bligh [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 842 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 20:00 ` Eric Blake 2016-03-29 20:18 ` Alex Bligh @ 2016-03-29 20:44 ` Alex Bligh 2016-03-29 21:05 ` Wouter Verhelst 1 sibling, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-29 20:44 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 2290 bytes --] Eric, > I'm liking it - then we aren't sending a mandatory 0 error field on read > chunks. Straw man patch sent through. Alternatively at: https://github.com/abligh/nbd/commit/3c40272704904ac74040ceb099fee0b44e355e1e and in markdown format at: https://github.com/abligh/nbd/blob/strawman-structured-reply/doc/proto.md Very much off the top of my head. > Although I might use '32 bit Type' rather than '32 bit Flags', > since you don't really want to allow multiple reply types at once; > rather each reply type id is documented on its specific payload layout. I used the bottom byte of the flags for this, so we can keep the flags. >> 4. It would be possible to allow EVERY server reply to be a structured >> reply that simply set NBD_CHUNK_IS_END. That gives us a convenient >> route to servers which only implement structured replies. With DF, >> this would be little harder than implementing the current >> protocol. > > For all remaining existing commands, that is just more overhead on the > wire. The existing non-structured replies do not send any data; they > are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one > reply). But your proposal inflates that to a minimum of 20 bytes (if > length is 0) or longer (if an error is set). I'm still strongly in > favor of keeping the existing non-structured replies to commands that > don't have to return data. I was saying that should be up to the server. If the server wants to write something easily decodable (and easier to maintain) at the expense of a few more bytes on the wire, then let it. If it wants to use unstructured replies occasionally, that's fine. > I'm okay if a new command sometimes returns data and sometimes does not; > in which case, that command can always return a single structured reply > where the id of the reply says whether the payload will be length 0 or > not, but only new commands should get that treatment. I think we are at cross purposes. I am saying that if you've negotiated structured replies, you should be be able to return either for any command, as the client can disambiguate using the magic number. Clearly some future commands might REQUIRE structured replies as there would be no way to represent them in an unstructured reply. -- Alex Bligh [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 842 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 20:44 ` Alex Bligh @ 2016-03-29 21:05 ` Wouter Verhelst 2016-03-29 22:05 ` Alex Bligh 0 siblings, 1 reply; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 21:05 UTC (permalink / raw) To: Alex Bligh; +Cc: nbd-general, qemu-devel Hi Alex, On Tue, Mar 29, 2016 at 09:44:39PM +0100, Alex Bligh wrote: > Eric, > > For all remaining existing commands, that is just more overhead on the > > wire. The existing non-structured replies do not send any data; they > > are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one > > reply). But your proposal inflates that to a minimum of 20 bytes (if > > length is 0) or longer (if an error is set). I'm still strongly in > > favor of keeping the existing non-structured replies to commands that > > don't have to return data. > > I was saying that should be up to the server. If the server wants to > write something easily decodable (and easier to maintain) at the expense > of a few more bytes on the wire, then let it. If it wants to use > unstructured replies occasionally, that's fine. In adding that flexibility, you're adding more code paths on the client (that need to be tested, etc), for (IMO) little benefit. I would instead prefer to specify per command whether the reply is going to be structured or not, and only have the read command be a special case were both are possible, for backwards compatibility only. That way, it can eventually be deprecated, too. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 21:05 ` Wouter Verhelst @ 2016-03-29 22:05 ` Alex Bligh 2016-03-29 22:45 ` Wouter Verhelst 0 siblings, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-29 22:05 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel, Alex Bligh On 29 Mar 2016, at 22:05, Wouter Verhelst <w@uter.be> wrote: >>> For all remaining existing commands, that is just more overhead on the >>> wire. The existing non-structured replies do not send any data; they >>> are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one >>> reply). But your proposal inflates that to a minimum of 20 bytes (if >>> length is 0) or longer (if an error is set). I'm still strongly in >>> favor of keeping the existing non-structured replies to commands that >>> don't have to return data. >> >> I was saying that should be up to the server. If the server wants to >> write something easily decodable (and easier to maintain) at the expense >> of a few more bytes on the wire, then let it. If it wants to use >> unstructured replies occasionally, that's fine. > > In adding that flexibility, you're adding more code paths on the client > (that need to be tested, etc), for (IMO) little benefit. > > I would instead prefer to specify per command whether the reply is going > to be structured or not, and only have the read command be a special > case were both are possible, for backwards compatibility only. That way, > it can eventually be deprecated, too. I guess this is what comes of doing more NBD server work than client work :-) I'd look at it the other way around and say that only one code path is being exercised on the server, and that having multiple types of reply depending on command builds fragility into the protocol. If you want no choice in response type for the server for any given session (i.e. code path minimisation on the client) my preference would be what Eric didn't like, i.e. always send structured replies for all commands if you negotiate structured replies, else always send unstructured replies. We're talking an overhead of 8 bytes here (flags & error offset); somehow I suspect the time to transmit 8 bytes is going to be negligible compared to disk time or the rest of the network tx/rx time. -- Alex Bligh ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 22:05 ` Alex Bligh @ 2016-03-29 22:45 ` Wouter Verhelst 2016-03-29 22:53 ` Alex Bligh 0 siblings, 1 reply; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 22:45 UTC (permalink / raw) To: Alex Bligh; +Cc: nbd-general, qemu-devel On Tue, Mar 29, 2016 at 11:05:38PM +0100, Alex Bligh wrote: > > On 29 Mar 2016, at 22:05, Wouter Verhelst <w@uter.be> wrote: > > >>> For all remaining existing commands, that is just more overhead on the > >>> wire. The existing non-structured replies do not send any data; they > >>> are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one > >>> reply). But your proposal inflates that to a minimum of 20 bytes (if > >>> length is 0) or longer (if an error is set). I'm still strongly in > >>> favor of keeping the existing non-structured replies to commands that > >>> don't have to return data. > >> > >> I was saying that should be up to the server. If the server wants to > >> write something easily decodable (and easier to maintain) at the expense > >> of a few more bytes on the wire, then let it. If it wants to use > >> unstructured replies occasionally, that's fine. > > > > In adding that flexibility, you're adding more code paths on the client > > (that need to be tested, etc), for (IMO) little benefit. > > > > I would instead prefer to specify per command whether the reply is going > > to be structured or not, and only have the read command be a special > > case were both are possible, for backwards compatibility only. That way, > > it can eventually be deprecated, too. > > I guess this is what comes of doing more NBD server work than client > work :-) I'd look at it the other way around and say that only one > code path is being exercised on the server, Yes, but both code paths need to _exist_, which isn't the case with having only one legal reply type per request type. The server just needs to send header X for replies A, B, C, and header Y for replies D, E, F. Forming the header is part of producing the reply type, and will be the same for every conversation -- except for read replies, where it could possibly be either (but that can't be avoided). > and that having multiple types of reply depending on command builds > fragility into the protocol. I'd think that having the legal reply type depend on context is actually more fragile. > If you want no choice in response type for the server for any given > session (i.e. code path minimisation on the client) my preference would > be what Eric didn't like, i.e. always send structured replies for > all commands if you negotiate structured replies, else always send > unstructured replies. Doing that requires performing a lookup to negotiated state (and a code branch) for every response type that can possibly be structured or nonstructured, and introduces exactly the two code paths that I think should be avoided. With what I'm suggesting, this will still be required for read requests, but only while we retain backwards compatibility. > We're talking an overhead of 8 bytes here (flags & error offset); > somehow I suspect the time to transmit 8 bytes is going to be > negligible compared to disk time or the rest of the network tx/rx > time. Sure, but I'm not worried about that. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH 3/1] doc: Propose Structured Replies extension 2016-03-29 22:45 ` Wouter Verhelst @ 2016-03-29 22:53 ` Alex Bligh 0 siblings, 0 replies; 43+ messages in thread From: Alex Bligh @ 2016-03-29 22:53 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel, Alex Bligh On 29 Mar 2016, at 23:45, Wouter Verhelst <w@uter.be> wrote: > Doing that requires performing a lookup to negotiated state (and a code > branch) for every response type that can possibly be structured or > nonstructured, and introduces exactly the two code paths that I think > should be avoided. > > With what I'm suggesting, this will still be required for read requests, > but only while we retain backwards compatibility. OK, I *think* I've encapsulated all 3 options in the revised version I just sent. From the other email: >> As a third option then: >> >> Each chunk consists of the following: >> >> S: 32 bits, 0x668e33ef, magic (NBD_STRUCTURED_REPLY_MAGIC) >> S: 8 bits: type >> S: 8 bits: reserved (must be zero) >> S: 16 bits, flags >> S: 64 bits, handle >> S: 32 bits, payload length S: (length bytes of payload data) >> >> The flags have the following meanings: >> >> • bits 0-15: reserved (server MUST set these to zero) > > That seems better in that context, yes. The reserved byte could later on > be assigned as extra flags if need be. Or for compatibility with NBD_CMD, it could be 2 16 bit quantities. Missed this on v2. > The reason why I suggested zero is that it doesn't require special-case > code. If an error offset implies that everything beyond that offset is > invalid, then having an offset of zero implies that the whole read is > invalid -- which is correct if the server encountered an error, but > doesn't know or doesn't want to say (for whatever reason) where. I think that's wrong. As reads can be disordered, then if you get a chunk at offset X, then a chunk at offset 0, then an end chunk specifying an error, you (at least in theory) need to disambiguate the two so you know whether the chunk at offset X was OK. That's why I'm using 0xffffffff (now) to say "don't know where the error is". -- Alex Bligh ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [PATCH] doc: Mention proper use of handle 2016-03-28 13:59 [Qemu-devel] [PATCH] doc: Mention proper use of handle Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 2/1] doc: More details on flag negotiation Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension Eric Blake @ 2016-03-29 7:11 ` Wouter Verhelst 2016-03-29 13:59 ` Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake 3 siblings, 1 reply; 43+ messages in thread From: Wouter Verhelst @ 2016-03-29 7:11 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel On Mon, Mar 28, 2016 at 07:59:15AM -0600, Eric Blake wrote: > Although the proper use of the handle field during transmission > phase was implied, it never hurts to make it more explicit that > clients should alter the handle on each message, and the server > repeat the handle unchanged, in order for the client to track > when the server is sending replies out of order. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > doc/proto.md | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/doc/proto.md b/doc/proto.md > index 6d1cb34..d0102e0 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -200,7 +200,11 @@ S: 64 bits, handle > S: (*length* bytes of data if the request is of type `NBD_CMD_READ`) > > Replies need not be sent in the same order as requests (i.e., requests > -may be handled by the server asynchronously). > +may be handled by the server asynchronously). Clients SHOULD send a > +different value of handle for each request, and the server MUST use the > +same value for handle as was sent by the client for each request that > +the server is replying to, so that the client may correlate which > +request is receiving a response. NAK. This implies that a client should not ever reuse handles, while it is legal for a client (and in fact the kernel does this) to reuse handles once the server has ack'd the request. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [PATCH] doc: Mention proper use of handle 2016-03-29 7:11 ` [Qemu-devel] [PATCH] doc: Mention proper use of handle Wouter Verhelst @ 2016-03-29 13:59 ` Eric Blake 0 siblings, 0 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 13:59 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel [-- Attachment #1: Type: text/plain, Size: 1687 bytes --] On 03/29/2016 01:11 AM, Wouter Verhelst wrote: > On Mon, Mar 28, 2016 at 07:59:15AM -0600, Eric Blake wrote: >> Although the proper use of the handle field during transmission >> phase was implied, it never hurts to make it more explicit that >> clients should alter the handle on each message, and the server >> repeat the handle unchanged, in order for the client to track >> when the server is sending replies out of order. >> >> >> Replies need not be sent in the same order as requests (i.e., requests >> -may be handled by the server asynchronously). >> +may be handled by the server asynchronously). Clients SHOULD send a >> +different value of handle for each request, and the server MUST use the >> +same value for handle as was sent by the client for each request that >> +the server is replying to, so that the client may correlate which >> +request is receiving a response. > > NAK. This implies that a client should not ever reuse handles, while it > is legal for a client (and in fact the kernel does this) to reuse > handles once the server has ack'd the request. Nothing a little word-smithing can't fix. I'll try a v2, probably along the lines of: Clients SHOULD use a handle that is distinct from all other currently pending transactions, but MAY reuse handles that are no longer in flight; handles need not be consecutive. In each reply, the server MUST use the same value for handle as was sent by the client in the corresponding request. In this way, the client can correlate which request is receiving a response. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* [Qemu-devel] [PATCH v2 0/3] NBD Structured Read 2016-03-28 13:59 [Qemu-devel] [PATCH] doc: Mention proper use of handle Eric Blake ` (2 preceding siblings ...) 2016-03-29 7:11 ` [Qemu-devel] [PATCH] doc: Mention proper use of handle Wouter Verhelst @ 2016-03-29 23:00 ` Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 1/3] NBD proto: add "Command flags" section Eric Blake ` (3 more replies) 3 siblings, 4 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 23:00 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel, alex I wrote this in parallel with Alex's strawman proposals, so I may have picked up on some of his ideas, while diverging in other places. Changes since v1: rebase, resend some pre-req patches, switch from global/client flag negotiation over to option negotiation, document a flags/type scheme in all structured replies, use ONLY structured replies in response to a structured read, make the server stream fully context-free (thanks to the type scheme), go into more details about error reporting by using two different structured errors (multiple errors each with offset, or single error with no offset). Eric Blake (2): doc: Mention proper use of handle doc: Propose Structured Read extension Pavel Borzenkov (1): NBD proto: add "Command flags" section doc/proto.md | 282 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 277 insertions(+), 5 deletions(-) -- 2.5.5 ^ permalink raw reply [flat|nested] 43+ messages in thread
* [Qemu-devel] [PATCH v2 1/3] NBD proto: add "Command flags" section 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake @ 2016-03-29 23:00 ` Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 2/3] doc: Mention proper use of handle Eric Blake ` (2 subsequent siblings) 3 siblings, 0 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 23:00 UTC (permalink / raw) To: nbd-general; +Cc: Denis V. Lunev, w, qemu-devel, alex, Pavel Borzenkov From: Pavel Borzenkov <pborzenkov@virtuozzo.com> Add separate "Command flags" section to make it clear which flags are currently defined by the protocol. Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Wouter Verhelst <w@uter.be> CC: Alex Bligh <alex@alex.org.uk> Message-Id: <1459161798-32120-4-git-send-email-den@openvz.org> [rearrange subsections to parallel Handshake phase, add more details] Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/doc/proto.md b/doc/proto.md index aaae0a2..3d49162 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -411,6 +411,20 @@ case that data is an error message suitable for display to the user. ### Transmission phase +#### Command flags + +This field of 16 bits is sent by the client with every request and provides +additional information to the server to execute the command. Refer to +the "Request types" section below for more details about how a given flag +affects a particular command. Clients MUST NOT set a command flag bit +that is not documented for the particular command; and whether a flag is +valid may depend on negotiation during the handshake phase. + +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`. SHOULD be + set to 1 if the client requires "Force Unit Access" mode of + operation. MUST NOT be set unless export flags included + `NBD_FLAG_SEND_FUA`. + #### Request types The following request types exist: -- 2.5.5 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [Qemu-devel] [PATCH v2 2/3] doc: Mention proper use of handle 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 1/3] NBD proto: add "Command flags" section Eric Blake @ 2016-03-29 23:00 ` Eric Blake 2016-03-29 23:01 ` [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension Eric Blake 2016-03-30 8:09 ` [Qemu-devel] [Nbd] [PATCH v2 0/3] NBD Structured Read Wouter Verhelst 3 siblings, 0 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 23:00 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel, alex Although the proper use of the handle field during transmission phase was implied, it never hurts to make it more explicit that clients should use distinct handles for multiple in-flight requests. Likewise, the server must repeat the handle unchanged, in order for the client to track when the server is sending replies out of order. Make it clear that the client does not have to follow any particular order of handles, and can reuse values. Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/doc/proto.md b/doc/proto.md index 3d49162..3f9ee23 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -203,7 +203,13 @@ S: 64 bits, handle S: (*length* bytes of data if the request is of type `NBD_CMD_READ`) Replies need not be sent in the same order as requests (i.e., requests -may be handled by the server asynchronously). +may be handled by the server asynchronously). Clients SHOULD use a +handle that is distinct from all other currently pending transactions, +but MAY reuse handles that are no longer in flight; handles need not +be consecutive. In each reply, the server MUST use the same value for +handle as was sent by the client in the corresponding request. In +this way, the client can correlate which request is receiving a +response. ## Values -- 2.5.5 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 1/3] NBD proto: add "Command flags" section Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 2/3] doc: Mention proper use of handle Eric Blake @ 2016-03-29 23:01 ` Eric Blake 2016-03-29 23:29 ` Eric Blake ` (2 more replies) 2016-03-30 8:09 ` [Qemu-devel] [Nbd] [PATCH v2 0/3] NBD Structured Read Wouter Verhelst 3 siblings, 3 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 23:01 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel, alex The existing transmission phase protocol is difficult to sniff, because correct interpretation of the server stream requires context from the client stream (or risks false positives if data payloads happen to contain the protocol magic numbers). It also prohibits the ability to do efficient sparse reads, or to return a short read where an error is reported without also sending length bytes of (bogus) data. Remedy this by adding a new option request negotiation, which affects the response of the NBD_CMD_READ command, and sets forth rules for how future command responses must behave when they carry a data payload. This proposal does NOT permit structured replies to anything other than NBD_CMD_READ, although a future proposal may wish to make that valid (so that a server could be written that only returns structured replies). Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 260 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 256 insertions(+), 4 deletions(-) diff --git a/doc/proto.md b/doc/proto.md index 3f9ee23..75b1534 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -211,6 +211,14 @@ handle as was sent by the client in the corresponding request. In this way, the client can correlate which request is receiving a response. +By default, there is exactly one reply message for each request +(unless the connection is closed due to an error). Note that it is +impossible to tell by reading just the server traffic whether a data +field will be present. The experimental `Structured Read` extension +adds an additional reply type, documents when there will be multiple +replies to a single request, and creates a context-free server stream; +see below. + ## Values This section describes the value and meaning of constants (other than @@ -255,6 +263,8 @@ immediately after the global flags field in oldstyle negotiation: - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports `NBD_CMD_TRIM` commands +Clients SHOULD ignore unknown flags. + ##### Client flags This field of 32 bits is sent after initial connection and after @@ -338,6 +348,10 @@ of the newstyle negotiation. Defined by the experimental `SELECT` extension; see below. +- `NBD_OPT_STRUCTURED_READ` (8) + + Defined by the experimental `Structured Read` extension; see below. + #### Option reply types These values are used in the "reply type" field, sent by the server @@ -430,6 +444,8 @@ valid may depend on negotiation during the handshake phase. set to 1 if the client requires "Force Unit Access" mode of operation. MUST NOT be set unless export flags included `NBD_FLAG_SEND_FUA`. +- bit 1, `NBD_CMD_FLAG_DF`; defined by the experimental `Structured + Read` extension; see below #### Request types @@ -451,6 +467,10 @@ The following request types exist: signalling no error), the server MUST immediately close the connection; it MUST NOT send any further data to the client. + The experimental `Structured Read` extension changes the set of + valid replies, in part to allow recovery after a partial read and + more efficient reads of sparse files; see below. + * `NBD_CMD_WRITE` (1) A write request. Length and offset define the location and amount of @@ -536,13 +556,16 @@ The following error values are defined: * `ENOMEM` (12), Cannot allocate memory. * `EINVAL` (22), Invalid argument. * `ENOSPC` (28), No space left on device. +* `EOVERFLOW` (75), Value too large. The server SHOULD return `ENOSPC` if it receives a write request including one or more sectors beyond the size of the device. It SHOULD return `EINVAL` if it receives a read or trim request including one or more sectors beyond the size of the device. It also SHOULD map the -`EDQUOT` and `EFBIG` errors to `ENOSPC`. Finally, it SHOULD return -`EPERM` if it receives a write or trim request on a read-only export. +`EDQUOT` and `EFBIG` errors to `ENOSPC`. It SHOULD return `EOVERFLOW` +on a request to send structured read data without fragmentation but +where the length is too large. Finally, it SHOULD return `EPERM` if +it receives a write or trim request on a read-only export. The server SHOULD return `EINVAL` if it receives an unknown command. @@ -579,7 +602,7 @@ To remedy this, a `SELECT` extension is envisioned. This extension adds two option requests and one error reply type, and extends one existing option reply type. -* `NBD_OPT_SELECT` +* `NBD_OPT_SELECT` (6) The client wishes to select the export with the given name for use in the transmission phase, but does not yet want to move to the @@ -613,7 +636,7 @@ option reply type. handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to using `NBD_OPT_EXPORT_NAME`. -* `NBD_OPT_GO` +* `NBD_OPT_GO` (7) The client wishes to terminate the negotiation phase and progress to the transmission phase. Possible replies from the server include: @@ -635,6 +658,235 @@ option reply type. message if they do not also send it as a reply to the `NBD_OPT_SELECT` message. +### `Structured Read` extension + +Some of the major downsides of the default reply to `NBD_CMD_READ` +(without structured replies) are as follows. First, it is not +possible to support partial reads (the command must succeed or fail as +a whole, either len bytes of data must be sent or the connection must +be closed). There is no way to efficiently skip over portions of a +sparse file that are known to contain all zeroes. Finally, it is not +possible to reliably decode the server traffic without also having +context of what pending read requests were sent by the client. + +To remedy this, a `Structured Read` extension is envisioned. This +extension adds a new option request, a new reply type during the +transmission phase, and a new command flag, and alters the set of +valid replies to an existing command. + +* `NBD_OPT_STRUCTURED_READ` (8) + + The client wishes to use structured reads during the transmission + phase. The option request has no additional data. + + The server replies with one of the following: + + - `NBD_REP_ACK`: Structured reads have been negotiated; the server + MUST use structured replies to `NBD_CMD_READ` + - `NBD_REP_UNSUP`: Structured reads are not available; the transmission + phase MUST remain the same as if the client had not attempted + `NBD_OPT_STRUCTURED_READ` + +* Transmission phase + + The transmission phase includes a third message type: the + structured reply, to be used for commands where the response must + include a data payload. The server MUST NOT send this reply type + unless the client has successfully negotiated an extension that + requires the use of a structured reply; this includes the + negotiation of Structured Reads via `NBD_OPT_STRUCTURED_READ`. + + A structured reply looks as follows: + + S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`) + S: 16 bits, flags + S: 16 bits, type + S: 64 bits, handle + S: 32 bits, length of payload (unsigned) + S: *length* bytes of payload data + + The use of *length* in the reply allows context-free division of + the overall server traffic into individual reply messages; the + *type* field describes how to further interpret the payload. + + While the server is permitted to send at most one normal reply (or + else close the connection), a command that uses structured replies + may document that the server is permitted to send mutiple replies, + all sharing the same handle, by using the `NBD_REPLY_FLAG_DONE` + (bit 0) to delineate the final reply. The server MAY interleave + intermediate replies to one structured command with replies + relating to a different handle. + + A server MUST NOT send a data payload in a normal reply if + Structured Reads are negotiated. It is envisioned that all future + extension commands that require a data payload in the response + will require independent option negotiation, and therefore, the + `NBD_CMD_READ` command is the only command that is allowed to use + the data payload of a normal reply, and only when Structured Reads + were not negotiated. However, for ease of implementation, a + server MAY close the connection rather than entering transmission + phase if, at the end of option haggling, the client has negotiated + another command that requires a structured reply but did not also + negotiate Structured Reads. + + * Structured Reply flags + + This field of 16 bits is sent by the server as part of every + structured reply. + + - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if + more structured replies will be sent for the same client + request, and MUST set this bit if this is the final reply. + Commands which are documented as using structured replies, but + not documented as sending multiple replies, MUST always set this + bit. + + The server MUST NOT set any other flags without first negotiating + the extension with the client. Clients that receive an + unrecognized flag SHOULD close the connection. + + * Structured Reply types + + These values are used in the "type" field of a structured reply. + Each type determines how to interpret the "length" bytes of data. + If the client receives an unknown or unexpected type, it SHOULD + close the connection. + + - `NBD_REPLY_TYPE_NONE` (0) + + *length* MUST be 0 (and the payload field omitted). This type + SHOULD be used only as the final reply (that is, when + `NBD_REPLY_FLAG_DONE` is set), and implies that the overall + client request was successfully completed. Valid as a reply to + `NBD_CMD_READ`. + + - `NBD_REPLY_TYPE_OFFSET_DATA` (1) + + *length* MUST be at least 9. The payload is structured as: + + 64 bits: offset (unsigned) + *length - 8* bytes: data + + This reply represents the contents of *length - 8* bytes of the + file, starting at *offset*. The data MUST lie within the + bounds of the original offset and length of the client's + request. Valid as a reply to `NBD_CMD_READ`. + + - `NBD_REPLY_TYPE_OFFSET_HOLE` (2) + + *length* MUST be exactly 12. The payload is structured as: + + 64 bits: offset (unsigned) + 32 bits: hole size (unsigned) + + This reply represents that *hole size* bytes of the file (which + MUST be non-zero), starting at *offset*, read as all zeroes. + The hole MUST lie within the bounds of the original offset and + length of the client's request. Valid as a reply to + `NBD_CMD_READ`. + + - `NBD_REPLY_TYPE_ERROR` (3) + + *length* MUST be exactly 4. The payload is structured as: + + 32 bits: error + + This reply represents that an error occurred, with no further + details as to the offset where the error occurred; and SHOULD be + used only as the final reply (that is, when + `NBD_REPLY_FLAG_DONE` is set). Valid as a reply to + `NBD_CMD_READ`. + + - `NBD_REPLY_TYPE_ERROR_OFFSET` (4) + + *length* MUST be exactly 12. The payload is structured as: + + 32 bits: error + 64 bits: offset (unsigned) + + This reply represents that an error occurred while handling the + given offset. *error* MUST be nonzero, and *offset* must lie + within the bounds of the original offset and length of the + client's request. Valid as a reply to `NBD_CMD_READ`. + +* `NBD_CMD_FLAG_DF` (bit 1) + + Valid during `NBD_CMD_READ`. SHOULD be set to 1 if the client + requires the server to send at most one data chunk in reply. MUST + NOT be set unless the client negotiated Structured Reads with the + server. + +* `NBD_CMD_READ` + + If `NBD_OPT_STRUCTURED_READ` was not negotiated, then a read + request MUST always be answered by a single non-structured + response, as documented above (using magic 0x67446698 + `NBD_REPLY_MAGIC`, and containing length bytes of data according + to the client's request, although those bytes MAY be invalid if an + error is returned, and the connection MUST if an error occurs + after a header claiming no error). + + If `NBD_OPT_STRUCTURED_READ` is negotiated, then a read request + MUST result in one or more structured replies (each using magic + 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), with the following + additional constraints. + + The server MAY split the reply into any number of data chunks, + using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or + `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least + one byte, although to minimize overhead, the server SHOULD use + chunks no smaller than 512 bytes where possible (the first and + last chunk of an unaligned read being the most obvious place for + an exception). The server MUST NOT send chunks that overlap, and + MUST NOT send chunks that describe data outside the offset and + length of the request, but MAY send the chunks in any order (the + client is responsible for reassembling chunks into the correct + order). Note that a request for more than 2^32 - 8 bytes MUST be + split into at least two chunks, so as not to overflow the length + field of a reply while still allowing space for the offset of each + chunk. + + If no error is detected, then the server MUST send enough chunks + to cover the bytes requested. The server MAY set the + `NBD_REPLY_FLAG_DONE` on the final data chunk, to minimize + traffic, but MUST NOT do so if it would still be possible to + detect an error while transmitting the chunk. If the last data + chunk is not the final reply, the server MUST use + `NBD_REPLY_TYPE_NONE` as the final reply to indicate success. + + If an error is detected, the server MUST send padding bytes to + complete the current chunk (if any), MUST report the error with a + reply type of either `NBD_REPLY_TYPE_ERROR` or + `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies + without sending the total number of bytes requested. If one or + more offset errors are reported, the client MAY assume that all + data in chunks not including the offset, and all data within the + affected chunk but prior to the offset, is valid; the client MAY + NOT assume anything about data validity if no offset is provided. + The server MAY send additional chunks or offset error replies, if + `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply + also reports an error (that is, the final reply MUST NOT use + `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier + in constructing the final reply. A server SHOULD NOT mix + `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies + to the same request. + + A client MAY close the connection if it detects that the server + has sent invalid chunks (such as overlapping data, or not enough + data before claiming success). + + In order to avoid the burden of reassembly, the client MAY set the + `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not + fragment the reply. If this flag is set, the server MUST send at + most one `NBD_REPLY_TYPE_OFFSET_DATA` or + `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than + reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`). If + the client's length request is larger than 65,536 bytes (or if a + later extension adds a way to negotiate a larger maximum fragment + size), the server MAY reject the command with `EOVERFLOW`. The + `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag + was not set, or if the requested length is no larger than 65,536. + ## About this file This file tries to document the NBD protocol as it is currently -- 2.5.5 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-29 23:01 ` [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension Eric Blake @ 2016-03-29 23:29 ` Eric Blake 2016-03-30 6:50 ` Alex Bligh 2016-03-30 20:44 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2 siblings, 0 replies; 43+ messages in thread From: Eric Blake @ 2016-03-29 23:29 UTC (permalink / raw) To: nbd-general; +Cc: w, qemu-devel, alex [-- Attachment #1: Type: text/plain, Size: 1211 bytes --] On 03/29/2016 05:01 PM, Eric Blake wrote: > The existing transmission phase protocol is difficult to sniff, > because correct interpretation of the server stream requires > context from the client stream (or risks false positives if > data payloads happen to contain the protocol magic numbers). It > also prohibits the ability to do efficient sparse reads, or to > return a short read where an error is reported without also > sending length bytes of (bogus) data. > > +* `NBD_CMD_READ` > + > + If `NBD_OPT_STRUCTURED_READ` was not negotiated, then a read > + request MUST always be answered by a single non-structured > + response, as documented above (using magic 0x67446698 > + `NBD_REPLY_MAGIC`, and containing length bytes of data according > + to the client's request, although those bytes MAY be invalid if an > + error is returned, and the connection MUST if an error occurs MUST be closed (I hate it when I send patches before 'git commit --amend' on the final unsaved changes in my editor...) > + after a header claiming no error). > + -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-29 23:01 ` [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension Eric Blake 2016-03-29 23:29 ` Eric Blake @ 2016-03-30 6:50 ` Alex Bligh 2016-03-30 17:45 ` Eric Blake 2016-03-30 20:44 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2 siblings, 1 reply; 43+ messages in thread From: Alex Bligh @ 2016-03-30 6:50 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh Eric, There's a lot in common between our two proposals now (unsurprisingly). You've highlighted the differences in the other mail and I'll comment on them there. You may want to steal some of my wording as I think there are bits I've got that you haven't (as well as vice versa). But I'm inclined to use yours as a base unless you particularly like mine. Comments inline below. Alex On 30 Mar 2016, at 00:01, Eric Blake <eblake@redhat.com> wrote: ... > + While the server is permitted to send at most one normal reply (or > + else close the connection), a command that uses structured replies > + may document that the server is permitted to send mutiple replies, > + all sharing the same handle, The thought is fine, but the language is confusing. I think this is a single reply, made up of multiple parts (I called them chunks). You've called them multiple replies, which I think makes things less clear. Also below you've started using my 'chunk' language anyway! > by using the `NBD_REPLY_FLAG_DONE` > + (bit 0) to delineate the final reply. The server MAY interleave > + intermediate replies to one structured command with replies > + relating to a different handle. Neat. The argument against this route is that now there are essentially two ways to end a chain of chunks (with and without a NONE chunk) which is necessary for the reasons you set out. On balance I like it though. > + > + A server MUST NOT send a data payload in a normal reply if > + Structured Reads are negotiated. It is envisioned that all future > + extension commands that require a data payload in the response > + will require independent option negotiation, and therefore, the > + `NBD_CMD_READ` command is the only command that is allowed to use > + the data payload of a normal reply, and only when Structured Reads > + were not negotiated. See other email. > However, for ease of implementation, a > + server MAY close the connection rather than entering transmission > + phase if, at the end of option haggling, the client has negotiated > + another command that requires a structured reply but did not also > + negotiate Structured Reads. That's pretty yucky given a reconnect will achieve the same result and you'll end up in an infinite retry loop. Wouldn't a better route be simply to say that implementing certain commands (server or client sides) requires support of structured replies? > + - `NBD_REPLY_TYPE_NONE` (0) > + > + *length* MUST be 0 (and the payload field omitted). This type > + SHOULD be used only as the final reply (that is, when > + `NBD_REPLY_FLAG_DONE` is set), and implies that the overall > + client request was successfully completed. I think this would be clearer as 'SHOULD NOT be used other than as the final reply'. Because you are also saying (I think) that you need not have it as the final reply - it's just as good in a non-errored reply to have NBD_REPLY_FLAG_DONE set on the last data packet (provided you know it's not going to error before starting to send it). ... > + The server MAY split the reply into any number of data chunks, > + using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or > + `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least > + one byte, although to minimize overhead, the server SHOULD use > + chunks no smaller than 512 bytes where possible (the first and This is a good idea, but rather than 'no smaller than 512 bytes', as it's a 'SHOULD', could we have 'the server SHOULD use chunks each an integer multiple of 512 bytes where possible' (you already have a carve out for the first and last). ... > + If no error is detected, then the server MUST send enough chunks > + to cover the bytes requested. The server MAY set the > + `NBD_REPLY_FLAG_DONE` on the final data chunk, In which case it MUST NOT send any further non-data chunks (e.g. an error chunk or a NONE chunk) > to minimize > + traffic, but MUST NOT do so if it would still be possible to > + detect an error while transmitting the chunk. If the last data > + chunk is not the final reply, the server MUST use > + `NBD_REPLY_TYPE_NONE` as the final reply to indicate success. or an error chunk to indicate an error, and these final chunk MUST have NBD_REPLY_FLAG_DONE set on it. > + If an error is detected, the server MUST send padding bytes to > + complete the current chunk (if any), MUST report the error with a > + reply type of either `NBD_REPLY_TYPE_ERROR` or > + `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies > + without sending the total number of bytes requested. If one or > + more offset errors are reported, the client MAY assume that all > + data in chunks not including the offset, "the offset(s)" > and all data within the > + affected chunk "within each affected chunk" > but prior to the offset, "prior to the relevant offset" > is valid; the client MAY > + NOT assume anything about data validity if no offset is provided. These multiple error chunks are neat. However, I suspect lazy implementors may just send an error without an offset. > + The server MAY send additional chunks or offset error replies, if > + `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply > + also reports an error (that is, the final reply MUST NOT use > + `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier > + in constructing the final reply. I'm not sure I get that bit. Why don't you make an errorred reply simply one that contains one or more error chunks. An errorred reply need not contain all the data requested (though each chunk must be complete). A reply that isn't errorred needs not contain all the data requested. Why do you need anything stronger than that? So if you have a parallelised server which is simply sending several reads in parallel (think Ceph) it sends the result from each thread, possibly followed by an error packet, and some other thread notices when all of these have completed and sends a NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the handle. This seems perfectly natural and no harder for the client to deal with, but you are prohibiting it. > A server SHOULD NOT mix > + `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies > + to the same request. > + > + A client MAY close the connection if it detects that the server > + has sent invalid chunks (such as overlapping data, or not enough > + data before claiming success). > + > + In order to avoid the burden of reassembly, the client MAY set the > + `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not > + fragment the reply. If this flag is set, the server MUST send at > + most one `NBD_REPLY_TYPE_OFFSET_DATA` or > + `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than > + reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`). If "the flag is set and" > + the client's length request is larger than 65,536 bytes (or if a > + later extension adds a way to negotiate a larger maximum fragment > + size), the server MAY reject the command with `EOVERFLOW`. The > + `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag > + was not set, or if the requested length is no larger than 65,536. > + > ## About this file > > This file tries to document the NBD protocol as it is currently > -- > 2.5.5 > > -- Alex Bligh ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-30 6:50 ` Alex Bligh @ 2016-03-30 17:45 ` Eric Blake 2016-03-30 19:51 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2016-03-30 22:48 ` [Qemu-devel] " Alex Bligh 0 siblings, 2 replies; 43+ messages in thread From: Eric Blake @ 2016-03-30 17:45 UTC (permalink / raw) To: Alex Bligh; +Cc: nbd-general, Wouter Verhelst, qemu-devel [-- Attachment #1: Type: text/plain, Size: 11999 bytes --] On 03/30/2016 12:50 AM, Alex Bligh wrote: > Eric, > > There's a lot in common between our two proposals now (unsurprisingly). > You've highlighted the differences in the other mail and I'll > comment on them there. You may want to steal some of my wording as > I think there are bits I've got that you haven't (as well as vice versa). > But I'm inclined to use yours as a base unless you particularly > like mine. > > Comments inline below. > > Alex > > On 30 Mar 2016, at 00:01, Eric Blake <eblake@redhat.com> wrote: > ... > >> + While the server is permitted to send at most one normal reply (or >> + else close the connection), a command that uses structured replies >> + may document that the server is permitted to send mutiple replies, And just now noticing my typo on 'multiple' :) >> + all sharing the same handle, > > The thought is fine, but the language is confusing. I think this is > a single reply, made up of multiple parts (I called them chunks). You've > called them multiple replies, which I think makes things less clear. > Also below you've started using my 'chunk' language anyway! All right, for v3, I will try to go with the wording that every request has a single reply; but the reply will either be a 'simple reply' (single message), or a 'structured reply' (which may occupy multiple messages, where each message is called a 'chunk'). > >> by using the `NBD_REPLY_FLAG_DONE` >> + (bit 0) to delineate the final reply. The server MAY interleave >> + intermediate replies to one structured command with replies >> + relating to a different handle. > > Neat. > > The argument against this route is that now there are essentially > two ways to end a chain of chunks (with and without a NONE chunk) > which is necessary for the reasons you set out. On balance I like it though. Yeah; I didn't see any way around that. Always requiring a NONE chunk is more network overhead if the server can guarantee that the last data chunk is error-free; but at the same time, we shouldn't force servers to guarantee the last data chunk will be error-free. I don't think it is too much of a burden for a client to receive chunks in a loop until the FLAG_DONE bit is set, without regards to what chunk type came last. And for CMD_READ, we still have a nice delineation: if the last chunk is NONE, OFFSET_DATA, or OFFSET_HOLE, the command succeeded; if the last chunk is ERROR or ERROR_OFFSET, the command failed. > >> However, for ease of implementation, a >> + server MAY close the connection rather than entering transmission >> + phase if, at the end of option haggling, the client has negotiated >> + another command that requires a structured reply but did not also >> + negotiate Structured Reads. > > That's pretty yucky given a reconnect will achieve the same result > and you'll end up in an infinite retry loop. > > Wouldn't a better route be simply to say that implementing certain > commands (server or client sides) requires support of structured > replies? I think we're in agreement that the only command that (for back-compat reasons) can ever send data on a simple reply is CMD_READ. Therefore, if you negotiate any other command that can send data, that command will use a structured reply; the mere fact that you negotiated the command means that both client and server know about structured replies. I guess what I was trying to get at is that if you are using any structured replies, then it is a disservice to wire-sniffers if you did not also enable structured replies for CMD_READ. Technically, it would be feasible to use simple replies for reads while using structured replies for the other negotiated commands, but practically, a client that does that seems undesirable, which is why I said that a server could disconnect rather than talking to such a client. So taking your sentence at face value, yes there is another solution possible: require that any NBD_OPT_* haggling used to negotiate the use of any other command with a structured reply MUST fail if the client has not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection open so the client can continue option haggling. That way, the only way to end option haggling with the new command enabled is to also enable structured reads. The burden is then on the client to haggle in the correct order, and on the server to track haggling state when deciding how to answer the option requests for the new commands. I'm a bit reluctant to put ordering requirements on option haggling (that option A must be turned on before option B), but then again, the SELECT extension requires NBD_OPT_SELECT to be haggled prior to NBD_OPT_GO, so there's precedent. I also am thinking of proposing an extension for option haggling to communicate minimum/preferred alignments and maximum don't-fragment sizing, and that option would have to be enabled before OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the NBD_REP_SERVER layout in response to those option requests), which would be another case where option A affects the behavior of option B. > >> + - `NBD_REPLY_TYPE_NONE` (0) >> + >> + *length* MUST be 0 (and the payload field omitted). This type >> + SHOULD be used only as the final reply (that is, when >> + `NBD_REPLY_FLAG_DONE` is set), and implies that the overall >> + client request was successfully completed. > > I think this would be clearer as 'SHOULD NOT be used other than as the > final reply'. Because you are also saying (I think) that you need not > have it as the final reply - it's just as good in a non-errored > reply to have NBD_REPLY_FLAG_DONE set on the last data packet (provided > you know it's not going to error before starting to send it). Yes, that sounds slightly better. (Technically, there's no reason that it can't be used as an intermediate chunk, but there it is only a no-op filler that wastes bandwidth - hence the SHOULD and not MUST). > > ... > >> + The server MAY split the reply into any number of data chunks, >> + using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or >> + `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least >> + one byte, although to minimize overhead, the server SHOULD use >> + chunks no smaller than 512 bytes where possible (the first and > > This is a good idea, but rather than 'no smaller than 512 bytes', as > it's a 'SHOULD', could we have 'the server SHOULD use chunks each > an integer multiple of 512 bytes where possible' (you already have > a carve out for the first and last). > > ... > >> + If no error is detected, then the server MUST send enough chunks >> + to cover the bytes requested. The server MAY set the >> + `NBD_REPLY_FLAG_DONE` on the final data chunk, > > In which case it MUST NOT send any further non-data chunks > (e.g. an error chunk or a NONE chunk) Well, yeah - once the FLAG_DONE is sent, no further chunks of any type are allowed; the reply is complete. > >> to minimize >> + traffic, but MUST NOT do so if it would still be possible to >> + detect an error while transmitting the chunk. If the last data >> + chunk is not the final reply, the server MUST use >> + `NBD_REPLY_TYPE_NONE` as the final reply to indicate success. > > or an error chunk to indicate an error, and these final chunk MUST have > NBD_REPLY_FLAG_DONE set on it. > >> + If an error is detected, the server MUST send padding bytes to >> + complete the current chunk (if any), MUST report the error with a >> + reply type of either `NBD_REPLY_TYPE_ERROR` or >> + `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies >> + without sending the total number of bytes requested. If one or >> + more offset errors are reported, the client MAY assume that all >> + data in chunks not including the offset, > > "the offset(s)" > >> and all data within the >> + affected chunk > > "within each affected chunk" > >> but prior to the offset, > > "prior to the relevant offset" > >> is valid; the client MAY >> + NOT assume anything about data validity if no offset is provided. > Yes, your tweaks help the flow. > These multiple error chunks are neat. However, I suspect lazy implementors > may just send an error without an offset. Any ideas are appreciated on how to word it to suggest that servers SHOULD try to give full details; but I think we want the fallback to the no-offset case because some situations will not have an offset. > >> + The server MAY send additional chunks or offset error replies, if >> + `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply >> + also reports an error (that is, the final reply MUST NOT use >> + `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier >> + in constructing the final reply. > > I'm not sure I get that bit. Why don't you make an errorred reply simply > one that contains one or more error chunks. An errorred reply need not contain > all the data requested (though each chunk must be complete). A reply that > isn't errorred needs not contain all the data requested. Why do you > need anything stronger than that? So if you have a parallelised server which > is simply sending several reads in parallel (think Ceph) it sends the > result from each thread, possibly followed by an error packet, and some > other thread notices when all of these have completed and sends a > NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the > handle. This seems perfectly natural and no harder for the client to deal > with, but you are prohibiting it. I was thinking that it's easier on the client if the final chunk always serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE, NONE) or error (ERROR, ERROR_OFFSET). But if you think that always allowing a concluding NONE, even on an errored reply due to an earlier chunk reporting the error, is reasonable, I can relax things along those lines. Or maybe we can state that the concluding chunk on an errored reply may be ERROR_OFFSET with 'error' and 'offset' fields set to 0, rather than forcing the server to replay one of the earlier offsets. > >> A server SHOULD NOT mix >> + `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies >> + to the same request. >> + >> + A client MAY close the connection if it detects that the server >> + has sent invalid chunks (such as overlapping data, or not enough >> + data before claiming success). >> + >> + In order to avoid the burden of reassembly, the client MAY set the >> + `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not >> + fragment the reply. If this flag is set, the server MUST send at >> + most one `NBD_REPLY_TYPE_OFFSET_DATA` or >> + `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than >> + reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`). If > > "the flag is set and" > >> + the client's length request is larger than 65,536 bytes (or if a >> + later extension adds a way to negotiate a larger maximum fragment >> + size), the server MAY reject the command with `EOVERFLOW`. The >> + `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag >> + was not set, or if the requested length is no larger than 65,536. I'm also wondering if the wording should be switched along the lines of: If the flag is set and the server deems the request to be too large, the server MAY reject the command with `EOVERFLOW`; however, the server MUST NOT reject a request that is no larger than 65,536 bytes in this manner. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-30 17:45 ` Eric Blake @ 2016-03-30 19:51 ` Wouter Verhelst 2016-03-30 20:54 ` Eric Blake 2016-03-30 22:48 ` [Qemu-devel] " Alex Bligh 1 sibling, 1 reply; 43+ messages in thread From: Wouter Verhelst @ 2016-03-30 19:51 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel, Alex Bligh Hi all, (side note: this seems to be mostly an NBD discussion at this point. Does qemu-devel need to be in the loop before we've finished that? I don't care either way, but then I don't want to bore or annoy people...) On Wed, Mar 30, 2016 at 11:45:04AM -0600, Eric Blake wrote: > On 03/30/2016 12:50 AM, Alex Bligh wrote: [...] > So taking your sentence at face value, yes there is another solution > possible: require that any NBD_OPT_* haggling used to negotiate the use > of any other command with a structured reply MUST fail if the client has > not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection > open so the client can continue option haggling. That way, the only way > to end option haggling with the new command enabled is to also enable > structured reads. The burden is then on the client to haggle in the > correct order, and on the server to track haggling state when deciding > how to answer the option requests for the new commands. > > I'm a bit reluctant to put ordering requirements on option haggling > (that option A must be turned on before option B), Yes, me too. > but then again, the > SELECT extension requires NBD_OPT_SELECT to be haggled prior to > NBD_OPT_GO, so there's precedent. Yeah, but then, that's only because GO is meant to transition from negotiation to transmission, so it needs to be after *any* other negotiation; anything else would defeat its purpose. Requiring structured read after negotiating other structured commands could easily be done by saying that it's an error to leave the negotiation phase with "some other" structured command negotiated, but not structured read. On the other hand, I feel compelled to point out that we only find ourselves in this hole because we've coupled the structured reply with the read command. That may have looked like a good idea at the time, but obviously it isn't. If we just have it be "negotiate the structured reply format" rather than "the structured read reply", and state that once negotiated, the structured reply format is required for any command with a payload, we're done. Since only the read reply has a payload at this point in time, you don't really need to special-case it, anyway. > I also am thinking of proposing an extension for option haggling to > communicate minimum/preferred alignments and maximum don't-fragment > sizing, and that option would have to be enabled before > OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the > NBD_REP_SERVER layout in response to those option requests), which > would be another case where option A affects the behavior of option B. I reused the NBD_REP_SERVER command in reply to the NBD_OPT_SELECT command since its purpose seems fairly similar ("send metadata about an export to the client"), but that may have been a mistake. It certainly wasn't meant that if you say NBD_OPT_SELECT first and then go NBD_OPT_LIST, that the NBD_REP_SERVER reply to NBD_OPT_LIST should be the one as specified in the SELECT extension. [...] > > These multiple error chunks are neat. However, I suspect lazy implementors > > may just send an error without an offset. > > Any ideas are appreciated on how to word it to suggest that servers > SHOULD try to give full details; but I think we want the fallback to the > no-offset case because some situations will not have an offset. I'm wary of making the spec too complicated. Adding such language might make it unclear. Since as you say it can't be a hard-and-fast rule, I'd prefer that we just trust implementor's judgement on this. Not sending an offset (even if it would be possible) can certainly be the better choice in some cases -- a protocol description can never know all the trade-offs and choices a particular implementor may want or need to make. > >> + The server MAY send additional chunks or offset error replies, if > >> + `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply > >> + also reports an error (that is, the final reply MUST NOT use > >> + `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier > >> + in constructing the final reply. > > > > I'm not sure I get that bit. Why don't you make an errorred reply simply > > one that contains one or more error chunks. An errorred reply need not contain > > all the data requested (though each chunk must be complete). A reply that > > isn't errorred needs not contain all the data requested. Why do you > > need anything stronger than that? So if you have a parallelised server which > > is simply sending several reads in parallel (think Ceph) it sends the > > result from each thread, possibly followed by an error packet, and some > > other thread notices when all of these have completed and sends a > > NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the > > handle. This seems perfectly natural and no harder for the client to deal > > with, but you are prohibiting it. > > I was thinking that it's easier on the client if the final chunk always > serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE, > NONE) or error (ERROR, ERROR_OFFSET). The client will already need to keep state to reassemble a fragmented and out-of-order read reply, anyway. If that's already the case, adding in the requirement to *also* keep track of error state (which could in the simplest case be a single bit for a client which doesn't care about offsets or error count) isn't that much more of an issue. I'm with Alex on this one. [...] > >> + the client's length request is larger than 65,536 bytes (or if a > >> + later extension adds a way to negotiate a larger maximum fragment > >> + size), the server MAY reject the command with `EOVERFLOW`. The > >> + `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag > >> + was not set, or if the requested length is no larger than 65,536. > > I'm also wondering if the wording should be switched along the lines of: > > If the flag is set and the server deems the request to be too large, the > server MAY reject the command with `EOVERFLOW`; however, the server MUST > NOT reject a request that is no larger than 65,536 bytes in this manner. Yes, that seems better. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-30 19:51 ` [Qemu-devel] [Nbd] " Wouter Verhelst @ 2016-03-30 20:54 ` Eric Blake 2016-03-30 21:26 ` Wouter Verhelst 0 siblings, 1 reply; 43+ messages in thread From: Eric Blake @ 2016-03-30 20:54 UTC (permalink / raw) To: Wouter Verhelst; +Cc: nbd-general, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 6979 bytes --] On 03/30/2016 01:51 PM, Wouter Verhelst wrote: > (side note: this seems to be mostly an NBD discussion at this point. > Does qemu-devel need to be in the loop before we've finished that? I > don't care either way, but then I don't want to bore or annoy people...) Well, it stemmed out of qemu's desires to implement more efficient ways to both push and pull sparse files across NBD; and qemu will ultimately want to implement what we discuss. But I'm okay doing most of the churn off of the qemu list, and only adding qemu back in the loop when it is actually time to implement the final design, unless someone else speaks up asking to remain in on the conversation. >> I'm a bit reluctant to put ordering requirements on option haggling >> (that option A must be turned on before option B), > > Yes, me too. > >> but then again, the >> SELECT extension requires NBD_OPT_SELECT to be haggled prior to >> NBD_OPT_GO, so there's precedent. > > Yeah, but then, that's only because GO is meant to transition from > negotiation to transmission, so it needs to be after *any* other > negotiation; anything else would defeat its purpose. > > Requiring structured read after negotiating other structured commands > could easily be done by saying that it's an error to leave the > negotiation phase with "some other" structured command negotiated, but > not structured read. > > On the other hand, I feel compelled to point out that we only find > ourselves in this hole because we've coupled the structured reply with > the read command. That may have looked like a good idea at the time, but > obviously it isn't. If we just have it be "negotiate the structured > reply format" rather than "the structured read reply", and state that > once negotiated, the structured reply format is required for any command > with a payload, we're done. Well, I'm worried about the opposite - if the client does not negotiate structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has a variable-length payload to reply), then the server has three choices: 0) refuse the client (we already said this is a bit undesirable, as it may lead to the client retrying in an infloop - having a way to return an error message is better than dropping the connection); 1) the server must find a way to shoehorn the same data that would be sent in a structured reply to fit in the old-style unstructured reply (one nice thing about the structured reply is that we get a length for free; that length would have to be shoehorned into the old-style reply, different from CMD_READ where a length was implied from the request), 2) the server must fail the message Actually, having typed that, maybe choice 2 is not all that bad. It's fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS with EINVAL if structured replies were not negotiated, and to document that a client that negotiates GET_LBA_STATUS MUST also negotiate structured replies for the command to be useful. For that matter, we just documented that servers SHOULD use EINVAL for an unrecognized (or out-of-context) command. The client can enable the two options in either order. And we'd have the following table: enabled enabled structured GET_LBA result of: replies read GET_LBA write ---------------------------------------------------------------- no no old reply EINVAL old reply yes no new reply EINVAL [*] no yes old reply EINVAL old reply yes yes new reply new reply [*] [*] Here, we're still debating whether it makes sense to allow/require a server to send new replies everywhere (and clients must handle both styles if they negotiate structured replies), or to require a server to send old replies (so that read is the only command where clients have to handle two styles, but where the results of negotiating pinpoint which style). > > Since only the read reply has a payload at this point in time, you don't > really need to special-case it, anyway. Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was specific to STRUCTURED_READ; it sounds like we are leaning back towards STRUCTURED_REPLY and just a caveat that any new command that sends payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated. I guess that also makes it easier to argue for a server that uses a structured reply for ALL messages (REPLY_TYPE_NONE for success changes 16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes into 24). > >> I also am thinking of proposing an extension for option haggling to >> communicate minimum/preferred alignments and maximum don't-fragment >> sizing, and that option would have to be enabled before >> OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the >> NBD_REP_SERVER layout in response to those option requests), which >> would be another case where option A affects the behavior of option B. > > I reused the NBD_REP_SERVER command in reply to the NBD_OPT_SELECT > command since its purpose seems fairly similar ("send metadata about an > export to the client"), but that may have been a mistake. It certainly > wasn't meant that if you say NBD_OPT_SELECT first and then go > NBD_OPT_LIST, that the NBD_REP_SERVER reply to NBD_OPT_LIST should be > the one as specified in the SELECT extension. Ah, well, then it's evidence that improved wording will help. >> I was thinking that it's easier on the client if the final chunk always >> serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE, >> NONE) or error (ERROR, ERROR_OFFSET). > > The client will already need to keep state to reassemble a fragmented > and out-of-order read reply, anyway. If that's already the case, adding > in the requirement to *also* keep track of error state (which could in > the simplest case be a single bit for a client which doesn't care about > offsets or error count) isn't that much more of an issue. For a client that is copying NBD_CMD_READ into a local file, dumping directly via pwrite() as each chunk comes in doesn't require the client track any state (the client can just assume that by the end of the command, all the bytes will have been covered); while a client using pwritev() will have to construct an iovec that visits the chunks in the correct order (not necessarily in the order received). Clients that really don't want to do much work have the DF flag to forbid fragmentation. But I think you've swayed me - I will make sure v3 allows an error at any point in the chain of chunks, and that the wording on the final TYPE_NONE chunk does NOT imply success unless no earlier error chunks were sent. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-30 20:54 ` Eric Blake @ 2016-03-30 21:26 ` Wouter Verhelst 0 siblings, 0 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-30 21:26 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel On Wed, Mar 30, 2016 at 02:54:41PM -0600, Eric Blake wrote: > On 03/30/2016 01:51 PM, Wouter Verhelst wrote: > > > (side note: this seems to be mostly an NBD discussion at this point. > > Does qemu-devel need to be in the loop before we've finished that? I > > don't care either way, but then I don't want to bore or annoy people...) > > Well, it stemmed out of qemu's desires to implement more efficient ways > to both push and pull sparse files across NBD; and qemu will ultimately > want to implement what we discuss. But I'm okay doing most of the churn > off of the qemu list, and only adding qemu back in the loop when it is > actually time to implement the final design, unless someone else speaks > up asking to remain in on the conversation. Sure. I was just asking the question... [...] > Well, I'm worried about the opposite - if the client does not negotiate > structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has > a variable-length payload to reply), then the server has three choices: > 0) refuse the client (we already said this is a bit undesirable, as it > may lead to the client retrying in an infloop - having a way to return > an error message is better than dropping the connection); 1) the server > must find a way to shoehorn the same data that would be sent in a > structured reply to fit in the old-style unstructured reply (one nice > thing about the structured reply is that we get a length for free; that > length would have to be shoehorned into the old-style reply, different > from CMD_READ where a length was implied from the request), 2) the > server must fail the message > > Actually, having typed that, maybe choice 2 is not all that bad. It isn't. > It's fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS > with EINVAL if structured replies were not negotiated, and to document > that a client that negotiates GET_LBA_STATUS MUST also negotiate > structured replies for the command to be useful. For that matter, we > just documented that servers SHOULD use EINVAL for an unrecognized (or > out-of-context) command. The client can enable the two options in > either order. And we'd have the following table: > > enabled enabled > structured GET_LBA result of: > replies read GET_LBA write > ---------------------------------------------------------------- > no no old reply EINVAL old reply > yes no new reply EINVAL [*] > no yes old reply EINVAL old reply > yes yes new reply new reply [*] Right. It would also be reasonable to say that if you don't negotiate an option but do end up using it, you end up squarely in undefined behaviour-land. The server could send EINVAL, it could honour your request in ways that you may not expect (including structured replies when you didn't ask for them), or it could cause nasal demons. > [*] Here, we're still debating whether it makes sense to allow/require a > server to send new replies everywhere (and clients must handle both > styles if they negotiate structured replies), or to require a server to > send old replies (so that read is the only command where clients have to > handle two styles, but where the results of negotiating pinpoint which > style). I'm still in favour of having it be "old reply" for the whole of that "write" column. I'm just not convinced there's a downside to that, while there is an upside. > > Since only the read reply has a payload at this point in time, you don't > > really need to special-case it, anyway. > > Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was > specific to STRUCTURED_READ; it sounds like we are leaning back towards > STRUCTURED_REPLY and just a caveat that any new command that sends > payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated. You could formulate it that way. Alternatively, you could formulate it so that any command that sends payload may fail if STRUCTURED_REPLY was not also negotiated, with caveat that there is this backwards compatibility thing for READ. (i.e., make READ be the exception rather than the rule) > I guess that also makes it easier to argue for a server that uses a > structured reply for ALL messages (REPLY_TYPE_NONE for success changes > 16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes > into 24). Perhaps. [...] > > The client will already need to keep state to reassemble a fragmented > > and out-of-order read reply, anyway. If that's already the case, adding > > in the requirement to *also* keep track of error state (which could in > > the simplest case be a single bit for a client which doesn't care about > > offsets or error count) isn't that much more of an issue. > > For a client that is copying NBD_CMD_READ into a local file, dumping > directly via pwrite() as each chunk comes in doesn't require the client > track any state (the client can just assume that by the end of the > command, all the bytes will have been covered); while a client using > pwritev() will have to construct an iovec that visits the chunks in the > correct order (not necessarily in the order received). Ah yes, good point. > Clients that really don't want to do much work have the DF flag to > forbid fragmentation. But I think you've swayed me - I will make sure > v3 allows an error at any point in the chain of chunks, and that the > wording on the final TYPE_NONE chunk does NOT imply success unless no > earlier error chunks were sent. Okay, thanks. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-30 17:45 ` Eric Blake 2016-03-30 19:51 ` [Qemu-devel] [Nbd] " Wouter Verhelst @ 2016-03-30 22:48 ` Alex Bligh 1 sibling, 0 replies; 43+ messages in thread From: Alex Bligh @ 2016-03-30 22:48 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, Wouter Verhelst, qemu-devel, Alex Bligh [-- Attachment #1: Type: text/plain, Size: 5831 bytes --] Eric, >>> However, for ease of implementation, a >>> + server MAY close the connection rather than entering transmission >>> + phase if, at the end of option haggling, the client has negotiated >>> + another command that requires a structured reply but did not also >>> + negotiate Structured Reads. >> >> That's pretty yucky given a reconnect will achieve the same result >> and you'll end up in an infinite retry loop. >> >> Wouldn't a better route be simply to say that implementing certain >> commands (server or client sides) requires support of structured >> replies? > > I think we're in agreement that the only command that (for back-compat > reasons) can ever send data on a simple reply is CMD_READ. Therefore, > if you negotiate any other command that can send data, that command will > use a structured reply; the mere fact that you negotiated the command > means that both client and server know about structured replies. > > I guess what I was trying to get at is that if you are using any > structured replies, then it is a disservice to wire-sniffers if you did > not also enable structured replies for CMD_READ. Agree > Technically, it would > be feasible to use simple replies for reads while using structured > replies for the other negotiated commands, Agree > but practically, a client > that does that seems undesirable, which is why I said that a server > could disconnect rather than talking to such a client. I would prefer to make that a protocol breach, i.e. the client MUST NOT do that. > So taking your sentence at face value, yes there is another solution > possible: require that any NBD_OPT_* haggling used to negotiate the use > of any other command with a structured reply MUST fail if the client has > not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection > open so the client can continue option haggling. That way, the only way > to end option haggling with the new command enabled is to also enable > structured reads. The burden is then on the client to haggle in the > correct order, and on the server to track haggling state when deciding > how to answer the option requests for the new commands. I think I'm saying something simpler (unless I'm missing something) which is: The client MUST NOT propose NBD_OPT_FOO unless it has previously proposed and the server accepted NBD_OPT_BAR. In this case FOO is e.g. the thing to find holes, BAR is NBD_OPT_STRUCTURED_READ. So it's not a question of leaving the connection open or closing it, it's simply that the client can't propose X unless it's already negotiated Y. If it does, all bets are off, so of course it can close the connection. But this makes it explicit that the client if proposing X should first have successfully negotiated Y. > I'm a bit reluctant to put ordering requirements on option haggling > (that option A must be turned on before option B), but then again, the > SELECT extension requires NBD_OPT_SELECT to be haggled prior to > NBD_OPT_GO, so there's precedent. Resisting such a change would resist the possibility in the future that option X requires option Y. Even if you can get around that this time, I'll bet my bottom dollar it will come up again. > I also am thinking of proposing an > extension for option haggling to communicate minimum/preferred > alignments and maximum don't-fragment sizing, and that option would have > to be enabled before OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it > would change the NBD_REP_SERVER layout in response to those option > requests), which would be another case where option A affects the > behavior of option B. ... and as if by magic! Yeah I was going to suggest a similar option, including a blocksize alignment, a maximum size for a read/write, and a maximum DF size. I'm busy writing an 'example' nbd server in golang and this is the first thing I ran into. The purpose of this BTW is to serve as an example implementation for structured replies. >> These multiple error chunks are neat. However, I suspect lazy implementors >> may just send an error without an offset. > > Any ideas are appreciated on how to word it to suggest that servers > SHOULD try to give full details; but I think we want the fallback to the > no-offset case because some situations will not have an offset. Indeed. Perhaps we can promise them Oreos. > I was thinking that it's easier on the client if the final chunk always > serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE, > NONE) or error (ERROR, ERROR_OFFSET). But if you think that always > allowing a concluding NONE, even on an errored reply due to an earlier > chunk reporting the error, is reasonable, I can relax things along those > lines. Or maybe we can state that the concluding chunk on an errored > reply may be ERROR_OFFSET with 'error' and 'offset' fields set to 0, > rather than forcing the server to replay one of the earlier offsets. I think I'd prefer just that the last chunk has the relevant bit set, and comes before (or is) the error chunk(s). That's lots easier for the server and no more difficult for the client. >>> + the client's length request is larger than 65,536 bytes (or if a >>> + later extension adds a way to negotiate a larger maximum fragment >>> + size), the server MAY reject the command with `EOVERFLOW`. The >>> + `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag >>> + was not set, or if the requested length is no larger than 65,536. > > I'm also wondering if the wording should be switched along the lines of: > > If the flag is set and the server deems the request to be too large, the > server MAY reject the command with `EOVERFLOW`; however, the server MUST > NOT reject a request that is no larger than 65,536 bytes in this manner. That's clearer. -- Alex Bligh [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 842 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension 2016-03-29 23:01 ` [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension Eric Blake 2016-03-29 23:29 ` Eric Blake 2016-03-30 6:50 ` Alex Bligh @ 2016-03-30 20:44 ` Wouter Verhelst 2 siblings, 0 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-30 20:44 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel So, On Tue, Mar 29, 2016 at 05:01:00PM -0600, Eric Blake wrote: [...] > +- `NBD_OPT_STRUCTURED_READ` (8) > + > + Defined by the experimental `Structured Read` extension; see below. (detail: that makes the "structured read" extension be typographically different from everything else. Either make it all caps, or not monocase.) [...] > A write request. Length and offset define the location and amount of > @@ -536,13 +556,16 @@ The following error values are defined: > * `ENOMEM` (12), Cannot allocate memory. > * `EINVAL` (22), Invalid argument. > * `ENOSPC` (28), No space left on device. > +* `EOVERFLOW` (75), Value too large. > > The server SHOULD return `ENOSPC` if it receives a write request > including one or more sectors beyond the size of the device. It SHOULD > return `EINVAL` if it receives a read or trim request including one or > more sectors beyond the size of the device. It also SHOULD map the > -`EDQUOT` and `EFBIG` errors to `ENOSPC`. Finally, it SHOULD return > -`EPERM` if it receives a write or trim request on a read-only export. > +`EDQUOT` and `EFBIG` errors to `ENOSPC`. It SHOULD return `EOVERFLOW` > +on a request to send structured read data without fragmentation but > +where the length is too large. Finally, it SHOULD return `EPERM` if > +it receives a write or trim request on a read-only export. I'd like some more explicit language here that makes it clear EOVERFLOW can *only* be used on structured replies. We reduced the set of valid error numbers a while back for good reason; it would be bad if we accidentally increase it for existing replies now. > The server SHOULD return `EINVAL` if it receives an unknown command. > > @@ -579,7 +602,7 @@ To remedy this, a `SELECT` extension is envisioned. This extension adds > two option requests and one error reply type, and extends one existing > option reply type. > > -* `NBD_OPT_SELECT` > +* `NBD_OPT_SELECT` (6) NAK. The spec is currently consistent in associating numbers to constants in only *one* place. This is no accident, and I'd like to keep it that way. (at least I think it is; if not, that's a bug) > The client wishes to select the export with the given name for use > in the transmission phase, but does not yet want to move to the > @@ -613,7 +636,7 @@ option reply type. > handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to > using `NBD_OPT_EXPORT_NAME`. > > -* `NBD_OPT_GO` > +* `NBD_OPT_GO` (7) same. > The client wishes to terminate the negotiation phase and progress to > the transmission phase. Possible replies from the server include: > @@ -635,6 +658,235 @@ option reply type. > message if they do not also send it as a reply to the > `NBD_OPT_SELECT` message. > > +### `Structured Read` extension > + > +Some of the major downsides of the default reply to `NBD_CMD_READ` > +(without structured replies) are as follows. First, it is not > +possible to support partial reads (the command must succeed or fail as > +a whole, either len bytes of data must be sent or the connection must > +be closed). There is no way to efficiently skip over portions of a > +sparse file that are known to contain all zeroes. Finally, it is not > +possible to reliably decode the server traffic without also having > +context of what pending read requests were sent by the client. > + > +To remedy this, a `Structured Read` extension is envisioned. This > +extension adds a new option request, a new reply type during the > +transmission phase, and a new command flag, and alters the set of > +valid replies to an existing command. > + > +* `NBD_OPT_STRUCTURED_READ` (8) same > + The client wishes to use structured reads during the transmission > + phase. The option request has no additional data. > + > + The server replies with one of the following: > + > + - `NBD_REP_ACK`: Structured reads have been negotiated; the server > + MUST use structured replies to `NBD_CMD_READ` > + - `NBD_REP_UNSUP`: Structured reads are not available; the transmission ^ ERR_ (however, see below ;-) > + phase MUST remain the same as if the client had not attempted > + `NBD_OPT_STRUCTURED_READ` This makes it seem as though NBD_REP_ERR_UNSUP has a different meaning here than it usually does. I think the spec should just say that the server should reply with NBD_REP_ACK, and then mention that for backwards compatibility the client should be prepared to handle NBD_REP_UNSUP too (as is done elsewhere). That is, if the server implements structured replies, it should allow it (it makes no sense for the server to disallow structured reads if it knows about them) [...] > + A server MUST NOT send a data payload in a normal reply if > + Structured Reads are negotiated. It is envisioned that all future > + extension commands that require a data payload in the response > + will require independent option negotiation, and therefore, the > + `NBD_CMD_READ` command is the only command that is allowed to use > + the data payload of a normal reply, and only when Structured Reads > + were not negotiated. However, for ease of implementation, a > + server MAY close the connection rather than entering transmission > + phase if, at the end of option haggling, the client has negotiated > + another command that requires a structured reply but did not also > + negotiate Structured Reads. (see my comments on this downthread) [...] > +* `NBD_CMD_FLAG_DF` (bit 1) > + > + Valid during `NBD_CMD_READ`. SHOULD be set to 1 if the client > + requires the server to send at most one data chunk in reply. MUST > + NOT be set unless the client negotiated Structured Reads with the > + server. I realize I'm flip-flopping on whether or not to use a flag bit for this, but bear with me on this one for a moment. There is an ioctl NBD_SET_FLAGS which just expects the per-export flags to be passed to the kernel. By reusing that, there is no need for an extra kernel call to specify the options the kernel-side client can use. That still leaves the ability for userspace to detect whether the client can interpret structured replies, but this could easily be signalled using a sysfs flag (or similar). If the client can do something along the lines of: check sysfs (or whatever) send NBD_OPT_STRUCTURED_READ to server get flags from server call NBD_SET_FLAGS ioctl and that signals *everything* to both the kernel and the server, then we don't need any extra uapi calls to be defined. This is probably a good thing. However, in order for that to be possible, the per-export flags field needs to have a bit set to signal the server's understanding of, and client's permission to use, NBD_CMD_FLAG_DF. As such, I think we need an extra flag in the per-export flags field of the protocol, even though we've already negotiated structured reads and I expressed the preference that this shouldn't be decoupled. Yes, that's slightly ugly. Thinking about this, I suppose it makes sense to rename the "global" flags field as the "negotiation flags field" which signals incompatible changes in the negotiation phase, and the "per-export" flags field as the "transmission flags field" which signals features the client can use during transmission, or something along those lines. Thoughts? [...] no further comments (other than what's already been said) -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] [Nbd] [PATCH v2 0/3] NBD Structured Read 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake ` (2 preceding siblings ...) 2016-03-29 23:01 ` [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension Eric Blake @ 2016-03-30 8:09 ` Wouter Verhelst 3 siblings, 0 replies; 43+ messages in thread From: Wouter Verhelst @ 2016-03-30 8:09 UTC (permalink / raw) To: Eric Blake; +Cc: nbd-general, qemu-devel Hi Eric, On Tue, Mar 29, 2016 at 05:00:57PM -0600, Eric Blake wrote: > I wrote this in parallel with Alex's strawman proposals, so I > may have picked up on some of his ideas, while diverging in > other places. > > Changes since v1: rebase, resend some pre-req patches, switch > from global/client flag negotiation over to option negotiation, > document a flags/type scheme in all structured replies, use > ONLY structured replies in response to a structured read, make > the server stream fully context-free (thanks to the type scheme), > go into more details about error reporting by using two different > structured errors (multiple errors each with offset, or single > error with no offset). > > Eric Blake (2): > doc: Mention proper use of handle > doc: Propose Structured Read extension > > Pavel Borzenkov (1): > NBD proto: add "Command flags" section Thanks. 1 and 2 applied, 3 not yet (until resolution of discussion) (not yet pushed yet, will do so later) -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12 ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2016-03-30 22:48 UTC | newest] Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-03-28 13:59 [Qemu-devel] [PATCH] doc: Mention proper use of handle Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 2/1] doc: More details on flag negotiation Eric Blake 2016-03-29 3:56 ` [Qemu-devel] [PATCH 3/1] doc: Propose Structured Replies extension Eric Blake 2016-03-29 7:33 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2016-03-29 8:24 ` Alex Bligh 2016-03-29 14:21 ` Eric Blake 2016-03-29 14:37 ` Alex Bligh 2016-03-29 15:12 ` Eric Blake 2016-03-29 16:37 ` Wouter Verhelst 2016-03-29 17:34 ` Alex Bligh 2016-03-29 17:45 ` Eric Blake 2016-03-29 18:03 ` Wouter Verhelst 2016-03-29 18:07 ` Eric Blake 2016-03-29 18:19 ` Wouter Verhelst 2016-03-29 18:25 ` Eric Blake 2016-03-29 18:09 ` Alex Bligh 2016-03-29 17:53 ` Wouter Verhelst 2016-03-29 18:23 ` Eric Blake 2016-03-29 18:51 ` Wouter Verhelst 2016-03-29 19:06 ` Wouter Verhelst 2016-03-29 19:39 ` Alex Bligh 2016-03-29 20:00 ` Eric Blake 2016-03-29 20:18 ` Alex Bligh 2016-03-29 20:44 ` Alex Bligh 2016-03-29 21:05 ` Wouter Verhelst 2016-03-29 22:05 ` Alex Bligh 2016-03-29 22:45 ` Wouter Verhelst 2016-03-29 22:53 ` Alex Bligh 2016-03-29 7:11 ` [Qemu-devel] [PATCH] doc: Mention proper use of handle Wouter Verhelst 2016-03-29 13:59 ` Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 0/3] NBD Structured Read Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 1/3] NBD proto: add "Command flags" section Eric Blake 2016-03-29 23:00 ` [Qemu-devel] [PATCH v2 2/3] doc: Mention proper use of handle Eric Blake 2016-03-29 23:01 ` [Qemu-devel] [PATCH v2 3/3] doc: Propose Structured Read extension Eric Blake 2016-03-29 23:29 ` Eric Blake 2016-03-30 6:50 ` Alex Bligh 2016-03-30 17:45 ` Eric Blake 2016-03-30 19:51 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2016-03-30 20:54 ` Eric Blake 2016-03-30 21:26 ` Wouter Verhelst 2016-03-30 22:48 ` [Qemu-devel] " Alex Bligh 2016-03-30 20:44 ` [Qemu-devel] [Nbd] " Wouter Verhelst 2016-03-30 8:09 ` [Qemu-devel] [Nbd] [PATCH v2 0/3] NBD Structured Read Wouter Verhelst
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.