Hi, back from my easter vacation. A bit surprised to find 200 mails in the NBD mailing list ;). On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote: > On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote: > > On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > > > From: Pavel Borzenkov > > > > > > With the availability of sparse storage formats, it is often needed to > > > query status of a particular LBA range and read only those blocks of > > > data that are actually present on the block device. > > > > > > To provide such information, the patch adds GET_LBA_STATUS extension > > > with one new NBD_CMD_GET_LBA_STATUS command. > > > > > > > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > > + > > > + An LBA range status query request. Length and offset define the range > > > + of interest. The server MUST reply with a reply header, followed > > > + immediately by the following data: > > > + > > > + - 32 bits, length of parameter data that follow (unsigned) > > > + - zero or more LBA status descriptors, each having the following > > > + structure: > > > + > > > + * 64 bits, offset (unsigned) > > > + * 32 bits, length (unsigned) > > > + * 16 bits, status (unsigned) > > > + > > > + unless an error condition has occurred. > > > > To date, only the NBD_CMD_READ command caused the server to send data > > after the handle in its reply. This would be the second command with a > > data field in the response, but it is returning a variable amount of > > data, not directly tied to the client's length - but at least you did > > make it structured so that the client knows how much to read. However, > > your patch is incomplete; you'll need to edit the "Transmission" section > > of the document to mention the rules on the server sending data, as the > > existing text would now be wrong: > > > > > The server replies with: > > > > > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC) > > > S: 32 bits, error > > > S: 64 bits, handle > > > S: (length bytes of data if the request is of type NBD_CMD_READ) > > > > You may also want to add a rule that for all future extensions, any > > command that requires data in the server response (other than the server > > response to NBD_CMD_READ) must include a 32-bit length as the first > > field of its data payload, so that the server reply is always structured. > > Right. > > > Hmm, I still think it would be hard to write a wireshark dissector for > > server replies, since there is no indication whether a given reply will > > include data or not. > > Well, a wireshark dissector already exists. However, it is very old; it > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't > deal with negotiation at all. It was written when newstyle negotiation > didn't exist yet, and scanning for INIT_PASSWD at the time was too hard, > according to the guy who wrote it (don't remember who that was). > > > The _client_ knows (well, any well-written client > > that uses a different value for 'handle' for every command sent to the > > server), because it can read the returned 'handle' field to know what > > command it matches to and therefore what data to expect; but a > > third-party observer seeing _just_ the server messages has no idea which > > server responses should have payload. > > It can if it knows enough about the protocol, but granted, that makes it > harder for us to extend the protocol without breaking the dissector. > > > Scanning the stream and assuming that a new reply starts each time > > NBD_REPLY_MAGIC is encountered is a close approximation, but hits > > false positives if the data being returned for NBD_CMD_READ contains > > the same magic number as part of its contents. > > Indeed. > > > Obviously, back-compat says we can't change the response to > > NBD_CMD_READ, but that means that a wireshark dissector has to either > > maintain context, or hunt through returned data looking for potential > > magic numbers and possibly hitting false positives. > > > > That said, maybe we want to allow global flag negotiation prior to > > transmission to add a new flag on both server and client side - the > > server would advertise that it is capable of a new reply mode, and the > > client then has to reply that it wants to use the new reply mode; in > > Global flag negotiation is already possible. There is a client flags > field, which is sent before option haggling, that could be used for > negotiation of such backwards-incompatible features. > > > that mode, all server replies starting with magic 0x67446698 > > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that > > cause a reply with payload (both NBD_CMD_READ and the new > > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it) > > will reply with a NEW magic number: > > > > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2) > > S: 32 bits, error > > S: 64 bits, handle > > S: 32 bits, length > > S: length bytes of data > > I like this. However, before committing to it, I would like to see > Markus' feedback on that (explicit Cc added, even though he's on the > list, too). > > We'd also need a way to communicate the ability to speak this protocol > from the kernel to userspace before telling the server that the client > understands something which maybe its kernel doesn't. > > Markus? > > > so that the server's data stream is fully structured without having to > > maintain context of the client's requests. That is, a dissector can now > > merely scan for both magic numbers; and on a stream between a client and > > server that have negotiated the new mode, the old magic number will > > never have a payload, and the new magic number will always be > > accompanied with a payload that describes how much data to read to the > > boundary of the next reply. > > > > For that matter, right now, NBD_CMD_READ is required to either end the > > session or return length bytes of data even when error is non-zero (but > > where those bytes MAY be invalid); but by adding a negotiated flag for > > structured length replies, it would be possible to allow for short reads > > and/or returning an error with 0 bytes of payload but still keeping the > > connection to the client open, without having to send wasted bytes over > > the wire. > > Yes. This has been discussed on the nbd-general list in the past. There > is also the (significant) problem of the server having maybe already > sent out the header before discovering there is an error, at which point > it can *only* drop the connection. I am still not through all the new mails on the list, so there may be some more discussions about this. But I will answer here simply. I generally like the idea of having this new kind of reply. Is this solving our issues where we want to "stream" data directly from a filedescriptor into a tcp-stream? Does it make sense to extend this for reads with an offset? This way we could not only send in chunks but also order them randomly. Is there any use-case where it does make sense to read data not sequentially? Best Regards, Markus -- Pengutronix e.K. | | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |