Hi,

back from my easter vacation. A bit surprised to find 200 mails in the
NBD mailing list ;).

On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> > On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > > 
> > > With the availability of sparse storage formats, it is often needed to
> > > query status of a particular LBA range and read only those blocks of
> > > data that are actually present on the block device.
> > > 
> > > To provide such information, the patch adds GET_LBA_STATUS extension
> > > with one new NBD_CMD_GET_LBA_STATUS command.
> > > 
> > 
> > > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > > +
> > > +    An LBA range status query request. Length and offset define the range
> > > +    of interest. The server MUST reply with a reply header, followed
> > > +    immediately by the following data:
> > > +
> > > +      - 32 bits, length of parameter data that follow (unsigned)
> > > +      - zero or more LBA status descriptors, each having the following
> > > +        structure:
> > > +
> > > +        * 64 bits, offset (unsigned)
> > > +        * 32 bits, length (unsigned)
> > > +        * 16 bits, status (unsigned)
> > > +
> > > +    unless an error condition has occurred.
> > 
> > To date, only the NBD_CMD_READ command caused the server to send data
> > after the handle in its reply.  This would be the second command with a
> > data field in the response, but it is returning a variable amount of
> > data, not directly tied to the client's length - but at least you did
> > make it structured so that the client knows how much to read.  However,
> > your patch is incomplete; you'll need to edit the "Transmission" section
> > of the document to mention the rules on the server sending data, as the
> > existing text would now be wrong:
> > 
> > > The server replies with:
> > > 
> > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > > S: 32 bits, error
> > > S: 64 bits, handle
> > > S: (length bytes of data if the request is of type NBD_CMD_READ)
> > 
> > You may also want to add a rule that for all future extensions, any
> > command that requires data in the server response (other than the server
> > response to NBD_CMD_READ) must include a 32-bit length as the first
> > field of its data payload, so that the server reply is always structured.
> 
> Right.
> 
> > Hmm, I still think it would be hard to write a wireshark dissector for
> > server replies, since there is no indication whether a given reply will
> > include data or not.
> 
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).
> 
> > The _client_ knows (well, any well-written client
> > that uses a different value for 'handle' for every command sent to the
> > server), because it can read the returned 'handle' field to know what
> > command it matches to and therefore what data to expect; but a
> > third-party observer seeing _just_ the server messages has no idea which
> > server responses should have payload.
> 
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
> 
> > Scanning the stream and assuming that a new reply starts each time
> > NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> > false positives if the data being returned for NBD_CMD_READ contains
> > the same magic number as part of its contents.
> 
> Indeed.
> 
> > Obviously, back-compat says we can't change the response to
> > NBD_CMD_READ, but that means that a wireshark dissector has to either
> > maintain context, or hunt through returned data looking for potential
> > magic numbers and possibly hitting false positives.
> >
> > That said, maybe we want to allow global flag negotiation prior to
> > transmission to add a new flag on both server and client side - the
> > server would advertise that it is capable of a new reply mode, and the
> > client then has to reply that it wants to use the new reply mode; in
> 
> Global flag negotiation is already possible. There is a client flags
> field, which is sent before option haggling, that could be used for
> negotiation of such backwards-incompatible features.
> 
> > that mode, all server replies starting with magic 0x67446698
> > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> > cause a reply with payload (both NBD_CMD_READ and the new
> > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> > will reply with a NEW magic number:
> > 
> > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: 32 bits, length
> > S: length bytes of data
> 
> I like this. However, before committing to it, I would like to see
> Markus' feedback on that (explicit Cc added, even though he's on the
> list, too).
> 
> We'd also need a way to communicate the ability to speak this protocol
> from the kernel to userspace before telling the server that the client
> understands something which maybe its kernel doesn't.
> 
> Markus?
> 
> > so that the server's data stream is fully structured without having to
> > maintain context of the client's requests.  That is, a dissector can now
> > merely scan for both magic numbers; and on a stream between a client and
> > server that have negotiated the new mode, the old magic number will
> > never have a payload, and the new magic number will always be
> > accompanied with a payload that describes how much data to read to the
> > boundary of the next reply.
> > 
> > For that matter, right now, NBD_CMD_READ is required to either end the
> > session or return length bytes of data even when error is non-zero (but
> > where those bytes MAY be invalid); but by adding a negotiated flag for
> > structured length replies, it would be possible to allow for short reads
> > and/or returning an error with 0 bytes of payload but still keeping the
> > connection to the client open, without having to send wasted bytes over
> > the wire.
> 
> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.

I am still not through all the new mails on the list, so there may be
some more discussions about this. But I will answer here simply.

I generally like the idea of having this new kind of reply. Is this
solving our issues where we want to "stream" data directly from a
filedescriptor into a tcp-stream?

Does it make sense to extend this for reads with an offset? This way we
could not only send in chunks but also order them randomly. Is there any
use-case where it does make sense to read data not sequentially?

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |