From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:50932) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1an1bD-0000kz-8Q for qemu-devel@nongnu.org; Mon, 04 Apr 2016 06:19:13 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1an1b9-0000dk-U1 for qemu-devel@nongnu.org; Mon, 04 Apr 2016 06:19:11 -0400 Received: from metis.ext.4.pengutronix.de ([92.198.50.35]:50313) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1an1b9-0000dS-KB for qemu-devel@nongnu.org; Mon, 04 Apr 2016 06:19:07 -0400 From: Markus Pargmann Date: Mon, 04 Apr 2016 12:18:37 +0200 Message-ID: <1580868.I2qeg0bkmp@adelgunde> In-Reply-To: <20160325084929.GA2671@grep.be> References: <1458742562-30624-1-git-send-email-den@openvz.org> <56F4654D.2020700@redhat.com> <20160325084929.GA2671@grep.be> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart5638902.FJ9QWIOBTD"; micalg="pgp-sha256"; protocol="application/pgp-signature" Subject: Re: [Qemu-devel] [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Wouter Verhelst Cc: nbd-general@lists.sourceforge.net, Kevin Wolf , qemu-devel@nongnu.org, Stefan Hajnoczi , Paolo Bonzini , "Denis V. Lunev" --nextPart5638902.FJ9QWIOBTD Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="us-ascii" Hi, back from my easter vacation. A bit surprised to find 200 mails in the NBD mailing list ;). On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote: > On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote: > > On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > > > From: Pavel Borzenkov > > >=20 > > > With the availability of sparse storage formats, it is often need= ed to > > > query status of a particular LBA range and read only those blocks= of > > > data that are actually present on the block device. > > >=20 > > > To provide such information, the patch adds GET_LBA_STATUS extens= ion > > > with one new NBD_CMD_GET_LBA_STATUS command. > > >=20 > >=20 > > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > > + > > > + An LBA range status query request. Length and offset define = the range > > > + of interest. The server MUST reply with a reply header, foll= owed > > > + immediately by the following data: > > > + > > > + - 32 bits, length of parameter data that follow (unsigned)= > > > + - zero or more LBA status descriptors, each having the fol= lowing > > > + structure: > > > + > > > + * 64 bits, offset (unsigned) > > > + * 32 bits, length (unsigned) > > > + * 16 bits, status (unsigned) > > > + > > > + unless an error condition has occurred. > >=20 > > To date, only the NBD_CMD_READ command caused the server to send da= ta > > after the handle in its reply. This would be the second command wi= th a > > data field in the response, but it is returning a variable amount o= f > > data, not directly tied to the client's length - but at least you d= id > > make it structured so that the client knows how much to read. Howe= ver, > > your patch is incomplete; you'll need to edit the "Transmission" se= ction > > of the document to mention the rules on the server sending data, as= the > > existing text would now be wrong: > >=20 > > > The server replies with: > > >=20 > > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC) > > > S: 32 bits, error > > > S: 64 bits, handle > > > S: (length bytes of data if the request is of type NBD_CMD_READ) > >=20 > > You may also want to add a rule that for all future extensions, any= > > command that requires data in the server response (other than the s= erver > > response to NBD_CMD_READ) must include a 32-bit length as the first= > > field of its data payload, so that the server reply is always struc= tured. >=20 > Right. >=20 > > Hmm, I still think it would be hard to write a wireshark dissector = for > > server replies, since there is no indication whether a given reply = will > > include data or not. >=20 > Well, a wireshark dissector already exists. However, it is very old; = it > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn'= t > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't > deal with negotiation at all. It was written when newstyle negotiatio= n > didn't exist yet, and scanning for INIT_PASSWD at the time was too ha= rd, > according to the guy who wrote it (don't remember who that was). >=20 > > The _client_ knows (well, any well-written client > > that uses a different value for 'handle' for every command sent to = the > > server), because it can read the returned 'handle' field to know wh= at > > command it matches to and therefore what data to expect; but a > > third-party observer seeing _just_ the server messages has no idea = which > > server responses should have payload. >=20 > It can if it knows enough about the protocol, but granted, that makes= it > harder for us to extend the protocol without breaking the dissector. >=20 > > Scanning the stream and assuming that a new reply starts each time > > NBD_REPLY_MAGIC is encountered is a close approximation, but hits > > false positives if the data being returned for NBD_CMD_READ contain= s > > the same magic number as part of its contents. >=20 > Indeed. >=20 > > Obviously, back-compat says we can't change the response to > > NBD_CMD_READ, but that means that a wireshark dissector has to eith= er > > maintain context, or hunt through returned data looking for potenti= al > > magic numbers and possibly hitting false positives. > > > > That said, maybe we want to allow global flag negotiation prior to > > transmission to add a new flag on both server and client side - the= > > server would advertise that it is capable of a new reply mode, and = the > > client then has to reply that it wants to use the new reply mode; i= n >=20 > Global flag negotiation is already possible. There is a client flags > field, which is sent before option haggling, that could be used for > negotiation of such backwards-incompatible features. >=20 > > that mode, all server replies starting with magic 0x67446698 > > (NBD_REPLY_MAGIC) will never have a data payload, and all commands = that > > cause a reply with payload (both NBD_CMD_READ and the new > > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it= ) > > will reply with a NEW magic number: > >=20 > > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2) > > S: 32 bits, error > > S: 64 bits, handle > > S: 32 bits, length > > S: length bytes of data >=20 > I like this. However, before committing to it, I would like to see > Markus' feedback on that (explicit Cc added, even though he's on the > list, too). >=20 > We'd also need a way to communicate the ability to speak this protoco= l > from the kernel to userspace before telling the server that the clien= t > understands something which maybe its kernel doesn't. >=20 > Markus? >=20 > > so that the server's data stream is fully structured without having= to > > maintain context of the client's requests. That is, a dissector ca= n now > > merely scan for both magic numbers; and on a stream between a clien= t and > > server that have negotiated the new mode, the old magic number will= > > never have a payload, and the new magic number will always be > > accompanied with a payload that describes how much data to read to = the > > boundary of the next reply. > >=20 > > For that matter, right now, NBD_CMD_READ is required to either end = the > > session or return length bytes of data even when error is non-zero = (but > > where those bytes MAY be invalid); but by adding a negotiated flag = for > > structured length replies, it would be possible to allow for short = reads > > and/or returning an error with 0 bytes of payload but still keeping= the > > connection to the client open, without having to send wasted bytes = over > > the wire. >=20 > Yes. This has been discussed on the nbd-general list in the past. The= re > is also the (significant) problem of the server having maybe already > sent out the header before discovering there is an error, at which po= int > it can *only* drop the connection. I am still not through all the new mails on the list, so there may be some more discussions about this. But I will answer here simply. I generally like the idea of having this new kind of reply. Is this solving our issues where we want to "stream" data directly from a filedescriptor into a tcp-stream? Does it make sense to extend this for reads with an offset? This way we= could not only send in chunks but also order them randomly. Is there an= y use-case where it does make sense to read data not sequentially? Best Regards, Markus =2D-=20 Pengutronix e.K. | = | Industrial Linux Solutions | http://www.pengutronix.de/= | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 = | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-555= 5 | --nextPart5638902.FJ9QWIOBTD Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJXAj+DAAoJEEpcgKtcEGQQBGwP/iqqBiG5sINJqua9ysukrHz9 HAm17tBt+gVCnmt14/ZsE0ZVdycyNFpRDikS3sms9GVrLdAaJf6ORi2K0bvuaAYQ uTMEBt6u3uhYsDjPWtlNS9rmBZJS/i3pTpg1t2hR4DTdK0JKTkEI/fBeeb5LRYi5 7oqR9rhS+IdrUEN5YN+4P8G4VqRJCiWecD759u93dgr7q3J52JeGHD5oJ4PVYMJm J1B+HrQE2szi558fmRTgr0+HzyTOBevEtHku966ouXA/kz36P5GGngQ4fqqBQYoX OtdopYWRUtA7xmJO5KgNhLlFUWLFB31nVSpW8zmvAblXwydiBtAjG9Nafc6EpqO5 DODf8KUXfyLWxmd/S6SmJBZN54BhhzNkQ/AIsYBFwQ1nRFXZVhXz2FJnG3VBdC75 iepCX/Mevr65IZEyT+veb2jL2nmmG0FPXyJi/WKyazfXm3PZRXbeEg5PG5Z7kTKj l6GzmTkCQ/AyJy/VbmodXJz9ExOMs1kstySh5bjR0sU1dsoCMCAMcMVYtyIRXsz+ XOzlGn8Aw3NYLtGT0Zc2aZfoKg1bheD0F+YXNlEHvZmCt5qAwtmr6sxN/LXN9q0C b1gWXobDjPmBWd8keou+pNy+ANcbGTnCFst5V27EsBGc3I2ukYQ2Sy4Ip8bLRQ9c uff/bST5uFJCLxNrc66K =5iUW -----END PGP SIGNATURE----- --nextPart5638902.FJ9QWIOBTD--