From: Dario Faggioli
Subject: Re: [DOC v1] Xen transport for 9pfs
Date: Fri, 2 Dec 2016 09:54:09 +0100
Message-ID: <1480668849.3445.62.camel@citrix.com>
To: Stefano Stabellini
Cc: xen-devel@lists.xenproject.org, wei.liu2@citrix.com

On Thu, 2016-12-01 at 15:14 -0800, Stefano Stabellini wrote:
> On Thu, 1 Dec 2016, Dario Faggioli wrote:
> >
> > On Tue, 2016-11-29 at 15:34 -0800, Stefano Stabellini wrote:
> > >
> > >     ring-ref- (ring-ref-0, ring-ref-1, etc)
> > >
> > blkif uses ring-ref%u, rather than ring-ref-%u (i.e., no dash before
> > the index). Not a big deal, I guess, but I thought it could be nice
> > to be a bit more uniform.
>
> Sure, but in this case each ring-ref-%u is used to map a different
> ring.
>
Yeah, right. So it may even be a good thing to differentiate, indeed...

> That said, I can make the change.
>
I don't know. I, FWIW, thought it would be good, now I'm not so sure any
longer. Yours and maintainers' call, I guess. :-)

> > If it is, what's the typical envisioned use of these multiple rings,
> > if I can ask?
>
> They are used to handle multiple read/write requests in parallel.
> Let's assume that we configure the rings to be 8K each. Given that the
> data is transmitted over the ring, each ring can hold only one
> outstanding 4K write request (there is a header for each write
> request).
>
Ok.

> With two 8K rings, we can have two outstanding 4K write requests, each
> of them processed in parallel on a different vcpu.
>
> The system is completely configurable in terms of number and size of
> rings, so a user can configure it to only export one 4K ring for
> example or go as far as several 2MB rings.
>
Right. So, it is indeed similar to blkif multiqueueing, with which it
also shares the idea/objective of exploiting parallelism at the (v)CPU
level, but without (quite obviously, in this case) any direct link to
hardware queues in disk controllers, and without the protocol itself
giving any direction or indication of how to actually use all this.

Got it. Nice.

FWIW, I think a few words --just a shorter version of what you just
said-- may be useful if present in this document.

> > >     /* not actually C compliant (ring_order changes from socket to
> > >     socket) */
> > >     struct ring_data {
> > >         char in[((1 << ring_order) << PAGE_SHIFT) / 2];
> > >         char out[((1 << ring_order) << PAGE_SHIFT) / 2];
> > >     };
> > >
> > Sorry, what does "ring_order changes from socket to socket" mean?
>
> Woops, copy/paste error from PVCalls. I meant "ring_order changes from
> ring to ring".
>
Ah, yes, now it makes sense. :-)

BTW, what's the reason for putting ring_order inside xen_9pfs_intf,
instead of having a ring-page-order (well, actually, a
ring-%u-page-order) xenstore key?
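For reference, the sizes implied by the quoted `struct ring_data` can be
sketched as below. This is only an illustration, assuming 4K pages
(PAGE_SHIFT of 12); the helper names `ring_bytes` and `dir_bytes` and the
`MAX_RING_ORDER` bound are made up here and are not part of the protocol.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT     12  /* assumption: 4K pages */
#define MAX_RING_ORDER 9   /* assumption: 512 pages, the limit discussed in this thread */

/* Total size in bytes of a ring made of (1 << ring_order) pages. */
static size_t ring_bytes(unsigned int ring_order)
{
    assert(ring_order <= MAX_RING_ORDER);
    return ((size_t)1 << ring_order) << PAGE_SHIFT;
}

/* Size in bytes of each direction: the ring is split in two halves, in[] and out[]. */
static size_t dir_bytes(unsigned int ring_order)
{
    return ring_bytes(ring_order) / 2;
}
```

With these assumptions, ring_order=1 gives the 8K ring from the example
above (4K for in[], 4K for out[]), and ring_order=9 gives a 2MB ring.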
> > > The binary layout of `struct xen_9pfs_intf` follows:
> > >
> > >     0         4         8         12        16        20
> > >     +---------+---------+---------+---------+---------+
> > >     | in_cons | in_prod |out_cons |out_prod |ring_orde|
> > >     +---------+---------+---------+---------+---------+
> > >
> > >     20        24        28      4092      4096
> > >     +---------+---------+----//---+---------+
> > >     |  ref[0] |  ref[1] |         |  ref[N] |
> > >     +---------+---------+----//---+---------+
> > >
> > > **N.B** For one page, N is maximum 1019 ((4096-20)/4), but given
> > > that N needs to be a power of two, actually max N is 512.
> > >
> > It may again be me being still too naive, but I'd quickly add at
> > least another example, with the value of N computed for a multiple
> > pages ring. Something like: "For 4 pages (i.e., ring_order=2), N
> > is..."
>
> For 4 pages, N is 4. N is the number of pages that make up the ring.
>
> Maybe there is a misunderstanding, let me try to explain it again:
> each page shared via xenstore contains information to handle one new
> ring, including grant references for the pages making up the multipage
> ring itself. I'll repeat: pages shared via xenstore are not used as a
> ring, they are used to setup the rings, each page has the info to
> setup a new ring.
>
Right, I got this. And indeed I expressed myself very badly above.

So, the descriptor of 1 ring is one page. Such page contains, in single
page rings, the reference to another page, which is the actual ring. If
the ring is multi-page, the descriptor page contains an array of page
references which, together, are the actual ring.

Such array --of which N is, in the diagram above, the last index-- can
be, as you say, up to 1019 elements big (the available space in a ring
descriptor page).

Therefore, the math I was asking about is really the relationship
between N and max-ring-page-order. That is, a ring can have at most
2^max-ring-page-order pages, and N can be at most 1019 (well, I think
it's 1018 if, as in diagram above, you count from 0, but that does not
matter much); so:

 2^max-ring-page-order <= N
 lb(2^max-ring-page-order) <= lb(N)    //lb(): base 2 logarithm
 max-ring-page-order <= lb(N)

and, considering that max-ring-page-order must be a natural number:

 max-ring-page-order <= floor(lb(N))
 max-ring-page-order <= floor(lb(1018))
 max-ring-page-order <= floor(9.9915)
 max-ring-page-order <= 9

so a ring can be at most 2^9 pages big, which indeed matches with your
own calculations, and brings us to the fact that the maximum size of a
ring is 512*4KB=2MB.

So, to recap (sorry for being so long!), I think that saying:

"**N.B** For one page, N is maximum 1019 ((4096-20)/4), but given that N
needs to be a power of two, actually max N is 512."

is indeed correct, and probably makes it clear enough that the maximum
ring size is 2MB. It's not equally easy, IMO, to map this back to the
fact that this also means max-ring-page-order must be at most 9, and
that is not spelled out anywhere else, AFAICT.

Therefore, an example of how things look with a couple of different
values of ring_order, or some shorter and less boring version of this
reasoning and calculations may help with that.

That's what I'm trying to say.
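The calculation above can also be written down as a small C sketch. It
assumes a 4096-byte descriptor page, a 20-byte header (in_cons, in_prod,
out_cons, out_prod, ring_order) and 4-byte grant references, as in the
quoted layout; the macro and function names are hypothetical and only
serve the illustration. It counts 1019 references rather than 1018 as
the last index, which gives the same order.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_PAGE_SIZE   4096u /* assumption: one 4K descriptor page */
#define DESC_HEADER_SIZE 20u   /* in_cons, in_prod, out_cons, out_prod, ring_order */

/* Number of 4-byte grant references that fit after the header. */
static unsigned int max_refs(void)
{
    return (unsigned int)((DESC_PAGE_SIZE - DESC_HEADER_SIZE) / sizeof(uint32_t));
}

/* floor(lb(n)): the largest order such that (1 << order) <= n; n must be > 0. */
static unsigned int floor_log2(unsigned int n)
{
    unsigned int order = 0;

    assert(n > 0);
    while ((n >>= 1) != 0)
        order++;
    return order;
}

/* Largest ring order whose page references still fit in one descriptor page. */
static unsigned int max_ring_page_order(void)
{
    return floor_log2(max_refs());
}
```

Under these assumptions max_refs() is 1019 and max_ring_page_order() is
9, i.e. at most 512 pages (2MB with 4K pages) per ring.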
:-)

> The structure of these "setup pages" is `struct xen_9pfs_intf`. Each
> page is completely separate and independent from the others. Given
> that one page is just 4096 bytes, it can contain max 512 grant refs
> (see calculation above). So the max size of one multipage ring is 512
> pages = 2MB.
>
> Does it make sense?
>
It does, and this is probably me being a mix of, not too used to this,
and too picky... If that's the case, sorry for the noise. :-D

Thanks and Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)