* RFC: XenSock brainstorming
@ 2016-06-06  9:33 ` Stefano Stabellini
  2016-06-06  9:57 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread

From: Stefano Stabellini @ 2016-06-06 9:33 UTC (permalink / raw)
To: xen-devel; +Cc: stefano, joao.m.martins, wei.liu2, roger.pau

Hi all,

a couple of months ago I started working on a new PV protocol for
virtualizing syscalls. I named it XenSock, as its main purpose is to
allow the implementation of the POSIX socket API in a domain other than
the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc
to be implemented directly in Dom0. In a way this is conceptually
similar to virtio-9pfs, but for sockets rather than filesystem APIs.
See this diagram as reference:

https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing

The frontends and backends could live either in userspace or kernel
space, with different trade-offs. My current prototype is based on Linux
kernel drivers, but it would be nice to have userspace drivers too.
Discussing where the drivers should be implemented is beyond the scope
of this email.


# Goals

The goal of the protocol is to provide networking capabilities to any
guest, with the following added benefits:

* guest networking should work out of the box with VPNs, wireless
networks and any other complex network configurations in Dom0

* guest services should listen on ports bound directly to Dom0 IP
addresses, fitting naturally into a Docker based workflow, where guests
are Docker containers

* Dom0 should have full visibility into guest behavior and should be
able to perform inexpensive filtering and manipulation of guest calls

* XenSock should provide excellent performance. Unoptimized early code
reaches 22 Gbit/sec on a single TCP stream and scales to 60 Gbit/sec
with 3 streams.


# Status

I would like to get feedback on the high level architecture, the data
path and the ring formats.

Beware that the protocol and drivers are in their very early days. I
don't have all the information to write a design document yet. The ABI
is neither complete nor stable.

The code is not ready for xen-devel yet, but I would be happy to push a
git branch if somebody is interested in contributing to the project.


# Design and limitations

The frontend connects to the backend following the traditional xenstore
based exchange of information.
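Purely as an illustration of what that exchange could look like, the
xenstore nodes might follow the usual Xen PV driver handshake pattern.
Every path, key name and value below is a placeholder of mine, not part
of any proposed ABI:

    /local/domain/<front>/device/xensock/0/backend       = "/local/domain/<back>/backend/xensock/<front>/0"
    /local/domain/<front>/device/xensock/0/backend-id    = "<back>"
    /local/domain/<front>/device/xensock/0/ring-ref      = "<grant ref of the command ring page>"
    /local/domain/<front>/device/xensock/0/event-channel = "<port for command ring notifications>"
    /local/domain/<front>/device/xensock/0/state         = "<XenbusState of the frontend>"
    /local/domain/<back>/backend/xensock/<front>/0/state = "<XenbusState of the backend>"

Both ends would then walk the usual XenbusState handshake until both
report Connected, at which point the command ring described below can
be used.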
Frontend and backend set up an event channel and a shared ring. The
ring is used by the frontend to forward socket API calls to the
backend; I am referring to this ring as the command ring. This is an
example of the ring format:

#define XENSOCK_CONNECT 0
#define XENSOCK_RELEASE 3
#define XENSOCK_BIND    4
#define XENSOCK_LISTEN  5
#define XENSOCK_ACCEPT  6
#define XENSOCK_POLL    7

struct xen_xensock_request {
    uint32_t id;     /* private to guest, echoed in response */
    uint32_t cmd;    /* command to execute */
    uint64_t sockid; /* id of the socket */
    union {
        struct xen_xensock_connect {
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } connect;
        struct xen_xensock_bind {
            uint8_t addr[28]; /* ipv6 ready */
            uint32_t len;
        } bind;
        struct xen_xensock_accept {
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
            uint64_t sockid;
        } accept;
    } u;
};

struct xen_xensock_response {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    int32_t ret;
};

DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
                  struct xen_xensock_response);

Connect and accept lead to the creation of new active sockets. Today
each active socket has its own event channel and ring for sending and
receiving data. Data rings have the following format:

#define XENSOCK_DATARING_ORDER 2
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE  (XENSOCK_DATARING_PAGES << PAGE_SHIFT)

typedef uint32_t XENSOCK_RING_IDX;

struct xensock_ring_intf {
    char in[XENSOCK_DATARING_SIZE/4];
    char out[XENSOCK_DATARING_SIZE/2];
    XENSOCK_RING_IDX in_cons, in_prod;
    XENSOCK_RING_IDX out_cons, out_prod;
    int32_t in_error, out_error;
};

The ring works like the Xen console ring (see
xen/include/public/io/console.h). Data is copied to/from the ring by
both frontend and backend. in_error and out_error are used to report
errors. This simple design works well, but it requires at least 1 page
per active socket. To get good performance (~20 Gbit/sec single
stream), we need buffers of at least 64K, so in practice we are looking
at about 64 pages per ring (order 6).

I am currently investigating the use of AVX2 to perform the data copy.


# Brainstorming

Are 64 pages per active socket a reasonable amount in the context of
modern OS level networking? I believe that regular Linux TCP sockets
allocate something in that order of magnitude.

If that's too much, I spent some time thinking about ways to reduce it.
Some ideas follow.

We could split up send and receive into two different data structures.
I am thinking of introducing a single ring for all active sockets, with
variable size messages for sending data. Something like the following:

struct xensock_ring_entry {
    uint64_t sockid; /* identifies a socket */
    uint32_t len;    /* length of data to follow */
    uint8_t data[];  /* variable length data */
};

One ring would be dedicated to holding xensock_ring_entry structures,
one after another in a classic circular fashion. Two indexes, out_cons
and out_prod, would still be used the same way they are used in the
console ring, but I would place them on a separate page for clarity:

struct xensock_ring_intf {
    XENSOCK_RING_IDX out_cons, out_prod;
};

The frontend, that is the producer, writes a new struct
xensock_ring_entry to the ring, careful not to exceed the remaining
free bytes available, then increments out_prod by the amount written.
The backend, that is the consumer, reads the new struct
xensock_ring_entry, reading as much data as specified by "len", then
increments out_cons by the size of the entry read. I think this could
work.
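To make the producer side of this concrete, here is a rough sketch of
the frontend send path. It is only an illustration: the function name,
the 64K ring size, the byte-by-byte copy and the wmb() barrier are my
own assumptions, not part of any proposed ABI, and notifications and
locking are omitted.

#define XENSOCK_SENDRING_SIZE    (1 << 16) /* assumed: power of two */
#define XENSOCK_SENDRING_MASK(i) ((i) & (XENSOCK_SENDRING_SIZE - 1))

/* Frontend (producer): append one chunk of socket data to the shared
 * send ring, or fail with -ENOBUFS if there is not enough room for the
 * whole xensock_ring_entry header plus payload. */
static int xensock_sendring_write(struct xensock_ring_intf *intf,
                                  uint8_t *ring, uint64_t sockid,
                                  const uint8_t *buf, uint32_t len)
{
    XENSOCK_RING_IDX cons = intf->out_cons;
    XENSOCK_RING_IDX prod = intf->out_prod;
    struct xensock_ring_entry ent = { .sockid = sockid, .len = len };
    uint32_t needed = sizeof(ent) + len;
    uint32_t i;

    if (XENSOCK_SENDRING_SIZE - (prod - cons) < needed)
        return -ENOBUFS; /* not enough free bytes on the ring */

    /* Copy the header, then the payload, byte by byte so that
     * wrap-around at the end of the ring is handled by the mask. */
    for (i = 0; i < sizeof(ent); i++)
        ring[XENSOCK_SENDRING_MASK(prod + i)] = ((uint8_t *)&ent)[i];
    for (i = 0; i < len; i++)
        ring[XENSOCK_SENDRING_MASK(prod + sizeof(ent) + i)] = buf[i];

    wmb(); /* entry must be visible before the producer index moves */
    intf->out_prod = prod + needed;
    /* notify the backend over the event channel here */
    return 0;
}

The backend would do the mirror operation: read the entry header at
out_cons, hand "len" bytes to sendmsg, and only then advance out_cons
by the size of the entry, leaving it in place if sendmsg returns
ENOBUFS, as described below.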
Theoretically we could do the same thing for receive: a separate single
ring shared by all active sockets; we could even reuse struct
xensock_ring_entry.

However I have doubts that this model could work well for receive. When
sending data, all sockets on the frontend side copy buffers onto this
single ring. If there is no room, the frontend returns ENOBUFS. The
backend picks up the data from the ring and calls sendmsg, which can
also return ENOBUFS. In that case we don't increment out_cons, leaving
the data on the ring, and the backend will try again in the near
future. Error messages would have to go on a separate data structure,
which I haven't finalized yet.

When receiving from a socket, the backend copies data to the ring as
soon as data is available, perhaps before the frontend requests the
data. Buffers are copied to the ring not necessarily in the order that
the frontend might want to read them. Thus the frontend would have to
copy them out of the common ring into private per-socket dynamic
buffers, just to free the ring as soon as possible and consume the next
xensock_ring_entry. That doesn't look very advantageous in terms of
memory consumption or performance.

Alternatively, the frontend could leave the data on the ring if the
application hasn't asked for it yet. In that case the frontend could
look ahead without incrementing the in_cons pointer, but it would have
to keep track of which entries have been consumed and which have not.
Only when the ring is full would the frontend have no choice but to
copy the data out of the ring into temporary buffers. I am not sure how
well this could work in practice.

As a compromise, we could use a single shared ring for sending data and
one ring per active socket to receive data. This would cut the
per-socket memory consumption in half (maybe to a quarter, moving the
indexes out of the shared data ring into a separate page) and might be
an acceptable trade-off.

Any feedback or ideas?

Many thanks,

Stefano

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:33 ` RFC: XenSock brainstorming Stefano Stabellini @ 2016-06-06 9:57 ` Andrew Cooper 2016-06-06 10:16 ` Paul Durrant 2016-06-06 10:25 ` Stefano Stabellini 2016-06-23 16:03 ` Stefano Stabellini 2016-06-23 16:28 ` David Vrabel 2 siblings, 2 replies; 9+ messages in thread From: Andrew Cooper @ 2016-06-06 9:57 UTC (permalink / raw) To: Stefano Stabellini, xen-devel; +Cc: joao.m.martins, wei.liu2, roger.pau On 06/06/16 10:33, Stefano Stabellini wrote: > Hi all, > > a couple of months ago I started working on a new PV protocol for > virtualizing syscalls. I named it XenSock, as its main purpose is to > allow the implementation of the POSIX socket API in a domain other than > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > to be implemented directly in Dom0. In a way this is conceptually > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > See this diagram as reference: > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > The frontends and backends could live either in userspace or kernel > space, with different trade-offs. My current prototype is based on Linux > kernel drivers but it would be nice to have userspace drivers too. > Discussing where the drivers could be implemented it's beyond the scope > of this email. Just to confirm, you are intending to create a cross-domain transport for all AF_ socket types, or just some? > > > # Goals > > The goal of the protocol is to provide networking capabilities to any > guests, with the following added benefits: Throughout, s/Dom0/the backend/ I expect running the backend in dom0 will be the overwhelmingly common configuration, but you should avoid designing the protocol for just this usecase. > > * guest networking should work out of the box with VPNs, wireless > networks and any other complex network configurations in Dom0 > > * guest services should listen on ports bound directly to Dom0 IP > addresses, fitting naturally in a Docker based workflow, where guests > are Docker containers > > * Dom0 should have full visibility on the guest behavior and should be > able to perform inexpensive filtering and manipulation of guest calls > > * XenSock should provide excellent performance. Unoptimized early code > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > streams. What happens if domU tries to open an AF_INET socket, and the domain has both sockfront and netfront ? What happens if a domain has multiple sockfronts? ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:57 ` Andrew Cooper @ 2016-06-06 10:16 ` Paul Durrant 2016-06-06 10:48 ` Stefano Stabellini 2016-06-06 10:25 ` Stefano Stabellini 1 sibling, 1 reply; 9+ messages in thread From: Paul Durrant @ 2016-06-06 10:16 UTC (permalink / raw) To: Andrew Cooper, Stefano Stabellini, xen-devel Cc: joao.m.martins, Wei Liu, Roger Pau Monne > -----Original Message----- > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of > Andrew Cooper > Sent: 06 June 2016 10:58 > To: Stefano Stabellini; xen-devel@lists.xenproject.org > Cc: joao.m.martins@oracle.com; Wei Liu; Roger Pau Monne > Subject: Re: [Xen-devel] RFC: XenSock brainstorming > > On 06/06/16 10:33, Stefano Stabellini wrote: > > Hi all, > > > > a couple of months ago I started working on a new PV protocol for > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > allow the implementation of the POSIX socket API in a domain other than > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > to be implemented directly in Dom0. In a way this is conceptually > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > See this diagram as reference: > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ- > Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > The frontends and backends could live either in userspace or kernel > > space, with different trade-offs. My current prototype is based on Linux > > kernel drivers but it would be nice to have userspace drivers too. > > Discussing where the drivers could be implemented it's beyond the scope > > of this email. > > Just to confirm, you are intending to create a cross-domain transport > for all AF_ socket types, or just some? > > > > > > > # Goals > > > > The goal of the protocol is to provide networking capabilities to any > > guests, with the following added benefits: > > Throughout, s/Dom0/the backend/ > > I expect running the backend in dom0 will be the overwhelmingly common > configuration, but you should avoid designing the protocol for just this > usecase. > > > > > * guest networking should work out of the box with VPNs, wireless > > networks and any other complex network configurations in Dom0 > > > > * guest services should listen on ports bound directly to Dom0 IP > > addresses, fitting naturally in a Docker based workflow, where guests > > are Docker containers > > > > * Dom0 should have full visibility on the guest behavior and should be > > able to perform inexpensive filtering and manipulation of guest calls > > > > * XenSock should provide excellent performance. Unoptimized early code > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > streams. > > What happens if domU tries to open an AF_INET socket, and the domain has > both sockfront and netfront ? What happens if a domain has multiple > sockfronts? > This sounds awfully like a class of problem that the open onload (http://www.openonload.org/) stack had to solve, and it involved having to track updates to various kernel tables involved in inet routing and having to keep a 'standard' inet socket in hand even when setting up an intercepted (read 'PV' for this connect ) socket since, until connect, you don’t know what the far end is or how to get to it. Having your own AF is definitely a much easier starting point. It also means you get to define all the odd corner-case semantics rather than having to emulate Linux/BSD/Solaris/etc. quirks. 
Paul > ~Andrew > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 10:16 ` Paul Durrant @ 2016-06-06 10:48 ` Stefano Stabellini 0 siblings, 0 replies; 9+ messages in thread From: Stefano Stabellini @ 2016-06-06 10:48 UTC (permalink / raw) To: Paul Durrant Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, xen-devel, joao.m.martins, Roger Pau Monne [-- Attachment #1: Type: TEXT/PLAIN, Size: 3571 bytes --] On Mon, 6 Jun 2016, Paul Durrant wrote: > > -----Original Message----- > > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of > > Andrew Cooper > > Sent: 06 June 2016 10:58 > > To: Stefano Stabellini; xen-devel@lists.xenproject.org > > Cc: joao.m.martins@oracle.com; Wei Liu; Roger Pau Monne > > Subject: Re: [Xen-devel] RFC: XenSock brainstorming > > > > On 06/06/16 10:33, Stefano Stabellini wrote: > > > Hi all, > > > > > > a couple of months ago I started working on a new PV protocol for > > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > > allow the implementation of the POSIX socket API in a domain other than > > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > > to be implemented directly in Dom0. In a way this is conceptually > > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > > See this diagram as reference: > > > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ- > > Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > > > The frontends and backends could live either in userspace or kernel > > > space, with different trade-offs. My current prototype is based on Linux > > > kernel drivers but it would be nice to have userspace drivers too. > > > Discussing where the drivers could be implemented it's beyond the scope > > > of this email. > > > > Just to confirm, you are intending to create a cross-domain transport > > for all AF_ socket types, or just some? > > > > > > > > > > > # Goals > > > > > > The goal of the protocol is to provide networking capabilities to any > > > guests, with the following added benefits: > > > > Throughout, s/Dom0/the backend/ > > > > I expect running the backend in dom0 will be the overwhelmingly common > > configuration, but you should avoid designing the protocol for just this > > usecase. > > > > > > > > * guest networking should work out of the box with VPNs, wireless > > > networks and any other complex network configurations in Dom0 > > > > > > * guest services should listen on ports bound directly to Dom0 IP > > > addresses, fitting naturally in a Docker based workflow, where guests > > > are Docker containers > > > > > > * Dom0 should have full visibility on the guest behavior and should be > > > able to perform inexpensive filtering and manipulation of guest calls > > > > > > * XenSock should provide excellent performance. Unoptimized early code > > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > > streams. > > > > What happens if domU tries to open an AF_INET socket, and the domain has > > both sockfront and netfront ? What happens if a domain has multiple > > sockfronts? > > > > This sounds awfully like a class of problem that the open onload (http://www.openonload.org/) stack had to solve, and it involved having to track updates to various kernel tables involved in inet routing and having to keep a 'standard' inet socket in hand even when setting up an intercepted (read 'PV' for this connect ) socket since, until connect, you don’t know what the far end is or how to get to it. 
>
> Having your own AF is definitely a much easier starting point. It also
> means you get to define all the odd corner-case semantics rather than
> having to emulate Linux/BSD/Solaris/etc. quirks.

Thanks for the pointer, I'll have a look. Other related work includes:

VirtuOS: http://people.cs.vt.edu/~gback/papers/sosp13final.pdf
Virtio-vsock: http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:57 ` Andrew Cooper 2016-06-06 10:16 ` Paul Durrant @ 2016-06-06 10:25 ` Stefano Stabellini 1 sibling, 0 replies; 9+ messages in thread From: Stefano Stabellini @ 2016-06-06 10:25 UTC (permalink / raw) To: Andrew Cooper Cc: Stefano Stabellini, xen-devel, joao.m.martins, wei.liu2, roger.pau On Mon, 6 Jun 2016, Andrew Cooper wrote: > On 06/06/16 10:33, Stefano Stabellini wrote: > > Hi all, > > > > a couple of months ago I started working on a new PV protocol for > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > allow the implementation of the POSIX socket API in a domain other than > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > to be implemented directly in Dom0. In a way this is conceptually > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > See this diagram as reference: > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > The frontends and backends could live either in userspace or kernel > > space, with different trade-offs. My current prototype is based on Linux > > kernel drivers but it would be nice to have userspace drivers too. > > Discussing where the drivers could be implemented it's beyond the scope > > of this email. > > Just to confirm, you are intending to create a cross-domain transport > for all AF_ socket types, or just some? My use case is for AF_INET, so that's what I intend to implement. If somebody wanted to come along and implement AF_IPX for example, I would be fine with that and I would welcome the effort. > > # Goals > > > > The goal of the protocol is to provide networking capabilities to any > > guests, with the following added benefits: > > Throughout, s/Dom0/the backend/ > > I expect running the backend in dom0 will be the overwhelmingly common > configuration, but you should avoid designing the protocol for just this > usecase. As always I am happy to make this as generic and reusable as possible. The goals stated here are my goals with this protocol and I hope many readers will share some of them with me. Although I don't have an interest for running the backend in a domain other than Dom0, there is nothing in the current design (or even my early code) that would prevent driver domains from working. > > * guest networking should work out of the box with VPNs, wireless > > networks and any other complex network configurations in Dom0 > > > > * guest services should listen on ports bound directly to Dom0 IP > > addresses, fitting naturally in a Docker based workflow, where guests > > are Docker containers > > > > * Dom0 should have full visibility on the guest behavior and should be > > able to perform inexpensive filtering and manipulation of guest calls > > > > * XenSock should provide excellent performance. Unoptimized early code > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > streams. > > What happens if domU tries to open an AF_INET socket, and the domain has > both sockfront and netfront ? I wouldn't encourage this configuration. However it works more naturally than one would expect: depending on how DomU is configured, if the AF_INET socket calls are routed to the XenSock frontend, then they are going to appear to come out from Dom0, otherwise they will be routed as usual. 
So, for example, if the frontend is implemented in userspace in a
modified libc library, applications in the guest that use the library
have their data go through XenSock, while everything else goes through
netfront.

> What happens if a domain has multiple sockfronts?

I don't think that should be a valid configuration. I cannot think of a
case where one would want something like that. But if somebody comes up
with a valid scenario for why and how this should work, I would be
happy to work with her to make it happen.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:33 ` RFC: XenSock brainstorming Stefano Stabellini 2016-06-06 9:57 ` Andrew Cooper @ 2016-06-23 16:03 ` Stefano Stabellini 2016-06-23 16:57 ` Stefano Stabellini 2016-06-23 16:28 ` David Vrabel 2 siblings, 1 reply; 9+ messages in thread From: Stefano Stabellini @ 2016-06-23 16:03 UTC (permalink / raw) To: Stefano Stabellini; +Cc: xen-devel, joao.m.martins, wei.liu2, roger.pau Now that Xen 4.7 is out of the door, any more feedback on this? On Mon, 6 Jun 2016, Stefano Stabellini wrote: > Hi all, > > a couple of months ago I started working on a new PV protocol for > virtualizing syscalls. I named it XenSock, as its main purpose is to > allow the implementation of the POSIX socket API in a domain other than > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > to be implemented directly in Dom0. In a way this is conceptually > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > See this diagram as reference: > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > The frontends and backends could live either in userspace or kernel > space, with different trade-offs. My current prototype is based on Linux > kernel drivers but it would be nice to have userspace drivers too. > Discussing where the drivers could be implemented it's beyond the scope > of this email. > > > # Goals > > The goal of the protocol is to provide networking capabilities to any > guests, with the following added benefits: > > * guest networking should work out of the box with VPNs, wireless > networks and any other complex network configurations in Dom0 > > * guest services should listen on ports bound directly to Dom0 IP > addresses, fitting naturally in a Docker based workflow, where guests > are Docker containers > > * Dom0 should have full visibility on the guest behavior and should be > able to perform inexpensive filtering and manipulation of guest calls > > * XenSock should provide excellent performance. Unoptimized early code > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > streams. > > > # Status > > I would like to get feedback on the high level architecture, the data > path and the ring formats. > > Beware that protocol and drivers are in their very early days. I don't > have all the information to write a design document yet. The ABI is > neither complete nor stable. > > The code is not ready for xen-devel yet, but I would be happy to push a > git branch if somebody is interested in contributing to the project. > > > # Design and limitations > > The frontend connects to the backend following the traditional xenstore > based exchange of information. > > Frontend and backend setup an event channel and shared ring. The ring is > used by the frontend to forward socket API calls to the backend. I am > referring to this ring as command ring. 
This is an example of the ring > format: > > #define XENSOCK_CONNECT 0 > #define XENSOCK_RELEASE 3 > #define XENSOCK_BIND 4 > #define XENSOCK_LISTEN 5 > #define XENSOCK_ACCEPT 6 > #define XENSOCK_POLL 7 > > struct xen_xensock_request { > uint32_t id; /* private to guest, echoed in response */ > uint32_t cmd; /* command to execute */ > uint64_t sockid; /* id of the socket */ > union { > struct xen_xensock_connect { > uint8_t addr[28]; > uint32_t len; > uint32_t flags; > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > uint32_t evtchn; > } connect; > struct xen_xensock_bind { > uint8_t addr[28]; /* ipv6 ready */ > uint32_t len; > } bind; > struct xen_xensock_accept { > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > uint32_t evtchn; > uint64_t sockid; > } accept; > } u; > }; > > struct xen_xensock_response { > uint32_t id; > uint32_t cmd; > uint64_t sockid; > int32_t ret; > }; > > DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request, > struct xen_xensock_response); > > > Connect and accept lead to the creation of new active sockets. Today > each active socket has its own event channel and ring for sending and > receiving data. Data rings have the following format: > > #define XENSOCK_DATARING_ORDER 2 > #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER) > #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT) > > typedef uint32_t XENSOCK_RING_IDX; > > struct xensock_ring_intf { > char in[XENSOCK_DATARING_SIZE/4]; > char out[XENSOCK_DATARING_SIZE/2]; > XENSOCK_RING_IDX in_cons, in_prod; > XENSOCK_RING_IDX out_cons, out_prod; > int32_t in_error, out_error; > }; > > The ring works like the Xen console ring (see > xen/include/public/io/console.h). Data is copied to/from the ring by > both frontend and backend. in_error, out_error are used to report > errors. This simple design works well, but it requires at least 1 page > per active socket. To get good performance (~20 Gbit/sec single stream), > we need buffers of at least 64K, so actually we are looking at about 64 > pages per ring (order 6). > > I am currently investigating the usage of AVX2 to perform the data copy. > > > # Brainstorming > > Are 64 pages per active socket a reasonable amount in the context of > modern OS level networking? I believe that regular Linux tcp sockets > allocate something in that order of magnitude. > > If that's too much, I spent some time thinking about ways to reduce it. > Some ideas follow. > > > We could split up send and receive into two different data structures. I > am thinking of introducing a single ring for all active sockets with > variable size messages for sending data. Something like the following: > > struct xensock_ring_entry { > uint64_t sockid; /* identifies a socket */ > uint32_t len; /* length of data to follow */ > uint8_t data[]; /* variable length data */ > }; > > One ring would be dedicated to holding xensock_ring_entry structures, > one after another in a classic circular fashion. Two indexes, out_cons > and out_prod, would still be used the same way the are used in the > console ring, but I would place them on a separate page for clarity: > > struct xensock_ring_intf { > XENSOCK_RING_IDX out_cons, out_prod; > }; > > The frontend, that is the producer, writes a new struct > xensock_ring_entry to the ring, careful not to exceed the remaining free > bytes available. Then it increments out_prod by the written amount. The > backend, that is the consumer, reads the new struct xensock_ring_entry, > reading as much data as specified by "len". 
Then it increments out_cons > by the size of the struct xensock_ring_entry read. > > I think this could work. Theoretically we could do the same thing for > receive: a separate single ring shared by all active sockets. We could > even reuse struct xensock_ring_entry. > > > However I have doubts that this model could work well for receive. When > sending data, all sockets on the frontend side copy buffers onto this > single ring. If there is no room, the frontend returns ENOBUFS. The > backend picks up the data from the ring and calls sendmsg, which can > also return ENOBUFS. In that case we don't increment out_cons, leaving > the data on the ring. The backend will try again in the near future. > Error messages would have to go on a separate data structure which I > haven't finalized yet. > > When receiving from a socket, the backend copies data to the ring as > soon as data is available, perhaps before the frontend requests the > data. Buffers are copied to the ring not necessarily in the order that > the frontend might want to read them. Thus the frontend would have to > copy them out of the common ring into private per-socket dynamic buffers > just to free the ring as soon as possible and consume the next > xensock_ring_entry. It doesn't look very advantageous in terms of memory > consumption and performance. > > Alternatively, the frontend would have to leave the data on the ring if > the application didn't ask for it yet. In that case the frontend could > look ahead without incrementing the in_cons pointer. It would have to > keep track of which entries have been consumed and which entries have > not been consumed. Only when the ring is full, the frontend would have > no other choice but to copy the data out of the ring into temporary > buffers. I am not sure how well this could work in practice. > > As a compromise, we could use a single shared ring for sending data, and > 1 ring per active socket to receive data. This would cut the per-socket > memory consumption in half (maybe to a quarter, moving out the indexes > from the shared data ring into a separate page) and might be an > acceptable trade-off. > > Any feedback or ideas? > > > Many thanks, > > Stefano > _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-23 16:03 ` Stefano Stabellini @ 2016-06-23 16:57 ` Stefano Stabellini 0 siblings, 0 replies; 9+ messages in thread From: Stefano Stabellini @ 2016-06-23 16:57 UTC (permalink / raw) To: Stefano Stabellini; +Cc: xen-devel, joao.m.martins, wei.liu2, roger.pau Although discussing the goals is fun, feedback on the design of the protocol is particularly welcome. On Thu, 23 Jun 2016, Stefano Stabellini wrote: > Now that Xen 4.7 is out of the door, any more feedback on this? > > On Mon, 6 Jun 2016, Stefano Stabellini wrote: > > Hi all, > > > > a couple of months ago I started working on a new PV protocol for > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > allow the implementation of the POSIX socket API in a domain other than > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > to be implemented directly in Dom0. In a way this is conceptually > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > See this diagram as reference: > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > The frontends and backends could live either in userspace or kernel > > space, with different trade-offs. My current prototype is based on Linux > > kernel drivers but it would be nice to have userspace drivers too. > > Discussing where the drivers could be implemented it's beyond the scope > > of this email. > > > > > > # Goals > > > > The goal of the protocol is to provide networking capabilities to any > > guests, with the following added benefits: > > > > * guest networking should work out of the box with VPNs, wireless > > networks and any other complex network configurations in Dom0 > > > > * guest services should listen on ports bound directly to Dom0 IP > > addresses, fitting naturally in a Docker based workflow, where guests > > are Docker containers > > > > * Dom0 should have full visibility on the guest behavior and should be > > able to perform inexpensive filtering and manipulation of guest calls > > > > * XenSock should provide excellent performance. Unoptimized early code > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > streams. > > > > > > # Status > > > > I would like to get feedback on the high level architecture, the data > > path and the ring formats. > > > > Beware that protocol and drivers are in their very early days. I don't > > have all the information to write a design document yet. The ABI is > > neither complete nor stable. > > > > The code is not ready for xen-devel yet, but I would be happy to push a > > git branch if somebody is interested in contributing to the project. > > > > > > # Design and limitations > > > > The frontend connects to the backend following the traditional xenstore > > based exchange of information. > > > > Frontend and backend setup an event channel and shared ring. The ring is > > used by the frontend to forward socket API calls to the backend. I am > > referring to this ring as command ring. 
This is an example of the ring > > format: > > > > #define XENSOCK_CONNECT 0 > > #define XENSOCK_RELEASE 3 > > #define XENSOCK_BIND 4 > > #define XENSOCK_LISTEN 5 > > #define XENSOCK_ACCEPT 6 > > #define XENSOCK_POLL 7 > > > > struct xen_xensock_request { > > uint32_t id; /* private to guest, echoed in response */ > > uint32_t cmd; /* command to execute */ > > uint64_t sockid; /* id of the socket */ > > union { > > struct xen_xensock_connect { > > uint8_t addr[28]; > > uint32_t len; > > uint32_t flags; > > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > > uint32_t evtchn; > > } connect; > > struct xen_xensock_bind { > > uint8_t addr[28]; /* ipv6 ready */ > > uint32_t len; > > } bind; > > struct xen_xensock_accept { > > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > > uint32_t evtchn; > > uint64_t sockid; > > } accept; > > } u; > > }; > > > > struct xen_xensock_response { > > uint32_t id; > > uint32_t cmd; > > uint64_t sockid; > > int32_t ret; > > }; > > > > DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request, > > struct xen_xensock_response); > > > > > > Connect and accept lead to the creation of new active sockets. Today > > each active socket has its own event channel and ring for sending and > > receiving data. Data rings have the following format: > > > > #define XENSOCK_DATARING_ORDER 2 > > #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER) > > #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT) > > > > typedef uint32_t XENSOCK_RING_IDX; > > > > struct xensock_ring_intf { > > char in[XENSOCK_DATARING_SIZE/4]; > > char out[XENSOCK_DATARING_SIZE/2]; > > XENSOCK_RING_IDX in_cons, in_prod; > > XENSOCK_RING_IDX out_cons, out_prod; > > int32_t in_error, out_error; > > }; > > > > The ring works like the Xen console ring (see > > xen/include/public/io/console.h). Data is copied to/from the ring by > > both frontend and backend. in_error, out_error are used to report > > errors. This simple design works well, but it requires at least 1 page > > per active socket. To get good performance (~20 Gbit/sec single stream), > > we need buffers of at least 64K, so actually we are looking at about 64 > > pages per ring (order 6). > > > > I am currently investigating the usage of AVX2 to perform the data copy. > > > > > > # Brainstorming > > > > Are 64 pages per active socket a reasonable amount in the context of > > modern OS level networking? I believe that regular Linux tcp sockets > > allocate something in that order of magnitude. > > > > If that's too much, I spent some time thinking about ways to reduce it. > > Some ideas follow. > > > > > > We could split up send and receive into two different data structures. I > > am thinking of introducing a single ring for all active sockets with > > variable size messages for sending data. Something like the following: > > > > struct xensock_ring_entry { > > uint64_t sockid; /* identifies a socket */ > > uint32_t len; /* length of data to follow */ > > uint8_t data[]; /* variable length data */ > > }; > > > > One ring would be dedicated to holding xensock_ring_entry structures, > > one after another in a classic circular fashion. Two indexes, out_cons > > and out_prod, would still be used the same way the are used in the > > console ring, but I would place them on a separate page for clarity: > > > > struct xensock_ring_intf { > > XENSOCK_RING_IDX out_cons, out_prod; > > }; > > > > The frontend, that is the producer, writes a new struct > > xensock_ring_entry to the ring, careful not to exceed the remaining free > > bytes available. 
Then it increments out_prod by the written amount. The > > backend, that is the consumer, reads the new struct xensock_ring_entry, > > reading as much data as specified by "len". Then it increments out_cons > > by the size of the struct xensock_ring_entry read. > > > > I think this could work. Theoretically we could do the same thing for > > receive: a separate single ring shared by all active sockets. We could > > even reuse struct xensock_ring_entry. > > > > > > However I have doubts that this model could work well for receive. When > > sending data, all sockets on the frontend side copy buffers onto this > > single ring. If there is no room, the frontend returns ENOBUFS. The > > backend picks up the data from the ring and calls sendmsg, which can > > also return ENOBUFS. In that case we don't increment out_cons, leaving > > the data on the ring. The backend will try again in the near future. > > Error messages would have to go on a separate data structure which I > > haven't finalized yet. > > > > When receiving from a socket, the backend copies data to the ring as > > soon as data is available, perhaps before the frontend requests the > > data. Buffers are copied to the ring not necessarily in the order that > > the frontend might want to read them. Thus the frontend would have to > > copy them out of the common ring into private per-socket dynamic buffers > > just to free the ring as soon as possible and consume the next > > xensock_ring_entry. It doesn't look very advantageous in terms of memory > > consumption and performance. > > > > Alternatively, the frontend would have to leave the data on the ring if > > the application didn't ask for it yet. In that case the frontend could > > look ahead without incrementing the in_cons pointer. It would have to > > keep track of which entries have been consumed and which entries have > > not been consumed. Only when the ring is full, the frontend would have > > no other choice but to copy the data out of the ring into temporary > > buffers. I am not sure how well this could work in practice. > > > > As a compromise, we could use a single shared ring for sending data, and > > 1 ring per active socket to receive data. This would cut the per-socket > > memory consumption in half (maybe to a quarter, moving out the indexes > > from the shared data ring into a separate page) and might be an > > acceptable trade-off. > > > > Any feedback or ideas? > > > > > > Many thanks, > > > > Stefano > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel > _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:33 ` RFC: XenSock brainstorming Stefano Stabellini 2016-06-06 9:57 ` Andrew Cooper 2016-06-23 16:03 ` Stefano Stabellini @ 2016-06-23 16:28 ` David Vrabel 2016-06-23 16:49 ` Stefano Stabellini 2 siblings, 1 reply; 9+ messages in thread From: David Vrabel @ 2016-06-23 16:28 UTC (permalink / raw) To: Stefano Stabellini, xen-devel; +Cc: joao.m.martins, wei.liu2, roger.pau On 06/06/16 10:33, Stefano Stabellini wrote: > # Goals > > The goal of the protocol is to provide networking capabilities to any > guests, with the following added benefits: > > * guest networking should work out of the box with VPNs, wireless > networks and any other complex network configurations in Dom0 > > * guest services should listen on ports bound directly to Dom0 IP > addresses, fitting naturally in a Docker based workflow, where guests > are Docker containers > > * Dom0 should have full visibility on the guest behavior and should be > able to perform inexpensive filtering and manipulation of guest calls > > * XenSock should provide excellent performance. Unoptimized early code > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > streams. I think it looks a bit odd to isolate the workload into a VM and then blow a hole in the isolation by providing a "fat" RPC interface directly to the privileged dom0 kernel. I think you could probably present a regular VIF to the guest and use SDN (e.g., openvswitch) to get your docker-like semantics. David _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming
  2016-06-23 16:28 ` David Vrabel
@ 2016-06-23 16:49   ` Stefano Stabellini
  0 siblings, 0 replies; 9+ messages in thread

From: Stefano Stabellini @ 2016-06-23 16:49 UTC (permalink / raw)
To: David Vrabel
Cc: Stefano Stabellini, xen-devel, joao.m.martins, wei.liu2, roger.pau

On Thu, 23 Jun 2016, David Vrabel wrote:
> On 06/06/16 10:33, Stefano Stabellini wrote:
> > # Goals
> >
> > The goal of the protocol is to provide networking capabilities to any
> > guests, with the following added benefits:
> >
> > * guest networking should work out of the box with VPNs, wireless
> > networks and any other complex network configurations in Dom0
> >
> > * guest services should listen on ports bound directly to Dom0 IP
> > addresses, fitting naturally in a Docker based workflow, where guests
> > are Docker containers
> >
> > * Dom0 should have full visibility on the guest behavior and should be
> > able to perform inexpensive filtering and manipulation of guest calls
> >
> > * XenSock should provide excellent performance. Unoptimized early code
> > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3
> > streams.
>
> I think it looks a bit odd to isolate the workload into a VM and then
> blow a hole in the isolation by providing a "fat" RPC interface directly
> to the privileged dom0 kernel.

It might look odd but this is exactly the goal of the project. The vast
majority of the syscalls will be run entirely within the VM. The ones
that are allowed to reach dom0 are only very few, fewer than 10 today
in fact. It is a big win from a security perspective compared to
containers, and a big win in terms of performance compared to VMs. In
my last test I reached 84 Gbit/sec with 4 TCP streams.

Monitoring the behavior of the guest becomes extremely cheap and easy,
as one can just keep track of the syscalls forwarded to dom0. It would
be trivial to figure out if your NGINX container unexpectedly tried to
open port 22, for example. One would have to employ complex firewall
rules or VM introspection to do this otherwise. In addition, one can
still use all the traditional filtering techniques for these syscalls
in dom0, such as seccomp.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread