* RFC: XenSock brainstorming
@ 2016-06-06  9:33 ` Stefano Stabellini
  2016-06-06  9:57 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread

From: Stefano Stabellini @ 2016-06-06 9:33 UTC (permalink / raw)
To: xen-devel; +Cc: stefano, joao.m.martins, wei.liu2, roger.pau

Hi all,

a couple of months ago I started working on a new PV protocol for
virtualizing syscalls. I named it XenSock, as its main purpose is to
allow the implementation of the POSIX socket API in a domain other than
the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc
to be implemented directly in Dom0. In a way this is conceptually
similar to virtio-9pfs, but for sockets rather than filesystem APIs.
See this diagram as reference:

https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing

The frontends and backends could live either in userspace or kernel
space, with different trade-offs. My current prototype is based on Linux
kernel drivers, but it would be nice to have userspace drivers too.
Discussing where the drivers should be implemented is beyond the scope
of this email.


# Goals

The goal of the protocol is to provide networking capabilities to any
guest, with the following added benefits:

* guest networking should work out of the box with VPNs, wireless
networks and any other complex network configurations in Dom0

* guest services should listen on ports bound directly to Dom0 IP
addresses, fitting naturally into a Docker based workflow, where guests
are Docker containers

* Dom0 should have full visibility into guest behavior and should be
able to perform inexpensive filtering and manipulation of guest calls

* XenSock should provide excellent performance. Unoptimized early code
reaches 22 Gbit/sec on a single TCP stream and scales to 60 Gbit/sec
with 3 streams.


# Status

I would like to get feedback on the high level architecture, the data
path and the ring formats.

Beware that the protocol and drivers are in their very early days. I
don't have all the information to write a design document yet. The ABI
is neither complete nor stable.

The code is not ready for xen-devel yet, but I would be happy to push a
git branch if somebody is interested in contributing to the project.


# Design and limitations

The frontend connects to the backend following the traditional xenstore
based exchange of information.
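Purely as an illustration of what that exchange could look like, the
xenstore nodes might follow the usual Xen PV driver handshake pattern.
Every path, key name and value below is a placeholder of mine, not part
of any proposed ABI:

    /local/domain/<front>/device/xensock/0/backend       = "/local/domain/<back>/backend/xensock/<front>/0"
    /local/domain/<front>/device/xensock/0/backend-id    = "<back>"
    /local/domain/<front>/device/xensock/0/ring-ref      = "<grant ref of the command ring page>"
    /local/domain/<front>/device/xensock/0/event-channel = "<port for command ring notifications>"
    /local/domain/<front>/device/xensock/0/state         = "<XenbusState of the frontend>"
    /local/domain/<back>/backend/xensock/<front>/0/state = "<XenbusState of the backend>"

Both ends would then walk the usual XenbusState handshake until both
report Connected, at which point the command ring described below can
be used.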
Frontend and backend set up an event channel and a shared ring. The
ring is used by the frontend to forward socket API calls to the
backend; I am referring to this ring as the command ring. This is an
example of the ring format:

#define XENSOCK_CONNECT 0
#define XENSOCK_RELEASE 3
#define XENSOCK_BIND    4
#define XENSOCK_LISTEN  5
#define XENSOCK_ACCEPT  6
#define XENSOCK_POLL    7

struct xen_xensock_request {
    uint32_t id;     /* private to guest, echoed in response */
    uint32_t cmd;    /* command to execute */
    uint64_t sockid; /* id of the socket */
    union {
        struct xen_xensock_connect {
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } connect;
        struct xen_xensock_bind {
            uint8_t addr[28]; /* ipv6 ready */
            uint32_t len;
        } bind;
        struct xen_xensock_accept {
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
            uint64_t sockid;
        } accept;
    } u;
};

struct xen_xensock_response {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    int32_t ret;
};

DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
                  struct xen_xensock_response);

Connect and accept lead to the creation of new active sockets. Today
each active socket has its own event channel and ring for sending and
receiving data. Data rings have the following format:

#define XENSOCK_DATARING_ORDER 2
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE  (XENSOCK_DATARING_PAGES << PAGE_SHIFT)

typedef uint32_t XENSOCK_RING_IDX;

struct xensock_ring_intf {
    char in[XENSOCK_DATARING_SIZE/4];
    char out[XENSOCK_DATARING_SIZE/2];
    XENSOCK_RING_IDX in_cons, in_prod;
    XENSOCK_RING_IDX out_cons, out_prod;
    int32_t in_error, out_error;
};

The ring works like the Xen console ring (see
xen/include/public/io/console.h). Data is copied to/from the ring by
both frontend and backend. in_error and out_error are used to report
errors. This simple design works well, but it requires at least 1 page
per active socket. To get good performance (~20 Gbit/sec single
stream), we need buffers of at least 64K, so in practice we are looking
at about 64 pages per ring (order 6).

I am currently investigating the use of AVX2 to perform the data copy.


# Brainstorming

Are 64 pages per active socket a reasonable amount in the context of
modern OS level networking? I believe that regular Linux TCP sockets
allocate something in that order of magnitude.

If that's too much, I spent some time thinking about ways to reduce it.
Some ideas follow.

We could split up send and receive into two different data structures.
I am thinking of introducing a single ring for all active sockets, with
variable size messages for sending data. Something like the following:

struct xensock_ring_entry {
    uint64_t sockid; /* identifies a socket */
    uint32_t len;    /* length of data to follow */
    uint8_t data[];  /* variable length data */
};

One ring would be dedicated to holding xensock_ring_entry structures,
one after another in a classic circular fashion. Two indexes, out_cons
and out_prod, would still be used the same way they are used in the
console ring, but I would place them on a separate page for clarity:

struct xensock_ring_intf {
    XENSOCK_RING_IDX out_cons, out_prod;
};

The frontend, that is the producer, writes a new struct
xensock_ring_entry to the ring, careful not to exceed the remaining
free bytes available, then increments out_prod by the amount written.
The backend, that is the consumer, reads the new struct
xensock_ring_entry, reading as much data as specified by "len", then
increments out_cons by the size of the entry read. I think this could
work.
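To make the producer side of this concrete, here is a rough sketch of
the frontend send path. It is only an illustration: the function name,
the 64K ring size, the byte-by-byte copy and the wmb() barrier are my
own assumptions, not part of any proposed ABI, and notifications and
locking are omitted.

#define XENSOCK_SENDRING_SIZE    (1 << 16) /* assumed: power of two */
#define XENSOCK_SENDRING_MASK(i) ((i) & (XENSOCK_SENDRING_SIZE - 1))

/* Frontend (producer): append one chunk of socket data to the shared
 * send ring, or fail with -ENOBUFS if there is not enough room for the
 * whole xensock_ring_entry header plus payload. */
static int xensock_sendring_write(struct xensock_ring_intf *intf,
                                  uint8_t *ring, uint64_t sockid,
                                  const uint8_t *buf, uint32_t len)
{
    XENSOCK_RING_IDX cons = intf->out_cons;
    XENSOCK_RING_IDX prod = intf->out_prod;
    struct xensock_ring_entry ent = { .sockid = sockid, .len = len };
    uint32_t needed = sizeof(ent) + len;
    uint32_t i;

    if (XENSOCK_SENDRING_SIZE - (prod - cons) < needed)
        return -ENOBUFS; /* not enough free bytes on the ring */

    /* Copy the header, then the payload, byte by byte so that
     * wrap-around at the end of the ring is handled by the mask. */
    for (i = 0; i < sizeof(ent); i++)
        ring[XENSOCK_SENDRING_MASK(prod + i)] = ((uint8_t *)&ent)[i];
    for (i = 0; i < len; i++)
        ring[XENSOCK_SENDRING_MASK(prod + sizeof(ent) + i)] = buf[i];

    wmb(); /* entry must be visible before the producer index moves */
    intf->out_prod = prod + needed;
    /* notify the backend over the event channel here */
    return 0;
}

The backend would do the mirror operation: read the entry header at
out_cons, hand "len" bytes to sendmsg, and only then advance out_cons
by the size of the entry, leaving it in place if sendmsg returns
ENOBUFS, as described below.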
Theoretically we could do the same thing for receive: a separate single
ring shared by all active sockets; we could even reuse struct
xensock_ring_entry.

However I have doubts that this model could work well for receive. When
sending data, all sockets on the frontend side copy buffers onto this
single ring. If there is no room, the frontend returns ENOBUFS. The
backend picks up the data from the ring and calls sendmsg, which can
also return ENOBUFS. In that case we don't increment out_cons, leaving
the data on the ring, and the backend will try again in the near
future. Error messages would have to go on a separate data structure,
which I haven't finalized yet.

When receiving from a socket, the backend copies data to the ring as
soon as data is available, perhaps before the frontend requests the
data. Buffers are copied to the ring not necessarily in the order that
the frontend might want to read them. Thus the frontend would have to
copy them out of the common ring into private per-socket dynamic
buffers, just to free the ring as soon as possible and consume the next
xensock_ring_entry. That doesn't look very advantageous in terms of
memory consumption or performance.

Alternatively, the frontend could leave the data on the ring if the
application hasn't asked for it yet. In that case the frontend could
look ahead without incrementing the in_cons pointer, but it would have
to keep track of which entries have been consumed and which have not.
Only when the ring is full would the frontend have no choice but to
copy the data out of the ring into temporary buffers. I am not sure how
well this could work in practice.

As a compromise, we could use a single shared ring for sending data and
one ring per active socket to receive data. This would cut the
per-socket memory consumption in half (maybe to a quarter, moving the
indexes out of the shared data ring into a separate page) and might be
an acceptable trade-off.

Any feedback or ideas?

Many thanks,

Stefano

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:33 ` RFC: XenSock brainstorming Stefano Stabellini @ 2016-06-06 9:57 ` Andrew Cooper 2016-06-06 10:16 ` Paul Durrant 2016-06-06 10:25 ` Stefano Stabellini 2016-06-23 16:03 ` Stefano Stabellini 2016-06-23 16:28 ` David Vrabel 2 siblings, 2 replies; 9+ messages in thread From: Andrew Cooper @ 2016-06-06 9:57 UTC (permalink / raw) To: Stefano Stabellini, xen-devel; +Cc: joao.m.martins, wei.liu2, roger.pau On 06/06/16 10:33, Stefano Stabellini wrote: > Hi all, > > a couple of months ago I started working on a new PV protocol for > virtualizing syscalls. I named it XenSock, as its main purpose is to > allow the implementation of the POSIX socket API in a domain other than > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > to be implemented directly in Dom0. In a way this is conceptually > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > See this diagram as reference: > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > The frontends and backends could live either in userspace or kernel > space, with different trade-offs. My current prototype is based on Linux > kernel drivers but it would be nice to have userspace drivers too. > Discussing where the drivers could be implemented it's beyond the scope > of this email. Just to confirm, you are intending to create a cross-domain transport for all AF_ socket types, or just some? > > > # Goals > > The goal of the protocol is to provide networking capabilities to any > guests, with the following added benefits: Throughout, s/Dom0/the backend/ I expect running the backend in dom0 will be the overwhelmingly common configuration, but you should avoid designing the protocol for just this usecase. > > * guest networking should work out of the box with VPNs, wireless > networks and any other complex network configurations in Dom0 > > * guest services should listen on ports bound directly to Dom0 IP > addresses, fitting naturally in a Docker based workflow, where guests > are Docker containers > > * Dom0 should have full visibility on the guest behavior and should be > able to perform inexpensive filtering and manipulation of guest calls > > * XenSock should provide excellent performance. Unoptimized early code > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > streams. What happens if domU tries to open an AF_INET socket, and the domain has both sockfront and netfront ? What happens if a domain has multiple sockfronts? ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:57 ` Andrew Cooper @ 2016-06-06 10:16 ` Paul Durrant 2016-06-06 10:48 ` Stefano Stabellini 2016-06-06 10:25 ` Stefano Stabellini 1 sibling, 1 reply; 9+ messages in thread From: Paul Durrant @ 2016-06-06 10:16 UTC (permalink / raw) To: Andrew Cooper, Stefano Stabellini, xen-devel Cc: joao.m.martins, Wei Liu, Roger Pau Monne > -----Original Message----- > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of > Andrew Cooper > Sent: 06 June 2016 10:58 > To: Stefano Stabellini; xen-devel@lists.xenproject.org > Cc: joao.m.martins@oracle.com; Wei Liu; Roger Pau Monne > Subject: Re: [Xen-devel] RFC: XenSock brainstorming > > On 06/06/16 10:33, Stefano Stabellini wrote: > > Hi all, > > > > a couple of months ago I started working on a new PV protocol for > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > allow the implementation of the POSIX socket API in a domain other than > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > to be implemented directly in Dom0. In a way this is conceptually > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > See this diagram as reference: > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ- > Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > The frontends and backends could live either in userspace or kernel > > space, with different trade-offs. My current prototype is based on Linux > > kernel drivers but it would be nice to have userspace drivers too. > > Discussing where the drivers could be implemented it's beyond the scope > > of this email. > > Just to confirm, you are intending to create a cross-domain transport > for all AF_ socket types, or just some? > > > > > > > # Goals > > > > The goal of the protocol is to provide networking capabilities to any > > guests, with the following added benefits: > > Throughout, s/Dom0/the backend/ > > I expect running the backend in dom0 will be the overwhelmingly common > configuration, but you should avoid designing the protocol for just this > usecase. > > > > > * guest networking should work out of the box with VPNs, wireless > > networks and any other complex network configurations in Dom0 > > > > * guest services should listen on ports bound directly to Dom0 IP > > addresses, fitting naturally in a Docker based workflow, where guests > > are Docker containers > > > > * Dom0 should have full visibility on the guest behavior and should be > > able to perform inexpensive filtering and manipulation of guest calls > > > > * XenSock should provide excellent performance. Unoptimized early code > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > streams. > > What happens if domU tries to open an AF_INET socket, and the domain has > both sockfront and netfront ? What happens if a domain has multiple > sockfronts? > This sounds awfully like a class of problem that the open onload (http://www.openonload.org/) stack had to solve, and it involved having to track updates to various kernel tables involved in inet routing and having to keep a 'standard' inet socket in hand even when setting up an intercepted (read 'PV' for this connect ) socket since, until connect, you don’t know what the far end is or how to get to it. Having your own AF is definitely a much easier starting point. It also means you get to define all the odd corner-case semantics rather than having to emulate Linux/BSD/Solaris/etc. quirks. 
Paul > ~Andrew > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 10:16 ` Paul Durrant @ 2016-06-06 10:48 ` Stefano Stabellini 0 siblings, 0 replies; 9+ messages in thread From: Stefano Stabellini @ 2016-06-06 10:48 UTC (permalink / raw) To: Paul Durrant Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, xen-devel, joao.m.martins, Roger Pau Monne [-- Attachment #1: Type: TEXT/PLAIN, Size: 3571 bytes --] On Mon, 6 Jun 2016, Paul Durrant wrote: > > -----Original Message----- > > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of > > Andrew Cooper > > Sent: 06 June 2016 10:58 > > To: Stefano Stabellini; xen-devel@lists.xenproject.org > > Cc: joao.m.martins@oracle.com; Wei Liu; Roger Pau Monne > > Subject: Re: [Xen-devel] RFC: XenSock brainstorming > > > > On 06/06/16 10:33, Stefano Stabellini wrote: > > > Hi all, > > > > > > a couple of months ago I started working on a new PV protocol for > > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > > allow the implementation of the POSIX socket API in a domain other than > > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > > to be implemented directly in Dom0. In a way this is conceptually > > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > > See this diagram as reference: > > > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ- > > Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > > > The frontends and backends could live either in userspace or kernel > > > space, with different trade-offs. My current prototype is based on Linux > > > kernel drivers but it would be nice to have userspace drivers too. > > > Discussing where the drivers could be implemented it's beyond the scope > > > of this email. > > > > Just to confirm, you are intending to create a cross-domain transport > > for all AF_ socket types, or just some? > > > > > > > > > > > # Goals > > > > > > The goal of the protocol is to provide networking capabilities to any > > > guests, with the following added benefits: > > > > Throughout, s/Dom0/the backend/ > > > > I expect running the backend in dom0 will be the overwhelmingly common > > configuration, but you should avoid designing the protocol for just this > > usecase. > > > > > > > > * guest networking should work out of the box with VPNs, wireless > > > networks and any other complex network configurations in Dom0 > > > > > > * guest services should listen on ports bound directly to Dom0 IP > > > addresses, fitting naturally in a Docker based workflow, where guests > > > are Docker containers > > > > > > * Dom0 should have full visibility on the guest behavior and should be > > > able to perform inexpensive filtering and manipulation of guest calls > > > > > > * XenSock should provide excellent performance. Unoptimized early code > > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > > streams. > > > > What happens if domU tries to open an AF_INET socket, and the domain has > > both sockfront and netfront ? What happens if a domain has multiple > > sockfronts? > > > > This sounds awfully like a class of problem that the open onload (http://www.openonload.org/) stack had to solve, and it involved having to track updates to various kernel tables involved in inet routing and having to keep a 'standard' inet socket in hand even when setting up an intercepted (read 'PV' for this connect ) socket since, until connect, you don’t know what the far end is or how to get to it. 
>
> Having your own AF is definitely a much easier starting point. It also
> means you get to define all the odd corner-case semantics rather than
> having to emulate Linux/BSD/Solaris/etc. quirks.

Thanks for the pointer, I'll have a look. Other related work includes:

VirtuOS: http://people.cs.vt.edu/~gback/papers/sosp13final.pdf
Virtio-vsock: http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:57 ` Andrew Cooper 2016-06-06 10:16 ` Paul Durrant @ 2016-06-06 10:25 ` Stefano Stabellini 1 sibling, 0 replies; 9+ messages in thread From: Stefano Stabellini @ 2016-06-06 10:25 UTC (permalink / raw) To: Andrew Cooper Cc: Stefano Stabellini, xen-devel, joao.m.martins, wei.liu2, roger.pau On Mon, 6 Jun 2016, Andrew Cooper wrote: > On 06/06/16 10:33, Stefano Stabellini wrote: > > Hi all, > > > > a couple of months ago I started working on a new PV protocol for > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > allow the implementation of the POSIX socket API in a domain other than > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > to be implemented directly in Dom0. In a way this is conceptually > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > See this diagram as reference: > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > The frontends and backends could live either in userspace or kernel > > space, with different trade-offs. My current prototype is based on Linux > > kernel drivers but it would be nice to have userspace drivers too. > > Discussing where the drivers could be implemented it's beyond the scope > > of this email. > > Just to confirm, you are intending to create a cross-domain transport > for all AF_ socket types, or just some? My use case is for AF_INET, so that's what I intend to implement. If somebody wanted to come along and implement AF_IPX for example, I would be fine with that and I would welcome the effort. > > # Goals > > > > The goal of the protocol is to provide networking capabilities to any > > guests, with the following added benefits: > > Throughout, s/Dom0/the backend/ > > I expect running the backend in dom0 will be the overwhelmingly common > configuration, but you should avoid designing the protocol for just this > usecase. As always I am happy to make this as generic and reusable as possible. The goals stated here are my goals with this protocol and I hope many readers will share some of them with me. Although I don't have an interest for running the backend in a domain other than Dom0, there is nothing in the current design (or even my early code) that would prevent driver domains from working. > > * guest networking should work out of the box with VPNs, wireless > > networks and any other complex network configurations in Dom0 > > > > * guest services should listen on ports bound directly to Dom0 IP > > addresses, fitting naturally in a Docker based workflow, where guests > > are Docker containers > > > > * Dom0 should have full visibility on the guest behavior and should be > > able to perform inexpensive filtering and manipulation of guest calls > > > > * XenSock should provide excellent performance. Unoptimized early code > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > streams. > > What happens if domU tries to open an AF_INET socket, and the domain has > both sockfront and netfront ? I wouldn't encourage this configuration. However it works more naturally than one would expect: depending on how DomU is configured, if the AF_INET socket calls are routed to the XenSock frontend, then they are going to appear to come out from Dom0, otherwise they will be routed as usual. 
So, for example, if the frontend is implemented in userspace in a
modified libc library, applications in the guest that use the library
have their data go through XenSock, while everything else goes through
netfront.

> What happens if a domain has multiple sockfronts?

I don't think that should be a valid configuration. I cannot think of a
case where one would want something like that. But if somebody comes up
with a valid scenario for why and how this should work, I would be
happy to work with her to make it happen.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:33 ` RFC: XenSock brainstorming Stefano Stabellini 2016-06-06 9:57 ` Andrew Cooper @ 2016-06-23 16:03 ` Stefano Stabellini 2016-06-23 16:57 ` Stefano Stabellini 2016-06-23 16:28 ` David Vrabel 2 siblings, 1 reply; 9+ messages in thread From: Stefano Stabellini @ 2016-06-23 16:03 UTC (permalink / raw) To: Stefano Stabellini; +Cc: xen-devel, joao.m.martins, wei.liu2, roger.pau Now that Xen 4.7 is out of the door, any more feedback on this? On Mon, 6 Jun 2016, Stefano Stabellini wrote: > Hi all, > > a couple of months ago I started working on a new PV protocol for > virtualizing syscalls. I named it XenSock, as its main purpose is to > allow the implementation of the POSIX socket API in a domain other than > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > to be implemented directly in Dom0. In a way this is conceptually > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > See this diagram as reference: > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > The frontends and backends could live either in userspace or kernel > space, with different trade-offs. My current prototype is based on Linux > kernel drivers but it would be nice to have userspace drivers too. > Discussing where the drivers could be implemented it's beyond the scope > of this email. > > > # Goals > > The goal of the protocol is to provide networking capabilities to any > guests, with the following added benefits: > > * guest networking should work out of the box with VPNs, wireless > networks and any other complex network configurations in Dom0 > > * guest services should listen on ports bound directly to Dom0 IP > addresses, fitting naturally in a Docker based workflow, where guests > are Docker containers > > * Dom0 should have full visibility on the guest behavior and should be > able to perform inexpensive filtering and manipulation of guest calls > > * XenSock should provide excellent performance. Unoptimized early code > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > streams. > > > # Status > > I would like to get feedback on the high level architecture, the data > path and the ring formats. > > Beware that protocol and drivers are in their very early days. I don't > have all the information to write a design document yet. The ABI is > neither complete nor stable. > > The code is not ready for xen-devel yet, but I would be happy to push a > git branch if somebody is interested in contributing to the project. > > > # Design and limitations > > The frontend connects to the backend following the traditional xenstore > based exchange of information. > > Frontend and backend setup an event channel and shared ring. The ring is > used by the frontend to forward socket API calls to the backend. I am > referring to this ring as command ring. 
This is an example of the ring > format: > > #define XENSOCK_CONNECT 0 > #define XENSOCK_RELEASE 3 > #define XENSOCK_BIND 4 > #define XENSOCK_LISTEN 5 > #define XENSOCK_ACCEPT 6 > #define XENSOCK_POLL 7 > > struct xen_xensock_request { > uint32_t id; /* private to guest, echoed in response */ > uint32_t cmd; /* command to execute */ > uint64_t sockid; /* id of the socket */ > union { > struct xen_xensock_connect { > uint8_t addr[28]; > uint32_t len; > uint32_t flags; > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > uint32_t evtchn; > } connect; > struct xen_xensock_bind { > uint8_t addr[28]; /* ipv6 ready */ > uint32_t len; > } bind; > struct xen_xensock_accept { > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > uint32_t evtchn; > uint64_t sockid; > } accept; > } u; > }; > > struct xen_xensock_response { > uint32_t id; > uint32_t cmd; > uint64_t sockid; > int32_t ret; > }; > > DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request, > struct xen_xensock_response); > > > Connect and accept lead to the creation of new active sockets. Today > each active socket has its own event channel and ring for sending and > receiving data. Data rings have the following format: > > #define XENSOCK_DATARING_ORDER 2 > #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER) > #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT) > > typedef uint32_t XENSOCK_RING_IDX; > > struct xensock_ring_intf { > char in[XENSOCK_DATARING_SIZE/4]; > char out[XENSOCK_DATARING_SIZE/2]; > XENSOCK_RING_IDX in_cons, in_prod; > XENSOCK_RING_IDX out_cons, out_prod; > int32_t in_error, out_error; > }; > > The ring works like the Xen console ring (see > xen/include/public/io/console.h). Data is copied to/from the ring by > both frontend and backend. in_error, out_error are used to report > errors. This simple design works well, but it requires at least 1 page > per active socket. To get good performance (~20 Gbit/sec single stream), > we need buffers of at least 64K, so actually we are looking at about 64 > pages per ring (order 6). > > I am currently investigating the usage of AVX2 to perform the data copy. > > > # Brainstorming > > Are 64 pages per active socket a reasonable amount in the context of > modern OS level networking? I believe that regular Linux tcp sockets > allocate something in that order of magnitude. > > If that's too much, I spent some time thinking about ways to reduce it. > Some ideas follow. > > > We could split up send and receive into two different data structures. I > am thinking of introducing a single ring for all active sockets with > variable size messages for sending data. Something like the following: > > struct xensock_ring_entry { > uint64_t sockid; /* identifies a socket */ > uint32_t len; /* length of data to follow */ > uint8_t data[]; /* variable length data */ > }; > > One ring would be dedicated to holding xensock_ring_entry structures, > one after another in a classic circular fashion. Two indexes, out_cons > and out_prod, would still be used the same way the are used in the > console ring, but I would place them on a separate page for clarity: > > struct xensock_ring_intf { > XENSOCK_RING_IDX out_cons, out_prod; > }; > > The frontend, that is the producer, writes a new struct > xensock_ring_entry to the ring, careful not to exceed the remaining free > bytes available. Then it increments out_prod by the written amount. The > backend, that is the consumer, reads the new struct xensock_ring_entry, > reading as much data as specified by "len". 
Then it increments out_cons > by the size of the struct xensock_ring_entry read. > > I think this could work. Theoretically we could do the same thing for > receive: a separate single ring shared by all active sockets. We could > even reuse struct xensock_ring_entry. > > > However I have doubts that this model could work well for receive. When > sending data, all sockets on the frontend side copy buffers onto this > single ring. If there is no room, the frontend returns ENOBUFS. The > backend picks up the data from the ring and calls sendmsg, which can > also return ENOBUFS. In that case we don't increment out_cons, leaving > the data on the ring. The backend will try again in the near future. > Error messages would have to go on a separate data structure which I > haven't finalized yet. > > When receiving from a socket, the backend copies data to the ring as > soon as data is available, perhaps before the frontend requests the > data. Buffers are copied to the ring not necessarily in the order that > the frontend might want to read them. Thus the frontend would have to > copy them out of the common ring into private per-socket dynamic buffers > just to free the ring as soon as possible and consume the next > xensock_ring_entry. It doesn't look very advantageous in terms of memory > consumption and performance. > > Alternatively, the frontend would have to leave the data on the ring if > the application didn't ask for it yet. In that case the frontend could > look ahead without incrementing the in_cons pointer. It would have to > keep track of which entries have been consumed and which entries have > not been consumed. Only when the ring is full, the frontend would have > no other choice but to copy the data out of the ring into temporary > buffers. I am not sure how well this could work in practice. > > As a compromise, we could use a single shared ring for sending data, and > 1 ring per active socket to receive data. This would cut the per-socket > memory consumption in half (maybe to a quarter, moving out the indexes > from the shared data ring into a separate page) and might be an > acceptable trade-off. > > Any feedback or ideas? > > > Many thanks, > > Stefano > _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-23 16:03 ` Stefano Stabellini @ 2016-06-23 16:57 ` Stefano Stabellini 0 siblings, 0 replies; 9+ messages in thread From: Stefano Stabellini @ 2016-06-23 16:57 UTC (permalink / raw) To: Stefano Stabellini; +Cc: xen-devel, joao.m.martins, wei.liu2, roger.pau Although discussing the goals is fun, feedback on the design of the protocol is particularly welcome. On Thu, 23 Jun 2016, Stefano Stabellini wrote: > Now that Xen 4.7 is out of the door, any more feedback on this? > > On Mon, 6 Jun 2016, Stefano Stabellini wrote: > > Hi all, > > > > a couple of months ago I started working on a new PV protocol for > > virtualizing syscalls. I named it XenSock, as its main purpose is to > > allow the implementation of the POSIX socket API in a domain other than > > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc > > to be implemented directly in Dom0. In a way this is conceptually > > similar to virtio-9pfs, but for sockets rather than filesystem APIs. > > See this diagram as reference: > > > > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing > > > > The frontends and backends could live either in userspace or kernel > > space, with different trade-offs. My current prototype is based on Linux > > kernel drivers but it would be nice to have userspace drivers too. > > Discussing where the drivers could be implemented it's beyond the scope > > of this email. > > > > > > # Goals > > > > The goal of the protocol is to provide networking capabilities to any > > guests, with the following added benefits: > > > > * guest networking should work out of the box with VPNs, wireless > > networks and any other complex network configurations in Dom0 > > > > * guest services should listen on ports bound directly to Dom0 IP > > addresses, fitting naturally in a Docker based workflow, where guests > > are Docker containers > > > > * Dom0 should have full visibility on the guest behavior and should be > > able to perform inexpensive filtering and manipulation of guest calls > > > > * XenSock should provide excellent performance. Unoptimized early code > > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > > streams. > > > > > > # Status > > > > I would like to get feedback on the high level architecture, the data > > path and the ring formats. > > > > Beware that protocol and drivers are in their very early days. I don't > > have all the information to write a design document yet. The ABI is > > neither complete nor stable. > > > > The code is not ready for xen-devel yet, but I would be happy to push a > > git branch if somebody is interested in contributing to the project. > > > > > > # Design and limitations > > > > The frontend connects to the backend following the traditional xenstore > > based exchange of information. > > > > Frontend and backend setup an event channel and shared ring. The ring is > > used by the frontend to forward socket API calls to the backend. I am > > referring to this ring as command ring. 
This is an example of the ring > > format: > > > > #define XENSOCK_CONNECT 0 > > #define XENSOCK_RELEASE 3 > > #define XENSOCK_BIND 4 > > #define XENSOCK_LISTEN 5 > > #define XENSOCK_ACCEPT 6 > > #define XENSOCK_POLL 7 > > > > struct xen_xensock_request { > > uint32_t id; /* private to guest, echoed in response */ > > uint32_t cmd; /* command to execute */ > > uint64_t sockid; /* id of the socket */ > > union { > > struct xen_xensock_connect { > > uint8_t addr[28]; > > uint32_t len; > > uint32_t flags; > > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > > uint32_t evtchn; > > } connect; > > struct xen_xensock_bind { > > uint8_t addr[28]; /* ipv6 ready */ > > uint32_t len; > > } bind; > > struct xen_xensock_accept { > > grant_ref_t ref[XENSOCK_DATARING_PAGES]; > > uint32_t evtchn; > > uint64_t sockid; > > } accept; > > } u; > > }; > > > > struct xen_xensock_response { > > uint32_t id; > > uint32_t cmd; > > uint64_t sockid; > > int32_t ret; > > }; > > > > DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request, > > struct xen_xensock_response); > > > > > > Connect and accept lead to the creation of new active sockets. Today > > each active socket has its own event channel and ring for sending and > > receiving data. Data rings have the following format: > > > > #define XENSOCK_DATARING_ORDER 2 > > #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER) > > #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT) > > > > typedef uint32_t XENSOCK_RING_IDX; > > > > struct xensock_ring_intf { > > char in[XENSOCK_DATARING_SIZE/4]; > > char out[XENSOCK_DATARING_SIZE/2]; > > XENSOCK_RING_IDX in_cons, in_prod; > > XENSOCK_RING_IDX out_cons, out_prod; > > int32_t in_error, out_error; > > }; > > > > The ring works like the Xen console ring (see > > xen/include/public/io/console.h). Data is copied to/from the ring by > > both frontend and backend. in_error, out_error are used to report > > errors. This simple design works well, but it requires at least 1 page > > per active socket. To get good performance (~20 Gbit/sec single stream), > > we need buffers of at least 64K, so actually we are looking at about 64 > > pages per ring (order 6). > > > > I am currently investigating the usage of AVX2 to perform the data copy. > > > > > > # Brainstorming > > > > Are 64 pages per active socket a reasonable amount in the context of > > modern OS level networking? I believe that regular Linux tcp sockets > > allocate something in that order of magnitude. > > > > If that's too much, I spent some time thinking about ways to reduce it. > > Some ideas follow. > > > > > > We could split up send and receive into two different data structures. I > > am thinking of introducing a single ring for all active sockets with > > variable size messages for sending data. Something like the following: > > > > struct xensock_ring_entry { > > uint64_t sockid; /* identifies a socket */ > > uint32_t len; /* length of data to follow */ > > uint8_t data[]; /* variable length data */ > > }; > > > > One ring would be dedicated to holding xensock_ring_entry structures, > > one after another in a classic circular fashion. Two indexes, out_cons > > and out_prod, would still be used the same way the are used in the > > console ring, but I would place them on a separate page for clarity: > > > > struct xensock_ring_intf { > > XENSOCK_RING_IDX out_cons, out_prod; > > }; > > > > The frontend, that is the producer, writes a new struct > > xensock_ring_entry to the ring, careful not to exceed the remaining free > > bytes available. 
Then it increments out_prod by the written amount. The > > backend, that is the consumer, reads the new struct xensock_ring_entry, > > reading as much data as specified by "len". Then it increments out_cons > > by the size of the struct xensock_ring_entry read. > > > > I think this could work. Theoretically we could do the same thing for > > receive: a separate single ring shared by all active sockets. We could > > even reuse struct xensock_ring_entry. > > > > > > However I have doubts that this model could work well for receive. When > > sending data, all sockets on the frontend side copy buffers onto this > > single ring. If there is no room, the frontend returns ENOBUFS. The > > backend picks up the data from the ring and calls sendmsg, which can > > also return ENOBUFS. In that case we don't increment out_cons, leaving > > the data on the ring. The backend will try again in the near future. > > Error messages would have to go on a separate data structure which I > > haven't finalized yet. > > > > When receiving from a socket, the backend copies data to the ring as > > soon as data is available, perhaps before the frontend requests the > > data. Buffers are copied to the ring not necessarily in the order that > > the frontend might want to read them. Thus the frontend would have to > > copy them out of the common ring into private per-socket dynamic buffers > > just to free the ring as soon as possible and consume the next > > xensock_ring_entry. It doesn't look very advantageous in terms of memory > > consumption and performance. > > > > Alternatively, the frontend would have to leave the data on the ring if > > the application didn't ask for it yet. In that case the frontend could > > look ahead without incrementing the in_cons pointer. It would have to > > keep track of which entries have been consumed and which entries have > > not been consumed. Only when the ring is full, the frontend would have > > no other choice but to copy the data out of the ring into temporary > > buffers. I am not sure how well this could work in practice. > > > > As a compromise, we could use a single shared ring for sending data, and > > 1 ring per active socket to receive data. This would cut the per-socket > > memory consumption in half (maybe to a quarter, moving out the indexes > > from the shared data ring into a separate page) and might be an > > acceptable trade-off. > > > > Any feedback or ideas? > > > > > > Many thanks, > > > > Stefano > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel > _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming 2016-06-06 9:33 ` RFC: XenSock brainstorming Stefano Stabellini 2016-06-06 9:57 ` Andrew Cooper 2016-06-23 16:03 ` Stefano Stabellini @ 2016-06-23 16:28 ` David Vrabel 2016-06-23 16:49 ` Stefano Stabellini 2 siblings, 1 reply; 9+ messages in thread From: David Vrabel @ 2016-06-23 16:28 UTC (permalink / raw) To: Stefano Stabellini, xen-devel; +Cc: joao.m.martins, wei.liu2, roger.pau On 06/06/16 10:33, Stefano Stabellini wrote: > # Goals > > The goal of the protocol is to provide networking capabilities to any > guests, with the following added benefits: > > * guest networking should work out of the box with VPNs, wireless > networks and any other complex network configurations in Dom0 > > * guest services should listen on ports bound directly to Dom0 IP > addresses, fitting naturally in a Docker based workflow, where guests > are Docker containers > > * Dom0 should have full visibility on the guest behavior and should be > able to perform inexpensive filtering and manipulation of guest calls > > * XenSock should provide excellent performance. Unoptimized early code > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 > streams. I think it looks a bit odd to isolate the workload into a VM and then blow a hole in the isolation by providing a "fat" RPC interface directly to the privileged dom0 kernel. I think you could probably present a regular VIF to the guest and use SDN (e.g., openvswitch) to get your docker-like semantics. David _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: RFC: XenSock brainstorming
  2016-06-23 16:28 ` David Vrabel
@ 2016-06-23 16:49   ` Stefano Stabellini
  0 siblings, 0 replies; 9+ messages in thread

From: Stefano Stabellini @ 2016-06-23 16:49 UTC (permalink / raw)
To: David Vrabel
Cc: Stefano Stabellini, xen-devel, joao.m.martins, wei.liu2, roger.pau

On Thu, 23 Jun 2016, David Vrabel wrote:
> On 06/06/16 10:33, Stefano Stabellini wrote:
> > # Goals
> >
> > The goal of the protocol is to provide networking capabilities to any
> > guests, with the following added benefits:
> >
> > * guest networking should work out of the box with VPNs, wireless
> > networks and any other complex network configurations in Dom0
> >
> > * guest services should listen on ports bound directly to Dom0 IP
> > addresses, fitting naturally in a Docker based workflow, where guests
> > are Docker containers
> >
> > * Dom0 should have full visibility on the guest behavior and should be
> > able to perform inexpensive filtering and manipulation of guest calls
> >
> > * XenSock should provide excellent performance. Unoptimized early code
> > reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3
> > streams.
>
> I think it looks a bit odd to isolate the workload into a VM and then
> blow a hole in the isolation by providing a "fat" RPC interface directly
> to the privileged dom0 kernel.

It might look odd but this is exactly the goal of the project. The vast
majority of the syscalls will be run entirely within the VM. The ones
that are allowed to reach dom0 are only very few, fewer than 10 today
in fact. It is a big win from a security perspective compared to
containers, and a big win in terms of performance compared to VMs. In
my last test I reached 84 Gbit/sec with 4 TCP streams.

Monitoring the behavior of the guest becomes extremely cheap and easy,
as one can just keep track of the syscalls forwarded to dom0. It would
be trivial to figure out if your NGINX container unexpectedly tried to
open port 22, for example. One would have to employ complex firewall
rules or VM introspection to do this otherwise. In addition, one can
still use all the traditional filtering techniques for these syscalls
in dom0, such as seccomp.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 9+ messages in thread