RFC: XenSock brainstorming

* RFC: XenSock brainstorming
       [not found] ` <CAAe9sUHsKXsvD5aK9PHeTYRwq8+0Q9yXK2sPY+Fk=5kErBri8A@mail.gmail.com>
@ 2016-06-06  9:33   ` Stefano Stabellini
  2016-06-06  9:57     ` Andrew Cooper
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Stefano Stabellini @ 2016-06-06  9:33 UTC (permalink / raw)
  To: xen-devel; +Cc: stefano, joao.m.martins, wei.liu2, roger.pau

Hi all,

a couple of months ago I started working on a new PV protocol for
virtualizing syscalls. I named it XenSock, as its main purpose is to
allow the implementation of the POSIX socket API in a domain other than
the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc
to be implemented directly in Dom0. In a way this is conceptually
similar to virtio-9pfs, but for sockets rather than filesystem APIs.
See this diagram as reference:

https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing

The frontends and backends could live either in userspace or kernel
space, with different trade-offs. My current prototype is based on Linux
kernel drivers but it would be nice to have userspace drivers too.
Discussing where the drivers could be implemented it's beyond the scope
of this email.

# Goals

The goal of the protocol is to provide networking capabilities to any
guests, with the following added benefits:

* guest networking should work out of the box with VPNs, wireless
  networks and any other complex network configurations in Dom0

* guest services should listen on ports bound directly to Dom0 IP
  addresses, fitting naturally in a Docker based workflow, where guests
  are Docker containers

* Dom0 should have full visibility on the guest behavior and should be
  able to perform inexpensive filtering and manipulation of guest calls

* XenSock should provide excellent performance. Unoptimized early code
  reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3
  streams.

# Status

I would like to get feedback on the high level architecture, the data
path and the ring formats.

Beware that protocol and drivers are in their very early days. I don't
have all the information to write a design document yet. The ABI is
neither complete nor stable.

The code is not ready for xen-devel yet, but I would be happy to push a
git branch if somebody is interested in contributing to the project.

# Design and limitations

The frontend connects to the backend following the traditional xenstore
based exchange of information.

Frontend and backend setup an event channel and shared ring. The ring is
used by the frontend to forward socket API calls to the backend. I am
referring to this ring as command ring. This is an example of the ring
format:

#define XENSOCK_CONNECT        0
#define XENSOCK_RELEASE        3
#define XENSOCK_BIND           4
#define XENSOCK_LISTEN         5
#define XENSOCK_ACCEPT         6
#define XENSOCK_POLL           7

struct xen_xensock_request {
	uint32_t id;     /* private to guest, echoed in response */
	uint32_t cmd;    /* command to execute */
	uint64_t sockid; /* id of the socket */
	union {
		struct xen_xensock_connect {
			uint8_t addr[28];
			uint32_t len;
			uint32_t flags;
			grant_ref_t ref[XENSOCK_DATARING_PAGES];
			uint32_t evtchn;
		} connect;
		struct xen_xensock_bind {
			uint8_t addr[28]; /* ipv6 ready */
			uint32_t len;
		} bind;
		struct xen_xensock_accept {
			grant_ref_t ref[XENSOCK_DATARING_PAGES];
			uint32_t evtchn;
			uint64_t sockid;
		} accept;
	} u;
};

struct xen_xensock_response {
	uint32_t id;
	uint32_t cmd;
	uint64_t sockid;
	int32_t ret;
};

DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
		  struct xen_xensock_response);

Connect and accept lead to the creation of new active sockets. Today
each active socket has its own event channel and ring for sending and
receiving data. Data rings have the following format:

#define XENSOCK_DATARING_ORDER 2
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)

typedef uint32_t XENSOCK_RING_IDX;

struct xensock_ring_intf {
	char in[XENSOCK_DATARING_SIZE/4];
	char out[XENSOCK_DATARING_SIZE/2];
	XENSOCK_RING_IDX in_cons, in_prod;
	XENSOCK_RING_IDX out_cons, out_prod;
	int32_t in_error, out_error;
};

The ring works like the Xen console ring (see
xen/include/public/io/console.h). Data is copied to/from the ring by
both frontend and backend. in_error, out_error are used to report
errors. This simple design works well, but it requires at least 1 page
per active socket. To get good performance (~20 Gbit/sec single stream),
we need buffers of at least 64K, so actually we are looking at about 64
pages per ring (order 6).

I am currently investigating the usage of AVX2 to perform the data copy.

# Brainstorming

Are 64 pages per active socket a reasonable amount in the context of
modern OS level networking? I believe that regular Linux tcp sockets
allocate something in that order of magnitude.

If that's too much, I spent some time thinking about ways to reduce it.
Some ideas follow.

We could split up send and receive into two different data structures. I
am thinking of introducing a single ring for all active sockets with
variable size messages for sending data. Something like the following:

struct xensock_ring_entry {
	uint64_t sockid; /* identifies a socket */
	uint32_t len;    /* length of data to follow */
	uint8_t data[];  /* variable length data */
};

One ring would be dedicated to holding xensock_ring_entry structures,
one after another in a classic circular fashion. Two indexes, out_cons
and out_prod, would still be used the same way the are used in the
console ring, but I would place them on a separate page for clarity:

struct xensock_ring_intf {
	XENSOCK_RING_IDX out_cons, out_prod;
};

The frontend, that is the producer, writes a new struct
xensock_ring_entry to the ring, careful not to exceed the remaining free
bytes available. Then it increments out_prod by the written amount. The
backend, that is the consumer, reads the new struct xensock_ring_entry,
reading as much data as specified by "len". Then it increments out_cons
by the size of the struct xensock_ring_entry read.

I think this could work. Theoretically we could do the same thing for
receive: a separate single ring shared by all active sockets. We could
even reuse struct xensock_ring_entry.

However I have doubts that this model could work well for receive. When
sending data, all sockets on the frontend side copy buffers onto this
single ring. If there is no room, the frontend returns ENOBUFS. The
backend picks up the data from the ring and calls sendmsg, which can
also return ENOBUFS. In that case we don't increment out_cons, leaving
the data on the ring. The backend will try again in the near future.
Error messages would have to go on a separate data structure which I
haven't finalized yet.

When receiving from a socket, the backend copies data to the ring as
soon as data is available, perhaps before the frontend requests the
data. Buffers are copied to the ring not necessarily in the order that
the frontend might want to read them. Thus the frontend would have to
copy them out of the common ring into private per-socket dynamic buffers
just to free the ring as soon as possible and consume the next
xensock_ring_entry. It doesn't look very advantageous in terms of memory
consumption and performance.

Alternatively, the frontend would have to leave the data on the ring if
the application didn't ask for it yet. In that case the frontend could
look ahead without incrementing the in_cons pointer. It would have to
keep track of which entries have been consumed and which entries have
not been consumed. Only when the ring is full, the frontend would have
no other choice but to copy the data out of the ring into temporary
buffers. I am not sure how well this could work in practice.

As a compromise, we could use a single shared ring for sending data, and
1 ring per active socket to receive data. This would cut the per-socket
memory consumption in half (maybe to a quarter, moving out the indexes
from the shared data ring into a separate page) and might be an
acceptable trade-off.

Any feedback or ideas?

Many thanks,

Stefano

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 9+ messages in thread