* [DRAFT 1] XenSock protocol design document
@ 2016-07-08 11:23 Stefano Stabellini
  2016-07-08 12:14 ` Juergen Gross
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-08 11:23 UTC (permalink / raw)
  To: xen-devel
  Cc: jgross, lars.kurth, wei.liu2, stefano, david.vrabel,
	joao.m.martins, boris.ostrovsky, roger.pau

[-- Attachment #1: Type: TEXT/PLAIN, Size: 20995 bytes --]

Hi all,

as promised, this is the design document for the XenSock protocol I
mentioned here:

http://marc.info/?l=xen-devel&m=146520572428581

It is still in its early days but should give you a good idea of what it
looks like and how it is supposed to work. Let me know if you find gaps
in the document and I'll fill them in the next version.

You can find prototypes of the Linux frontend and backend drivers here:

git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-1

To use them, make sure to enable CONFIG_XENSOCK in your kernel config
and add "xensock=1" to the command line of your DomU Linux kernel. You
also need the toolstack to create the initial xenstore nodes for the
protocol. To do that, please apply the attached patch to libxl (the
patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
file.

Feel free to try them out! Please be kind, they are only prototypes with
a few known issues :-) But they should work well enough to run simple
tests. If you find something missing, let me know or, even better, write
a patch!

I'll follow up with a separate document to cover the design of my
particular implementation of the protocol.

Cheers,

Stefano

---

# XenSocks Protocol v1

## Rationale

XenSocks is a paravirtualized protocol for the POSIX socket API.

The purpose of XenSocks is to allow the implementation of a specific set
of POSIX calls to be done in a domain other than your own. It allows
connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
implemented in another domain.

XenSocks provides the following benefits:
* guest networking works out of the box with VPNs, wireless networks and
  any other complex configurations on the host
* guest services listen on ports bound directly to the backend domain IP
  addresses
* localhost becomes a secure namespace for intra-VM communications
* full visibility of the guest behavior on the backend domain, allowing
  for inexpensive filtering and manipulation of any guest calls
* excellent performance


## Design

### Xenstore

The frontend and the backend connect to each other by exchanging information via
xenstore. The toolstack creates front and back nodes with state
XenbusStateInitialising. There can only be one XenSock frontend per domain.

#### Frontend XenBus Nodes

port
     Values:         <uint32_t>

     The identifier of the Xen event channel used to signal activity
     in the ring buffer.

ring-ref
     Values:         <uint32_t>

     The Xen grant reference granting permission for the backend to map
     the sole page in a single page sized ring buffer.
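
As a purely illustrative example (the exact xenstore paths are chosen by the
toolstack and frontend OS, not mandated by this document), the frontend area
created by the attached libxl patch, once the frontend has published its
transport parameters, might look like this, with placeholder values:

    /local/domain/<domid>/device/xensock/0/backend-id = "0"
    /local/domain/<domid>/device/xensock/0/state      = "4"
    /local/domain/<domid>/device/xensock/0/ring-ref   = "8"
    /local/domain/<domid>/device/xensock/0/port       = "5"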


#### State Machine

    **Front**                             **Back**
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.                          |
    - Allocate and initialize the                        |
      request ring.                                      V
    - Publish transport parameters                XenbusStateInitWait
      that will be in effect during
      this connection.
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page, which
is used to exchange messages over a ring, and an event channel, which is
used to send notifications.


### Commands Ring

The shared ring is used by the frontend to forward socket API calls to the
backend. I'll refer to this ring as the **commands ring** to distinguish it from
other rings which will be created later in the lifecycle of the protocol (data
rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
(`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
using the `RING_GET_REQUEST` macro.

The format is defined as follows:

    #define XENSOCK_DATARING_ORDER 6
    #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
    #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
    
    #define XENSOCK_CONNECT        0
    #define XENSOCK_RELEASE        3
    #define XENSOCK_BIND           4
    #define XENSOCK_LISTEN         5
    #define XENSOCK_ACCEPT         6
    #define XENSOCK_POLL           7
    
    struct xen_xensock_request {
        uint32_t id;     /* private to guest, echoed in response */
        uint32_t cmd;    /* command to execute */
        uint64_t sockid; /* id of the socket */
        union {
            struct xen_xensock_connect {
                uint8_t addr[28];
                uint32_t len;
                uint32_t flags;
                grant_ref_t ref[XENSOCK_DATARING_PAGES];
                uint32_t evtchn;
            } connect;
            struct xen_xensock_bind {
                uint8_t addr[28]; /* ipv6 ready */
                uint32_t len;
            } bind;
            struct xen_xensock_accept {
                uint64_t sockid;
                grant_ref_t ref[XENSOCK_DATARING_PAGES];
                uint32_t evtchn;
            } accept;
        } u;
    };

The first three fields are common to every command. Their binary layout
is:

    0       4       8       12      16
    +-------+-------+-------+-------+
    |  id   |  cmd  |     sockid    |
    +-------+-------+-------+-------+

- **id** is generated by the frontend and identifies one specific request
- **cmd** is the command requested by the frontend:
    - `XENSOCK_CONNECT`: 0
    - `XENSOCK_RELEASE`: 3
    - `XENSOCK_BIND`:    4
    - `XENSOCK_LISTEN`:  5
    - `XENSOCK_ACCEPT`:  6
    - `XENSOCK_POLL`:    7
- **sockid** is generated by the frontend and identifies the socket to connect,
  bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
  commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
  socket.
  
All three fields are echoed back by the backend.

As with other Xen ring-based protocols, after writing a request to the ring,
the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
channel notification when one is required.
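
For reference, the request path on the frontend side could be sketched as
follows using the standard ring macros. This is not part of the protocol
definition: the ring name `xen_xensock` and the notification helper are
illustrative, and error handling and locking are omitted.

    /* Sketch: commands ring setup and request submission (frontend side). */
    DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
                      struct xen_xensock_response);

    static struct xen_xensock_front_ring cmd_ring;

    /* "sring" is the shared page granted to the backend via ring-ref. */
    static void xensock_cmd_ring_init(struct xen_xensock_sring *sring)
    {
        SHARED_RING_INIT(sring);
        FRONT_RING_INIT(&cmd_ring, sring, PAGE_SIZE);
    }

    /* Copy one request onto the ring and notify the backend if needed. */
    static void xensock_submit_request(const struct xen_xensock_request *req)
    {
        struct xen_xensock_request *ring_req;
        int notify;

        ring_req = RING_GET_REQUEST(&cmd_ring, cmd_ring.req_prod_pvt);
        *ring_req = *req;
        cmd_ring.req_prod_pvt++;

        RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&cmd_ring, notify);
        if (notify)
            xensock_notify_backend(); /* illustrative event channel kick */
    }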

Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
The format is the following:

    struct xen_xensock_response {
        uint32_t id;
        uint32_t cmd;
        uint64_t sockid;
        int32_t ret;
    };
   
    0       4       8       12      16      20
    +-------+-------+-------+-------+-------+
    |  id   |  cmd  |     sockid    |  ret  |
    +-------+-------+-------+-------+-------+

- **id**: echoed back from request
- **cmd**: echoed back from request
- **sockid**: echoed back from request
- **ret**: return value, identifies success or failure

After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
it needs to notify the frontend and does so via event channel.
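
Symmetrically, a frontend could drain responses along these lines (a sketch,
reusing the illustrative `cmd_ring` above; `complete_request()` stands for
whatever bookkeeping matches responses to outstanding requests by **id**):

    /* Sketch: response consumption in the frontend's event channel handler. */
    static void xensock_poll_responses(void)
    {
        struct xen_xensock_response *rsp;
        RING_IDX cons, prod;
        int more;

        do {
            prod = cmd_ring.sring->rsp_prod;
            rmb(); /* read the producer index before the responses it covers */

            for (cons = cmd_ring.rsp_cons; cons != prod; cons++) {
                rsp = RING_GET_RESPONSE(&cmd_ring, cons);
                complete_request(rsp->id, rsp->sockid, rsp->ret);
            }
            cmd_ring.rsp_cons = cons;

            RING_FINAL_CHECK_FOR_RESPONSES(&cmd_ring, more);
        } while (more);
    }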

A description of each command, its additional request fields and the
expected response follows.


#### Connect

The **connect** operation corresponds to the connect system call. It connects a
socket to the specified address. **sockid** is freely chosen by the frontend and
references this specific socket from this point forward.

The connect operation creates a new shared ring, which we'll call the **data
ring**. The new ring is used to send and receive data over the connected socket.
Information necessary to set up the new ring, such as grant table references and
event channel ports, is passed from the frontend to the backend as part of
this request. A **data ring** is unmapped and freed upon issuing a **release**
command on the active socket identified by **sockid**.

When the frontend issues a **connect** command, the backend:
- creates a new socket and connects it to **addr**
- creates an internal mapping from **sockid** to its own socket
- maps all the grant references and uses them as shared memory for the new data
  ring
- binds the **evtchn**
- replies to the frontend

The data ring format will be described in a later section.

Fields:

- **cmd** value: 0
- additional fields:
  - **addr**: address to connect to, in struct sockaddr format
  - **len**: address length
  - **flags**: flags for the connection, reserved for future usage
  - **ref**: grant references of the data ring
  - **evtchn**: port number of the evtchn to signal activity on the data ring

Binary layout:

        16      20      24      28      32      36      40      44     48
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |                            addr                       |  len  |
        +-------+-------+-------+-------+-------+-------+-------+-------+
        | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[63]|evtchn |  
        +-------+-------+

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the socket system call
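
Putting the fields above together, a frontend might populate a connect request
roughly as follows. This is only a sketch: it assumes the data ring pages have
already been granted to the backend and an unbound event channel has been
allocated, and `next_request_id()` is an illustrative helper.

    /* Sketch: fill in a XENSOCK_CONNECT request. */
    static void xensock_fill_connect(struct xen_xensock_request *req,
                                     uint64_t sockid,
                                     const struct sockaddr *sa, uint32_t salen,
                                     const grant_ref_t *refs, uint32_t evtchn)
    {
        int i;

        memset(req, 0, sizeof(*req));
        req->id = next_request_id();            /* echoed back in the response */
        req->cmd = XENSOCK_CONNECT;
        req->sockid = sockid;                   /* chosen by the frontend */

        memcpy(req->u.connect.addr, sa, salen); /* salen <= sizeof(addr) == 28 */
        req->u.connect.len = salen;
        req->u.connect.flags = 0;               /* reserved for future use */
        for (i = 0; i < XENSOCK_DATARING_PAGES; i++)
            req->u.connect.ref[i] = refs[i];
        req->u.connect.evtchn = evtchn;
    }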

#### Release

The **release** operation closes an existing active or passive socket.

When a release command is issued on a passive socket, the backend releases it
and frees its internal mappings. When a release command is issued for an active
socket, the data ring is also unmapped and freed:

- frontend sends release command for an active socket
- backend releases the socket
- backend unmaps the ring
- backend unbinds the evtchn
- backend replies to frontend
- frontend frees ring and unbinds evtchn

Fields:

- **cmd** value: 3
- additional fields: none

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the shutdown system call

#### Bind

The **bind** operation assigns the address passed as parameter to the socket.
It corresponds to the bind system call. **sockid** is freely chosen by the
frontend and references this specific socket from this point forward. **Bind**,
**listen** and **accept** are the three operations required to have fully
working passive sockets and should be issued in this order.

Fields:

- **cmd** value: 4
- additional fields:
  - **addr**: address to bind to, in struct sockaddr format
  - **len**: address length

Binary layout:

        16      20      24      28      32      36      40      44     48
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |                            addr                       |  len  |
        +-------+-------+-------+-------+-------+-------+-------+-------+

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the bind system call


#### Listen

The **listen** operation marks the socket as a passive socket. It corresponds to
the listen system call.

Fields:

- **cmd** value: 5
- additional fields: none

Return value:
  - 0 on success
  - less than 0 on failure, see the error codes of the listen system call


#### Accept

The **accept** operation extracts the first connection request on the queue of
pending connections for the listening socket identified by **sockid** and
creates a new connected socket. The **sockid** of the new socket is also chosen
by the frontend and passed as an additional field of the accept request struct.

Similarly to the **connect** operation, **accept** creates a new data ring.
Information necessary to set up the new ring, such as grant table references and
event channel ports, is passed from the frontend to the backend as part of
the request.

The backend will reply to the request only when a new connection is successfully
accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.

Example workflow:

- frontend issues an **accept** request
- backend waits for a connection to be available on the socket
- a new connection becomes available
- backend accepts the new connection
- backend creates an internal mapping from **sockid** to the new socket
- backend maps all the grant references and uses them as shared memory for the
  new data ring
- backend binds the **evtchn**
- backend replies to the frontend

Fields:

- **cmd** value: 6
- additional fields:
  - **sockid**: id of the new socket
  - **ref**: grant references of the data ring
  - **evtchn**: port number of the evtchn to signal activity on the data ring

Binary layout:

        16      20      24      28      32      36      40      44     48
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |    sockid     |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] | 
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[6] |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[14]|ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[22]|ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[30]|ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[38]|ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[46]|ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[54]|ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[62]|ref[63]|evtchn | 
        +-------+-------+-------+

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the accept system call


#### Poll

The **poll** operation is only valid for passive sockets. For active sockets,
the frontend should look at the state of the data ring. When a new connection is
available in the queue of the passive socket, the backend generates a response
and notifies the frontend.

Fields:

- **cmd** value: 7
- additional fields: none

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the poll system call
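
To show how the passive-socket commands fit together, here is a sketch of the
bind/listen/accept sequence. `xensock_call()` (submit one request on the
commands ring and wait for the matching response) and `new_sockid()` are
illustrative helpers, and the data ring setup for accept is elided.

    /* Sketch: bind, listen and accept one connection on a passive socket. */
    static int xensock_passive_example(const struct sockaddr *sa, uint32_t salen)
    {
        struct xen_xensock_request req;
        uint64_t listen_id = new_sockid();
        int ret;

        memset(&req, 0, sizeof(req));          /* bind */
        req.cmd = XENSOCK_BIND;
        req.sockid = listen_id;
        memcpy(req.u.bind.addr, sa, salen);
        req.u.bind.len = salen;
        ret = xensock_call(&req);              /* fills in req.id internally */
        if (ret < 0)
            return ret;

        memset(&req, 0, sizeof(req));          /* listen */
        req.cmd = XENSOCK_LISTEN;
        req.sockid = listen_id;
        ret = xensock_call(&req);
        if (ret < 0)
            return ret;

        memset(&req, 0, sizeof(req));          /* accept */
        req.cmd = XENSOCK_ACCEPT;
        req.sockid = listen_id;                /* the listening socket */
        req.u.accept.sockid = new_sockid();    /* the new, connected socket */
        /* req.u.accept.ref[] and req.u.accept.evtchn set up as for connect */
        return xensock_call(&req);
    }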


### Data ring

Data rings are used for sending and receiving data over a connected socket. They
are created upon a successful **accept** or **connect** command. The ring works
in a similar way to the existing Xen console ring.

#### Format

    #define XENSOCK_DATARING_ORDER 6
    #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
    #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
    typedef uint32_t XENSOCK_RING_IDX;
    
    struct xensock_ring_intf {
    	char in[XENSOCK_DATARING_SIZE/4];
    	char out[XENSOCK_DATARING_SIZE/2];
    	XENSOCK_RING_IDX in_cons, in_prod;
    	XENSOCK_RING_IDX out_cons, out_prod;
    	int32_t in_error, out_error;
    };

The design is flexible and can support different ring sizes (at compile time).
The following description is based on order 6 rings, chosen because they provide
excellent performance.

- **in** is an array of 65536 bytes, used as a circular buffer. It contains data
  read from the socket. The producer is the backend, the consumer is the
  frontend.
- **out** is an array of 131072 bytes, used as a circular buffer. It contains
  data to be written to the socket. The producer is the frontend, the consumer
  is the backend.
- **in_cons** and **in_prod**
  Consumer and producer pointers for data read from the socket. They keep track
  of how much data has already been consumed by the frontend from the **in**
  array. **in_prod** is increased by the backend, after writing data to **in**.
  **in_cons** is increased by the frontend, after reading data from **in**.
- **out_cons**, **out_prod**
  Consumer and producer pointers for the data to be written to the socket. They
  keep track of how much data has been written by the frontend to **out** and
  how much data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons** is
  increased by the backend, after reading data from **out**.
- **in_error** and **out_error**
  They signal errors when reading from the socket (**in_error**) or when
  writing to the socket (**out_error**). 0 means no error. When an error
  occurs, no further read or write operations are performed on the socket. In
  the case of an orderly socket shutdown (i.e. read returns 0), **in_error** is
  set to -ENOTCONN. **in_error** and **out_error** are never set to -EAGAIN or
  -EWOULDBLOCK.

The binary layout follows:

    0        65536           196608     196612    196616    196620   196624    196628   196632
    +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
    |    in    |      out       | in_cons | in_prod |out_cons |out_prod |in_error |out_error|
    +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
    

#### Workflow

The **in** and **out** arrays are used as circular buffers:
    
    0                               sizeof(array)
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following function is provided to calculate how many bytes are currently
left unconsumed in an array:

    #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & (ring_size-1))

    static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
    		XENSOCK_RING_IDX cons,
    		XENSOCK_RING_IDX ring_size)
    {
    	XENSOCK_RING_IDX size;
    
    	if (prod == cons)
    		return 0;
    
    	prod = _MASK_XENSOCK_IDX(prod, ring_size);
    	cons = _MASK_XENSOCK_IDX(cons, ring_size);
    
    	if (prod == cons)
    		return ring_size;
    
    	if (prod > cons)
    		size = prod - cons;
    	else {
    		size = ring_size - cons;
    		size += prod;
    	}
    	return size;
    }
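
A corresponding helper for the free space, which the producer needs before
writing, is simply the complement (not part of the protocol definition above,
just a convenience):

    static inline XENSOCK_RING_IDX xensock_ring_free(XENSOCK_RING_IDX prod,
    		XENSOCK_RING_IDX cons,
    		XENSOCK_RING_IDX ring_size)
    {
    	return ring_size - xensock_ring_queued(prod, cons, ring_size);
    }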

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *cons*, *prod*, *error* from shared memory
- memory barrier
- return on *error*
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- memory barrier
- increase *prod*
- notify the other end via evtchn

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *prod*, *cons*, *error* from shared memory
- memory barrier
- return on *error*
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- memory barrier
- increase *cons*
- notify the other end via evtchn

The producer takes care of writing only as many bytes as are available in the
buffer up to *cons*. The consumer takes care of reading only as many bytes as
are available in the buffer up to *prod*. *error* is set by the backend when an
error occurs while reading from or writing to the socket.
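
As a worked example, the frontend send path (the producer for **out**) could be
implemented roughly as follows. This is a sketch, not part of the protocol:
`mb()`/`wmb()` stand for the platform's memory barriers and
`xensock_notify_peer()` is an illustrative event channel kick. The receive path
for **in** is symmetric, with the producer/consumer roles swapped and
**in_error** checked instead.

    /* Sketch: copy up to "len" bytes from "buf" into the out circular buffer. */
    static int xensock_write_out(struct xensock_ring_intf *intf,
                                 const char *buf, uint32_t len)
    {
        XENSOCK_RING_IDX cons, prod, wrap, free_bytes;
        XENSOCK_RING_IDX ring_size = XENSOCK_DATARING_SIZE / 2; /* sizeof(out) */

        cons = intf->out_cons;
        prod = intf->out_prod;
        mb(); /* read indices and error before touching the buffer */
        if (intf->out_error)
            return intf->out_error;

        /* Never write more than the free space up to cons. */
        free_bytes = ring_size - xensock_ring_queued(prod, cons, ring_size);
        if (len > free_bytes)
            len = free_bytes;

        /* Copy, wrapping around the end of the circular buffer if needed. */
        wrap = ring_size - _MASK_XENSOCK_IDX(prod, ring_size);
        if (len <= wrap) {
            memcpy(intf->out + _MASK_XENSOCK_IDX(prod, ring_size), buf, len);
        } else {
            memcpy(intf->out + _MASK_XENSOCK_IDX(prod, ring_size), buf, wrap);
            memcpy(intf->out, buf + wrap, len - wrap);
        }

        wmb(); /* publish the data before updating the producer index */
        intf->out_prod = prod + len;
        xensock_notify_peer(); /* illustrative */
        return len;
    }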

[-- Attachment #2: Type: TEXT/PLAIN, Size: 8792 bytes --]

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index c39d745..f4c019d 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2299,6 +2299,70 @@ int libxl_devid_to_device_vtpm(libxl_ctx *ctx,
     return rc;
 }
 
+/******************************************************************************/
+
+int libxl__device_xensock_setdefault(libxl__gc *gc, libxl_device_xensock *xensock)
+{
+    int rc;
+
+    rc = libxl__resolve_domid(gc, xensock->backend_domname, &xensock->backend_domid);
+    return rc;
+}
+
+static int libxl__device_from_xensock(libxl__gc *gc, uint32_t domid,
+                                   libxl_device_xensock *xensock,
+                                   libxl__device *device)
+{
+   device->backend_devid   = xensock->devid;
+   device->backend_domid   = xensock->backend_domid;
+   device->backend_kind    = LIBXL__DEVICE_KIND_XENSOCK;
+   device->devid           = xensock->devid;
+   device->domid           = domid;
+   device->kind            = LIBXL__DEVICE_KIND_XENSOCK;
+
+   return 0;
+}
+
+
+int libxl__device_xensock_add(libxl__gc *gc, uint32_t domid,
+                           libxl_device_xensock *xensock)
+{
+    flexarray_t *front;
+    flexarray_t *back;
+    libxl__device device;
+    int rc;
+
+    rc = libxl__device_xensock_setdefault(gc, xensock);
+    if (rc) goto out;
+
+    front = flexarray_make(gc, 16, 1);
+    back = flexarray_make(gc, 16, 1);
+
+    if (xensock->devid == -1) {
+        if ((xensock->devid = libxl__device_nextid(gc, domid, "xensock")) < 0) {
+            rc = ERROR_FAIL;
+            goto out;
+        }
+    }
+
+    rc = libxl__device_from_xensock(gc, domid, xensock, &device);
+    if (rc != 0) goto out;
+
+    flexarray_append_pair(back, "frontend-id", libxl__sprintf(gc, "%d", domid));
+    flexarray_append_pair(back, "online", "1");
+    flexarray_append_pair(back, "state", GCSPRINTF("%d", XenbusStateInitialising));
+    flexarray_append_pair(front, "backend-id",
+                          libxl__sprintf(gc, "%d", xensock->backend_domid));
+    flexarray_append_pair(front, "state", GCSPRINTF("%d", XenbusStateInitialising));
+
+    libxl__device_generic_add(gc, XBT_NULL, &device,
+                              libxl__xs_kvs_of_flexarray(gc, back, back->count),
+                              libxl__xs_kvs_of_flexarray(gc, front, front->count),
+                              NULL);
+    rc = 0;
+out:
+    return rc;
+}
 
 /******************************************************************************/
 
@@ -4250,6 +4314,8 @@ out:
  * libxl_device_vfb_destroy
  * libxl_device_usbctrl_remove
  * libxl_device_usbctrl_destroy
+ * libxl_device_xensock_remove
+ * libxl_device_xensock_destroy
  */
 #define DEFINE_DEVICE_REMOVE_EXT(type, remtype, removedestroy, f)        \
     int libxl_device_##type##_##removedestroy(libxl_ctx *ctx,           \
@@ -4311,6 +4377,11 @@ DEFINE_DEVICE_REMOVE(vtpm, destroy, 1)
 DEFINE_DEVICE_REMOVE_CUSTOM(usbctrl, remove, 0)
 DEFINE_DEVICE_REMOVE_CUSTOM(usbctrl, destroy, 1)
 
+/* xensock */
+
+DEFINE_DEVICE_REMOVE(xensock, remove, 0)
+DEFINE_DEVICE_REMOVE(xensock, destroy, 1)
+
 /* channel/console hotunplug is not implemented. There are 2 possibilities:
  * 1. add support for secondary consoles to xenconsoled
  * 2. dynamically add/remove qemu chardevs via qmp messages. */
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 2c0f868..e36958b 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1753,6 +1753,16 @@ int libxl_device_vfb_destroy(libxl_ctx *ctx, uint32_t domid,
                              const libxl_asyncop_how *ao_how)
                              LIBXL_EXTERNAL_CALLERS_ONLY;
 
+/* xensock */
+int libxl_device_xensock_remove(libxl_ctx *ctx, uint32_t domid,
+                            libxl_device_xensock *xensock,
+                            const libxl_asyncop_how *ao_how)
+                             LIBXL_EXTERNAL_CALLERS_ONLY;
+int libxl_device_xensock_destroy(libxl_ctx *ctx, uint32_t domid,
+                             libxl_device_xensock *xensock,
+                             const libxl_asyncop_how *ao_how)
+                             LIBXL_EXTERNAL_CALLERS_ONLY;
+
 /* PCI Passthrough */
 int libxl_device_pci_add(libxl_ctx *ctx, uint32_t domid,
                          libxl_device_pci *pcidev,
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 5000bd0..9549546 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1374,6 +1374,8 @@ static void domcreate_launch_dm(libxl__egc *egc, libxl__multidev *multidev,
             libxl__device_vfb_add(gc, domid, &d_config->vfbs[i]);
             libxl__device_vkb_add(gc, domid, &d_config->vkbs[i]);
         }
+        for (i = 0; i < d_config->num_xensocks; i++)
+            libxl__device_xensock_add(gc, domid, &d_config->xensocks[i]);
 
         init_console_info(gc, &console, 0);
         console.backend_domid = state->console_domid;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index c791418..2cad021 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1224,6 +1224,7 @@ _hidden int libxl__device_vkb_setdefault(libxl__gc *gc, libxl_device_vkb *vkb);
 _hidden int libxl__device_pci_setdefault(libxl__gc *gc, libxl_device_pci *pci);
 _hidden void libxl__rdm_setdefault(libxl__gc *gc,
                                    libxl_domain_build_info *b_info);
+_hidden int libxl__device_xensock_setdefault(libxl__gc *gc, libxl_device_xensock *xensock);
 
 _hidden const char *libxl__device_nic_devname(libxl__gc *gc,
                                               uint32_t domid,
@@ -2647,6 +2648,10 @@ _hidden int libxl__device_vkb_add(libxl__gc *gc, uint32_t domid,
 _hidden int libxl__device_vfb_add(libxl__gc *gc, uint32_t domid,
                                   libxl_device_vfb *vfb);
 
+/* Internal function to connect a xensock device */
+_hidden int libxl__device_xensock_add(libxl__gc *gc, uint32_t domid,
+                                  libxl_device_xensock *xensock);
+
 /* Waits for the passed device to reach state XenbusStateInitWait.
  * This is not really useful by itself, but is important when executing
  * hotplug scripts, since we need to be sure the device is in the correct
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9840f3b..6e15766 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -685,6 +685,12 @@ libxl_device_vtpm = Struct("device_vtpm", [
     ("uuid",             libxl_uuid),
 ])
 
+libxl_device_xensock = Struct("device_xensock", [
+    ("backend_domid",    libxl_domid),
+    ("backend_domname",  string),
+    ("devid",            libxl_devid),
+])
+
 libxl_device_channel = Struct("device_channel", [
     ("backend_domid", libxl_domid),
     ("backend_domname", string),
@@ -709,6 +715,7 @@ libxl_domain_config = Struct("domain_config", [
     ("vfbs", Array(libxl_device_vfb, "num_vfbs")),
     ("vkbs", Array(libxl_device_vkb, "num_vkbs")),
     ("vtpms", Array(libxl_device_vtpm, "num_vtpms")),
+    ("xensocks", Array(libxl_device_xensock, "num_xensocks")),
     # a channel manifests as a console with a name,
     # see docs/misc/channels.txt
     ("channels", Array(libxl_device_channel, "num_channels")),
diff --git a/tools/libxl/libxl_types_internal.idl b/tools/libxl/libxl_types_internal.idl
index 177f9b7..4293b22 100644
--- a/tools/libxl/libxl_types_internal.idl
+++ b/tools/libxl/libxl_types_internal.idl
@@ -24,6 +24,7 @@ libxl__device_kind = Enumeration("device_kind", [
     (8, "VTPM"),
     (9, "VUSB"),
     (10, "QUSB"),
+    (11, "XENSOCK"),
     ])
 
 libxl__console_backend = Enumeration("console_backend", [
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 03ab644..7d8257e 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1898,6 +1898,18 @@ static void parse_config_data(const char *config_source,
             free(buf2);
         }
     }
+    
+    if (!xlu_cfg_get_long(config, "xensock", &l, 0)) {
+        libxl_device_xensock *xensock;
+        fprintf(stderr, "Creating xensock l=%lu\n", l);
+        d_config->num_xensocks = 0;
+        d_config->xensocks = NULL;
+        xensock = ARRAY_EXTEND_INIT(d_config->xensocks,
+                d_config->num_xensocks,
+                libxl_device_xensock_init);
+        libxl_device_xensock_init(xensock);
+        replace_string(&xensock->backend_domname, "0");
+    }
 
     if (!xlu_cfg_get_list (config, "channel", &channels, 0, 0)) {
         d_config->num_channels = 0;

[-- Attachment #3: Type: text/plain, Size: 127 bytes --]



* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 11:23 [DRAFT 1] XenSock protocol design document Stefano Stabellini
@ 2016-07-08 12:14 ` Juergen Gross
  2016-07-08 14:16   ` Stefano Stabellini
  2016-07-08 17:11 ` David Vrabel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Juergen Gross @ 2016-07-08 12:14 UTC (permalink / raw)
  To: Stefano Stabellini, xen-devel
  Cc: lars.kurth, wei.liu2, david.vrabel, boris.ostrovsky,
	joao.m.martins, roger.pau

On 08/07/16 13:23, Stefano Stabellini wrote:
> Hi all,
> 
> as promised, this is the design document for the XenSock protocol I
> mentioned here:
> 
> http://marc.info/?l=xen-devel&m=146520572428581
> 
> It is still in its early days but should give you a good idea of how it
> looks like and how it is supposed to work. Let me know if you find gaps
> in the document and I'll fill them in the next version.
> 
> You can find prototypes of the Linux frontend and backend drivers here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-1
> 
> To use them, make sure to enable CONFIG_XENSOCK in your kernel config
> and add "xensock=1" to the command line of your DomU Linux kernel. You
> also need the toolstack to create the initial xenstore nodes for the
> protocol. To do that, please apply the attached patch to libxl (the
> patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
> file.
> 
> Feel free to try them out! Please be kind, they are only prototypes with
> a few known issues :-) But they should work well enough to run simple
> tests. If you find something missing, let me know or, even better, write
> a patch!
> 
> I'll follow up with a separate document to cover the design of my
> particular implementation of the protocol.
> 
> Cheers,
> 
> Stefano
> 
> ---
> 
> # XenSocks Protocol v1
> 
> ## Rationale
> 
> XenSocks is a paravirtualized protocol for the POSIX socket API.
> 
> The purpose of XenSocks is to allow the implementation of a specific set
> of POSIX calls to be done in a domain other than your own. It allows
> connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
> implemented in another domain.
> 
> XenSocks provides the following benefits:
> * guest networking works out of the box with VPNs, wireless networks and
>   any other complex configurations on the host
> * guest services listen on ports bound directly to the backend domain IP
>   addresses
> * localhost becomes a secure namespace for intra-VMs communications
> * full visibility of the guest behavior on the backend domain, allowing
>   for inexpensive filtering and manipulation of any guest calls
> * excellent performance
> 
> 
> ## Design
> 
> ### Xenstore
> 
> The frontend and the backend connect to each other exchanging information via
> xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. There can only be one XenSock frontend per domain.
> 
> #### Frontend XenBus Nodes
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.
> 
> 
> #### State Machine
> 
>     **Front**                             **Back**
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.                          |
>     - Allocate and initialize the                        |
>       request ring.                                      V
>     - Publish transport parameters                XenbusStateInitWait
>       that will be in effect during
>       this connection.
>                  |
>                  |
>                  V
>        XenbusStateInitialised
> 
>                                           - Query frontend transport parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
> 
>      - Query backend device properties.
>      - Finalize OS virtual device
>        instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
> 
> Once frontend and backend are connected, they have a shared page, which
> will is used to exchange messages over a ring, and an event channel,
> which is used to send notifications.
> 
> 
> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which will be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     
>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
>     
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };

Below you write the data ring is flexible and can support different
ring sizes. This is in contradiction to the definition above: as soon
as you modify the ring size you change the interface. You'd have to
modify all guests and the host at the same time.

The flexibility should be kept, so I suggest ring size negotiation via
xenstore: the backend should indicate the maximum supported size and
the frontend should tell which size it is using. In the beginning I'd
see no problem with accepting the connection only if both values are
XENSOCK_DATARING_PAGES.

The connect and accept calls should reference only one page (possibly
with an offset into that page) holding the grant_ref_t array of the
needed size.

> 
> The first three fields are common for every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 
> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7

Any reason for omitting the values 1 and 2?

> - **sockid** is generated by the frontend and identifies the socket to connect,
>   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
>   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
>   socket.
>   
> All three fields are echoed back by the backend.
> 
> As for the other Xen ring based protocols, after writing a request to the ring,
> the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> channel notification when a notification is required.
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
> The format is the following:
> 
>     struct xen_xensock_response {
>         uint32_t id;
>         uint32_t cmd;
>         uint64_t sockid;
>         int32_t ret;
>     };
>    
>     0       4       8       12      16      20
>     +-------+-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |  ret  |
>     +-------+-------+-------+-------+-------+
> 
> - **id**: echoed back from request
> - **cmd**: echoed back from request
> - **sockid**: echoed back from request
> - **ret**: return value, identifies success or failure
> 
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
> it needs to notify the frontend and does so via event channel.
> 
> A description of each command, their additional request fields and the
> expected responses follow.
> 
> 
> #### Connect
> 
> The **connect** operation corresponds to the connect system call. It connects a
> socket to the specified address. **sockid** is freely chosen by the frontend and
> references this specific socket from this point forward.
> 
> The connect operation creates a new shared ring, which we'll call **data ring**.
> The new ring is used to send and receive data over the connected socket.
> Information necessary to setup the new ring, such as grant table references and
> event channel ports, are passed from the frontend to the backend as part of
> this request. A **data ring** is unmapped and freed upon issuing a **release**
> command on the active socket identified by **sockid**.
> 
> When the frontend issues a **connect** command, the backend:
> - creates a new socket and connects it to **addr**
> - creates an internal mapping from **sockid** to its own socket
> - maps all the grant references and uses them as shared memory for the new data
>   ring
> - bind the **evtchn**
> - replies to the frontend
> 
> The data ring format will be described in the following section.
> 
> Fields:
> 
> - **cmd** value: 0
> - additional fields:
>   - **addr**: address to connect to, in struct sockaddr format

So you expect only Linux guests with the current sockaddr layout?
Please specify the structure in the interface.

>   - **len**: address length
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant references of the data ring
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[63]|evtchn |  
>         +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the socket system call

Again: don't think only Linux.

> 
> #### Release
> 
> The **release** operation closes an existing active or a passive socket.
> 
> When a release command is issued on a passive socket, the backend releases it
> and frees its internal mappings. When a release command is issued for an active
> socket, the data ring is also unmapped and freed:
> 
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the ring
> - backend unbinds the evtchn
> - backend replies to frontend
> - frontend frees ring and unbinds evtchn
> 
> Fields:
> 
> - **cmd** value: 3
> - additional fields: none
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the shutdown system call

Again Linux only.

> 
> #### Bind
> 
> The **bind** operation assigns the address passed as parameter to the socket.
> It corresponds to the bind system call. **sockid** is freely chosen by the
> frontend and references this specific socket from this point forward. **Bind**,
> **listen** and **accept** are the three operations required to have fully
> working passive sockets and should be issued in this order.
> 
> Fields:
> 
> - **cmd** value: 4
> - additional fields:
>   - **addr**: address to bind to, in struct sockaddr format

Dito.

>   - **len**: address length
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the bind system call

Again.

> 
> 
> #### Listen
> 
> The **listen** operation marks the socket as a passive socket. It corresponds to
> the listen system call.
> 
> Fields:
> 
> - **cmd** value: 5
> - additional fields: none
> 
> Return value:
>   - 0 on success
>   - less than 0 on failure, see the error codes of the listen system call

Again.

> 
> 
> #### Accept
> 
> The **accept** operation extracts the first connection request on the queue of
> pending connections for the listening socket identified by **sockid** and
> creates a new connected socket. The **sockid** of the new socket is also chosen
> by the frontend and passed as an additional field of the accept request struct.
> 
> Similarly to the **connect** operation, **accept** creates a new data ring.
> Information necessary to setup the new ring, such as grant table references and
> event channel ports, are passed from the frontend to the backend as part of
> the request.
> 
> The backend will reply to the request only when a new connection is successfully
> accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.
> 
> Example workflow:
> 
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **sockid** to the new socket
> - backend maps all the grant references and uses them as shared memory for the
>   new data ring
> - backend binds the **evtchn**
> - backend replies to the frontend
> 
> Fields:
> 
> - **cmd** value: 6
> - additional fields:
>   - **sockid**: id of the new socket
>   - **ref**: grant references of the data ring
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |    sockid     |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] | 
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[6] |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[14]|ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[22]|ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[30]|ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[38]|ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[46]|ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[54]|ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[62]|ref[63]|evtchn | 
>         +-------+-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the accept system call

Again.

> 
> 
> #### Poll
> 
> The **poll** operation is only valid for passive sockets. For active sockets,
> the frontend should look at the state of the data ring. When a new connection is
> available in the queue of the passive socket, the backend generates a response
> and notifies the frontend.
> 
> Fields:
> 
> - **cmd** value: 7
> - additional fields: none
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the poll system call

Again.

> 
> 
> ### Data ring
> 
> Data rings are used for sending and receiving data over a connected socket. They
> are created upon a successful **accept** or **connect** command. The ring works
> in a similar way to the existing Xen console ring.
> 
> #### Format
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     typedef uint32_t XENSOCK_RING_IDX;
>     
>     struct xensock_ring_intf {
>     	char in[XENSOCK_DATARING_SIZE/4];
>     	char out[XENSOCK_DATARING_SIZE/2];
>     	XENSOCK_RING_IDX in_cons, in_prod;
>     	XENSOCK_RING_IDX out_cons, out_prod;
>     	int32_t in_error, out_error;
>     };

So you are wasting nearly 64kB of memory?

Wouldn't it make more sense to have 1 page with the admin data (in_*,
out_*) and the appropriate number of pages with the ring buffers? The
admin page could even be the one holding the grant_ref_t array of the
ring buffer pages needed for accept/connect.

> The design is flexible and can support different ring sizes (at compile time).
> The following description is based on order 6 rings, chosen because they provide
> excellent performance.
> 
> - **in** is an array of 65536 bytes, used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array of 131072 bytes, used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer pointers for data read from the socket. They keep track
>   of how much data has already been consumed by the frontend from the **in**
>   array. **in_prod** is increased by the backend, after writing data to **in**.
>   **in_cons** is increased by the frontend, after reading data from **in**.
> - **out_cons**, **out_prod**
>   Consumer and producer pointers for the data to be written to the socket. They
>   keep track of how much data has been written by the frontend to **out** and
>   how much data has already been consumed by the backend. **out_prod** is
>   increased by the frontend, after writing data to **out**. **out_cons** is
>   increased by the backend, after reading data from **out**.
> - **in_error** and **out_error** They signal errors when reading from the socket
>   (**in_error**) or when writing to the socket (**out_error**). 0 means no
>   errors. When an error occurs, no further reads or writes operations are
>   performed on the socket. In the case of an orderly socket shutdown (i.e. read
>   returns 0) **in_error** is set to -ENOTCONN. **in_error** and **out_error**

Which value? I've found systems with: 57, 76, 107, 134 or 235 (just to
make clear that even an errno name isn't optimal).

>   are never set to -EAGAIN or -EWOULDBLOCK.
> 
> The binary layout follows:
> 
>     0        65536           196608     196612    196616    196620   196624    196628   196632
>     +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
>     |    in    |      out       | in_cons | in_prod |out_cons |out_prod |in_error |out_error|
>     +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
>     
> 
> #### Workflow
> 
> The **in** and **out** arrays are used as circular buffers:
>     
>     0                               sizeof(array)
>     +-----------------------------------+
>     |to consume|    free    |to consume |
>     +-----------------------------------+
>                ^            ^
>                prod         cons
> 
>     0                               sizeof(array)
>     +-----------------------------------+
>     |  free    | to consume |   free    |
>     +-----------------------------------+
>                ^            ^
>                cons         prod
> 
> The following function is provided to calculate how many bytes are currently
> left unconsumed in an array:
> 
>     #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & (ring_size-1))
> 
>     static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
>     		XENSOCK_RING_IDX cons,
>     		XENSOCK_RING_IDX ring_size)
>     {
>     	XENSOCK_RING_IDX size;
>     
>     	if (prod == cons)
>     		return 0;
>     
>     	prod = _MASK_XENSOCK_IDX(prod, ring_size);
>     	cons = _MASK_XENSOCK_IDX(cons, ring_size);
>     
>     	if (prod == cons)
>     		return ring_size;
>     
>     	if (prod > cons)
>     		size = prod - cons;
>     	else {
>     		size = ring_size - cons;
>     		size += prod;
>     	}
>     	return size;
>     }
> 
> The producer (the backend for **in**, the frontend for **out**) writes to the
> array in the following way:
> 
> - read *cons*, *prod*, *error* from shared memory
> - memory barrier
> - return on *error*
> - write to array at position *prod* up to *cons*, wrapping around the circular
>   buffer when necessary
> - memory barrier
> - increase *prod*
> - notify the other end via evtchn
> 
> The consumer (the backend for **out**, the frontend for **in**) reads from the
> array in the following way:
> 
> - read *prod*, *cons*, *error* from shared memory
> - memory barrier
> - return on *error*
> - read from array at position *cons* up to *prod*, wrapping around the circular
>   buffer when necessary
> - memory barrier
> - increase *cons*
> - notify the other end via evtchn
> 
> The producer takes care of writing only as many bytes as available in the buffer
> up to *cons*. The consumer takes care of reading only as many bytes as available
> in the buffer up to *prod*. *error* is set by the backend when an error occurs
> writing or reading from the socket.
> 


Juergen



* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 12:14 ` Juergen Gross
@ 2016-07-08 14:16   ` Stefano Stabellini
  2016-07-08 14:27     ` Juergen Gross
  2016-07-08 15:57     ` David Vrabel
  0 siblings, 2 replies; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-08 14:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: lars.kurth, wei.liu2, Stefano Stabellini, david.vrabel,
	xen-devel, joao.m.martins, boris.ostrovsky, roger.pau

Hi Juergen,

many thanks for the fast and very useful review!


On Fri, 8 Jul 2016, Juergen Gross wrote:
> On 08/07/16 13:23, Stefano Stabellini wrote:
> > Hi all,
> > 
> > as promised, this is the design document for the XenSock protocol I
> > mentioned here:
> > 
> > http://marc.info/?l=xen-devel&m=146520572428581
> > 
> > It is still in its early days but should give you a good idea of how it
> > looks like and how it is supposed to work. Let me know if you find gaps
> > in the document and I'll fill them in the next version.
> > 
> > You can find prototypes of the Linux frontend and backend drivers here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-1
> > 
> > To use them, make sure to enable CONFIG_XENSOCK in your kernel config
> > and add "xensock=1" to the command line of your DomU Linux kernel. You
> > also need the toolstack to create the initial xenstore nodes for the
> > protocol. To do that, please apply the attached patch to libxl (the
> > patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
> > file.
> > 
> > Feel free to try them out! Please be kind, they are only prototypes with
> > a few known issues :-) But they should work well enough to run simple
> > tests. If you find something missing, let me know or, even better, write
> > a patch!
> > 
> > I'll follow up with a separate document to cover the design of my
> > particular implementation of the protocol.
> > 
> > Cheers,
> > 
> > Stefano
> > 
> > ---
> > 
> > # XenSocks Protocol v1
> > 
> > ## Rationale
> > 
> > XenSocks is a paravirtualized protocol for the POSIX socket API.
> > 
> > The purpose of XenSocks is to allow the implementation of a specific set
> > of POSIX calls to be done in a domain other than your own. It allows
> > connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
> > implemented in another domain.
> > 
> > XenSocks provides the following benefits:
> > * guest networking works out of the box with VPNs, wireless networks and
> >   any other complex configurations on the host
> > * guest services listen on ports bound directly to the backend domain IP
> >   addresses
> > * localhost becomes a secure namespace for intra-VMs communications
> > * full visibility of the guest behavior on the backend domain, allowing
> >   for inexpensive filtering and manipulation of any guest calls
> > * excellent performance
> > 
> > 
> > ## Design
> > 
> > ### Xenstore
> > 
> > The frontend and the backend connect to each other exchanging information via
> > xenstore. The toolstack creates front and back nodes with state
> > XenbusStateInitialising. There can only be one XenSock frontend per domain.
> > 
> > #### Frontend XenBus Nodes
> > 
> > port
> >      Values:         <uint32_t>
> > 
> >      The identifier of the Xen event channel used to signal activity
> >      in the ring buffer.
> > 
> > ring-ref
> >      Values:         <uint32_t>
> > 
> >      The Xen grant reference granting permission for the backend to map
> >      the sole page in a single page sized ring buffer.
> > 
> > 
> > #### State Machine
> > 
> >     **Front**                             **Back**
> >     XenbusStateInitialising               XenbusStateInitialising
> >     - Query virtual device                - Query backend device
> >       properties.                           identification data.
> >     - Setup OS device instance.                          |
> >     - Allocate and initialize the                        |
> >       request ring.                                      V
> >     - Publish transport parameters                XenbusStateInitWait
> >       that will be in effect during
> >       this connection.
> >                  |
> >                  |
> >                  V
> >        XenbusStateInitialised
> > 
> >                                           - Query frontend transport parameters.
> >                                           - Connect to the request ring and
> >                                             event channel.
> >                                                          |
> >                                                          |
> >                                                          V
> >                                                  XenbusStateConnected
> > 
> >      - Query backend device properties.
> >      - Finalize OS virtual device
> >        instance.
> >                  |
> >                  |
> >                  V
> >         XenbusStateConnected
> > 
> > Once frontend and backend are connected, they have a shared page, which
> > is used to exchange messages over a ring, and an event channel,
> > which is used to send notifications.
> > 
> > 
> > ### Commands Ring
> > 
> > The shared ring is used by the frontend to forward socket API calls to the
> > backend. I'll refer to this ring as **commands ring** to distinguish it from
> > other rings which will be created later in the lifecycle of the protocol (data
> > rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> > (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> > using the `RING_GET_REQUEST` macro.
> > 
> > The format is defined as follows:
> > 
> >     #define XENSOCK_DATARING_ORDER 6
> >     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> >     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> >     
> >     #define XENSOCK_CONNECT        0
> >     #define XENSOCK_RELEASE        3
> >     #define XENSOCK_BIND           4
> >     #define XENSOCK_LISTEN         5
> >     #define XENSOCK_ACCEPT         6
> >     #define XENSOCK_POLL           7
> >     
> >     struct xen_xensock_request {
> >         uint32_t id;     /* private to guest, echoed in response */
> >         uint32_t cmd;    /* command to execute */
> >         uint64_t sockid; /* id of the socket */
> >         union {
> >             struct xen_xensock_connect {
> >                 uint8_t addr[28];
> >                 uint32_t len;
> >                 uint32_t flags;
> >                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
> >                 uint32_t evtchn;
> >             } connect;
> >             struct xen_xensock_bind {
> >                 uint8_t addr[28]; /* ipv6 ready */
> >                 uint32_t len;
> >             } bind;
> >             struct xen_xensock_accept {
> >                 uint64_t sockid;
> >                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
> >                 uint32_t evtchn;
> >             } accept;
> >         } u;
> >     };
> 
> Below you write the data ring is flexible and can support different
> ring sizes. This is in contradiction to the definition above: as soon
> as you modify the ring size you change the interface. You'd have to
> modify all guests and the host at the same time.

Yeah, I meant at compile time (which I understand is not useful for
anything other than embedded use cases). But you are right that it would
be nice to be able to choose the ring size at runtime.


> The flexibility should be kept, so I suggest ring size negotiation via
> xenstore: the backend should indicate the maximum supported size and
> the frontend should tell which size it is using. In the beginning I'd
> see no problem with accepting connection only if both values are
> XENSOCK_DATARING_PAGES.

I'll look into it.
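
[Sketch only, to make the idea concrete; the node names below are purely
illustrative and not part of the protocol yet.] Following the existing XenBus
node convention, the negotiation could look like:

    max-dataring-page-order (backend node)
         Values:         <uint32_t>

         The maximum data ring order (log2 of the number of pages) that the
         backend supports.

    dataring-page-order (frontend node)
         Values:         <uint32_t>

         The data ring order chosen by the frontend, which must be less than
         or equal to the backend's maximum.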


> The connect and accept calls should reference only one page (possibly
> with an offset into that page) holding the grant_ref_t array of the
> needed size.

It would be nice to send the refs as part of the request as done here,
but I imagine that it would be an issue with a variable number of refs
because everything in the request struct needs to be sized up at compile
time. That's the reason why you are suggesting to send them separately,
right?

However, they might fit in the page used as the admin page for the ring (as
you suggested below), so that would be OK.
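
[Sketch only, to illustrate the alternative being discussed; this is not a
proposal for the final layout and the struct/field names are made up.]

    #include <stdint.h>

    typedef uint32_t grant_ref_t;

    /* A fixed-size descriptor in the request could point to one extra granted
     * page that holds the variable-length grant_ref_t array of the data ring. */
    struct xen_xensock_dataring_desc {
        grant_ref_t ref;     /* grant reference of the page holding the array */
        uint32_t offset;     /* offset of the grant_ref_t array in that page  */
        uint32_t nr_refs;    /* number of data ring pages                     */
        uint32_t evtchn;     /* event channel for the data ring               */
    };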


> > The first three fields are common for every command. Their binary layout
> > is:
> > 
> >     0       4       8       12      16
> >     +-------+-------+-------+-------+
> >     |  id   |  cmd  |     sockid    |
> >     +-------+-------+-------+-------+
> > 
> > - **id** is generated by the frontend and identifies one specific request
> > - **cmd** is the command requested by the frontend:
> >     - `XENSOCK_CONNECT`: 0
> >     - `XENSOCK_RELEASE`: 3
> >     - `XENSOCK_BIND`:    4
> >     - `XENSOCK_LISTEN`:  5
> >     - `XENSOCK_ACCEPT`:  6
> >     - `XENSOCK_POLL`:    7
> 
> Any reason for omitting the values 1 and 2?

Nope. I'll fix this.


> > - **sockid** is generated by the frontend and identifies the socket to connect,
> >   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
> >   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
> >   socket.
> >   
> > All three fields are echoed back by the backend.
> > 
> > As for the other Xen ring based protocols, after writing a request to the ring,
> > the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> > channel notification when a notification is required.
> > 
> > Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
> > The format is the following:
> > 
> >     struct xen_xensock_response {
> >         uint32_t id;
> >         uint32_t cmd;
> >         uint64_t sockid;
> >         int32_t ret;
> >     };
> >    
> >     0       4       8       12      16      20
> >     +-------+-------+-------+-------+-------+
> >     |  id   |  cmd  |     sockid    |  ret  |
> >     +-------+-------+-------+-------+-------+
> > 
> > - **id**: echoed back from request
> > - **cmd**: echoed back from request
> > - **sockid**: echoed back from request
> > - **ret**: return value, identifies success or failure
> > 
> > After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
> > it needs to notify the frontend and does so via event channel.
> > 
> > A description of each command, their additional request fields and the
> > expected responses follow.
> > 
> > 
> > #### Connect
> > 
> > The **connect** operation corresponds to the connect system call. It connects a
> > socket to the specified address. **sockid** is freely chosen by the frontend and
> > references this specific socket from this point forward.
> > 
> > The connect operation creates a new shared ring, which we'll call **data ring**.
> > The new ring is used to send and receive data over the connected socket.
> > Information necessary to setup the new ring, such as grant table references and
> > event channel ports, are passed from the frontend to the backend as part of
> > this request. A **data ring** is unmapped and freed upon issuing a **release**
> > command on the active socket identified by **sockid**.
> > 
> > When the frontend issues a **connect** command, the backend:
> > - creates a new socket and connects it to **addr**
> > - creates an internal mapping from **sockid** to its own socket
> > - maps all the grant references and uses them as shared memory for the new data
> >   ring
> > - bind the **evtchn**
> > - replies to the frontend
> > 
> > The data ring format will be described in the following section.
> > 
> > Fields:
> > 
> > - **cmd** value: 0
> > - additional fields:
> >   - **addr**: address to connect to, in struct sockaddr format
> 
> So you expect only Linux guests with the current sockaddr layout?
> Please specify the structure in the interface.

I meant sockaddr as defined by POSIX (the Open Group standard):

http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html


> >   - **len**: address length
> >   - **flags**: flags for the connection, reserved for future usage
> >   - **ref**: grant references of the data ring
> >   - **evtchn**: port number of the evtchn to signal activity on the data ring
> > 
> > Binary layout:
> > 
> >         16      20      24      28      32      36      40      44     48
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |                            addr                       |  len  |
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[63]|evtchn |  
> >         +-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the socket system call
> 
> Again: don't think only Linux.

Same here:

http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html

Except that the errors are negative; I should have clarified that.


> > #### Release
> > 
> > The **release** operation closes an existing active or a passive socket.
> > 
> > When a release command is issued on a passive socket, the backend releases it
> > and frees its internal mappings. When a release command is issued for an active
> > socket, the data ring is also unmapped and freed:
> > 
> > - frontend sends release command for an active socket
> > - backend releases the socket
> > - backend unmaps the ring
> > - backend unbinds the evtchn
> > - backend replies to frontend
> > - frontend frees ring and unbinds evtchn
> > 
> > Fields:
> > 
> > - **cmd** value: 3
> > - additional fields: none
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the shutdown system call
> 
> Again Linux only.

http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html


> > 
> > #### Bind
> > 
> > The **bind** operation assigns the address passed as parameter to the socket.
> > It corresponds to the bind system call. **sockid** is freely chosen by the
> > frontend and references this specific socket from this point forward. **Bind**,
> > **listen** and **accept** are the three operations required to have fully
> > working passive sockets and should be issued in this order.
> > 
> > Fields:
> > 
> > - **cmd** value: 4
> > - additional fields:
> >   - **addr**: address to bind to, in struct sockaddr format
> 
> Dito.
>
> >   - **len**: address length
> > 
> > Binary layout:
> > 
> >         16      20      24      28      32      36      40      44     48
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |                            addr                       |  len  |
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the bind system call
> 
> Again.

http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html


> > 
> > 
> > #### Listen
> > 
> > The **listen** operation marks the socket as a passive socket. It corresponds to
> > the listen system call.
> > 
> > Fields:
> > 
> > - **cmd** value: 5
> > - additional fields: none
> > 
> > Return value:
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the listen system call
> 
> Again.

http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html


> > 
> > 
> > #### Accept
> > 
> > The **accept** operation extracts the first connection request on the queue of
> > pending connections for the listening socket identified by **sockid** and
> > creates a new connected socket. The **sockid** of the new socket is also chosen
> > by the frontend and passed as an additional field of the accept request struct.
> > 
> > Similarly to the **connect** operation, **accept** creates a new data ring.
> > Information necessary to setup the new ring, such as grant table references and
> > event channel ports, are passed from the frontend to the backend as part of
> > the request.
> > 
> > The backend will reply to the request only when a new connection is successfully
> > accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.
> > 
> > Example workflow:
> > 
> > - frontend issues an **accept** request
> > - backend waits for a connection to be available on the socket
> > - a new connection becomes available
> > - backend accepts the new connection
> > - backend creates an internal mapping from **sockid** to the new socket
> > - backend maps all the grant references and uses them as shared memory for the
> >   new data ring
> > - backend binds the **evtchn**
> > - backend replies to the frontend
> > 
> > Fields:
> > 
> > - **cmd** value: 6
> > - additional fields:
> >   - **sockid**: id of the new socket
> >   - **ref**: grant references of the data ring
> >   - **evtchn**: port number of the evtchn to signal activity on the data ring
> > 
> > Binary layout:
> > 
> >         16      20      24      28      32      36      40      44     48
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |    sockid     |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] | 
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[6] |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[14]|ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[22]|ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[30]|ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[38]|ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[46]|ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[54]|ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |ref[62]|ref[63]|evtchn | 
> >         +-------+-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the accept system call
> 
> Again.

http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html


> > 
> > 
> > #### Poll
> > 
> > The **poll** operation is only valid for passive sockets. For active sockets,
> > the frontend should look at the state of the data ring. When a new connection is
> > available in the queue of the passive socket, the backend generates a response
> > and notifies the frontend.
> > 
> > Fields:
> > 
> > - **cmd** value: 7
> > - additional fields: none
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the poll system call
> 
> Again.

http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html


> > 
> > 
> > ### Data ring
> > 
> > Data rings are used for sending and receiving data over a connected socket. They
> > are created upon a successful **accept** or **connect** command. The ring works
> > in a similar way to the existing Xen console ring.
> > 
> > #### Format
> > 
> >     #define XENSOCK_DATARING_ORDER 6
> >     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> >     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> >     typedef uint32_t XENSOCK_RING_IDX;
> >     
> >     struct xensock_ring_intf {
> >     	char in[XENSOCK_DATARING_SIZE/4];
> >     	char out[XENSOCK_DATARING_SIZE/2];
> >     	XENSOCK_RING_IDX in_cons, in_prod;
> >     	XENSOCK_RING_IDX out_cons, out_prod;
> >     	int32_t in_error, out_error;
> >     };
> 
> So you are wasting nearly 64kB of memory?
> 
> Wouldn't it make more sense to have 1 page with the admin data (in_*,
> out_*) and the appropriate number of pages with the ring buffers? The
> admin page could be even the one holding the grant_ref_t array of the
> ring buffer pages needed for accept/connect.

Right, that's the same thing I was thinking. I think it's a good idea, I'll
try it.
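
[Sketch only: one possible shape for such an admin page, with illustrative
names. The indices and error fields are the ones already defined in the
draft, while ring_order and the trailing ref[] array are made up for the
example.]

    #include <stdint.h>

    typedef uint32_t XENSOCK_RING_IDX;
    typedef uint32_t grant_ref_t;

    /* Single shared admin page: indices, error fields and the grant
     * references of the separately granted in/out data pages. The number of
     * data pages would then be negotiated instead of being fixed. */
    struct xensock_dataring_intf {
        XENSOCK_RING_IDX in_cons, in_prod;
        XENSOCK_RING_IDX out_cons, out_prod;
        int32_t in_error, out_error;
        uint32_t ring_order;   /* log2 of the number of data pages */
        grant_ref_t ref[];     /* grants of the in/out data pages  */
    };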


> > The design is flexible and can support different ring sizes (at compile time).
> > The following description is based on order 6 rings, chosen because they provide
> > excellent performance.
> > 
> > - **in** is an array of 65536 bytes, used as circular buffer
> >   It contains data read from the socket. The producer is the backend, the
> >   consumer is the frontend.
> > - **out** is an array of 131072 bytes, used as circular buffer
> >   It contains data to be written to the socket. The producer is the frontend,
> >   the consumer is the backend.
> > - **in_cons** and **in_prod**
> >   Consumer and producer pointers for data read from the socket. They keep track
> >   of how much data has already been consumed by the frontend from the **in**
> >   array. **in_prod** is increased by the backend, after writing data to **in**.
> >   **in_cons** is increased by the frontend, after reading data from **in**.
> > - **out_cons**, **out_prod**
> >   Consumer and producer pointers for the data to be written to the socket. They
> >   keep track of how much data has been written by the frontend to **out** and
> >   how much data has already been consumed by the backend. **out_prod** is
> >   increased by the frontend, after writing data to **out**. **out_cons** is
> >   increased by the backend, after reading data from **out**.
> > - **in_error** and **out_error** They signal errors when reading from the socket
> >   (**in_error**) or when writing to the socket (**out_error**). 0 means no
> >   errors. When an error occurs, no further read or write operations are
> >   performed on the socket. In the case of an orderly socket shutdown (i.e. read
> >   returns 0) **in_error** is set to -ENOTCONN. **in_error** and **out_error**
> 
> Which value? I've found systems with: 57, 76, 107, 134 or 235 (just to
> make clear that even an errno name isn't optimal).

I naively assumed that the error codes were also defined by POSIX, but
it doesn't seem to be the case. If they are not standard, I'll have to
include a numeric representation of those error names and possibly do
conversions. I'll get to it in the next version. I think it makes sense
to use the existing xen/include/public/errno.h (credits to Roger for the
suggestion on IRC).
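
[Sketch only of the kind of conversion meant here; the helper name is made up
and the numeric values are simply the Linux ones, written out by hand for
illustration. Real code would take the constants from the agreed-upon header,
e.g. xen/include/public/errno.h, rather than redefining them.]

    #include <errno.h>
    #include <stdint.h>

    /* Illustrative protocol error values (Linux-compatible numbering). */
    #define XENSOCK_EIO        5
    #define XENSOCK_ENOMEM    12
    #define XENSOCK_ENOTCONN 107

    /* Translate a local (OS-specific) errno into the value written to
     * in_error/out_error, so that both ends agree on the numbering. */
    static int32_t xensock_errno(int local_err)
    {
        switch (local_err) {
        case EIO:      return -XENSOCK_EIO;
        case ENOMEM:   return -XENSOCK_ENOMEM;
        case ENOTCONN: return -XENSOCK_ENOTCONN;
        default:       return -XENSOCK_EIO;  /* placeholder fallback */
        }
    }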



> >   are never set to -EAGAIN or -EWOULDBLOCK.
> > 
> > The binary layout follows:
> > 
> >     0        65536           196608     196612    196616    196620   196624    196628   196632
> >     +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
> >     |    in    |      out       | in_cons | in_prod |out_cons |out_prod |in_error |out_error|
> >     +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
> >     
> > 
> > #### Workflow
> > 
> > The **in** and **out** arrays are used as circular buffers:
> >     
> >     0                               sizeof(array)
> >     +-----------------------------------+
> >     |to consume|    free    |to consume |
> >     +-----------------------------------+
> >                ^            ^
> >                prod         cons
> > 
> >     0                               sizeof(array)
> >     +-----------------------------------+
> >     |  free    | to consume |   free    |
> >     +-----------------------------------+
> >                ^            ^
> >                cons         prod
> > 
> > The following function is provided to calculate how many bytes are currently
> > left unconsumed in an array:
> > 
> >     #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & (ring_size-1))
> > 
> >     static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
> >     		XENSOCK_RING_IDX cons,
> >     		XENSOCK_RING_IDX ring_size)
> >     {
> >     	XENSOCK_RING_IDX size;
> >     
> >     	if (prod == cons)
> >     		return 0;
> >     
> >     	prod = _MASK_XENSOCK_IDX(prod, ring_size);
> >     	cons = _MASK_XENSOCK_IDX(cons, ring_size);
> >     
> >     	if (prod == cons)
> >     		return ring_size;
> >     
> >     	if (prod > cons)
> >     		size = prod - cons;
> >     	else {
> >     		size = ring_size - cons;
> >     		size += prod;
> >     	}
> >     	return size;
> >     }
> > 
> > The producer (the backend for **in**, the frontend for **out**) writes to the
> > array in the following way:
> > 
> > - read *cons*, *prod*, *error* from shared memory
> > - memory barrier
> > - return on *error*
> > - write to array at position *prod* up to *cons*, wrapping around the circular
> >   buffer when necessary
> > - memory barrier
> > - increase *prod*
> > - notify the other end via evtchn
> > 
> > The consumer (the backend for **out**, the frontend for **in**) reads from the
> > array in the following way:
> > 
> > - read *prod*, *cons*, *error* from shared memory
> > - memory barrier
> > - return on *error*
> > - read from array at position *cons* up to *prod*, wrapping around the circular
> >   buffer when necessary
> > - memory barrier
> > - increase *cons*
> > - notify the other end via evtchn
> > 
> > The producer takes care of writing only as many bytes as available in the buffer
> > up to *cons*. The consumer takes care of reading only as many bytes as available
> > in the buffer up to *prod*. *error* is set by the backend when an error occurs
> > writing or reading from the socket.
> > 


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 14:16   ` Stefano Stabellini
@ 2016-07-08 14:27     ` Juergen Gross
  2016-07-08 15:57     ` David Vrabel
  1 sibling, 0 replies; 14+ messages in thread
From: Juergen Gross @ 2016-07-08 14:27 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: lars.kurth, wei.liu2, david.vrabel, xen-devel, joao.m.martins,
	boris.ostrovsky, roger.pau

On 08/07/16 16:16, Stefano Stabellini wrote:
> Hi Juergen,
> 
> many thanks for the fast and very useful review!
> 
> 
> On Fri, 8 Jul 2016, Juergen Gross wrote:
>> On 08/07/16 13:23, Stefano Stabellini wrote:
>>>     #define XENSOCK_DATARING_ORDER 6
>>>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>>>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>>>     
>>>     #define XENSOCK_CONNECT        0
>>>     #define XENSOCK_RELEASE        3
>>>     #define XENSOCK_BIND           4
>>>     #define XENSOCK_LISTEN         5
>>>     #define XENSOCK_ACCEPT         6
>>>     #define XENSOCK_POLL           7
>>>     
>>>     struct xen_xensock_request {
>>>         uint32_t id;     /* private to guest, echoed in response */
>>>         uint32_t cmd;    /* command to execute */
>>>         uint64_t sockid; /* id of the socket */
>>>         union {
>>>             struct xen_xensock_connect {
>>>                 uint8_t addr[28];
>>>                 uint32_t len;
>>>                 uint32_t flags;
>>>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>>>                 uint32_t evtchn;
>>>             } connect;
>>>             struct xen_xensock_bind {
>>>                 uint8_t addr[28]; /* ipv6 ready */
>>>                 uint32_t len;
>>>             } bind;
>>>             struct xen_xensock_accept {
>>>                 uint64_t sockid;
>>>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>>>                 uint32_t evtchn;
>>>             } accept;
>>>         } u;
>>>     };
>>
>> Below you write the data ring is flexible and can support different
>> ring sizes. This is in contradiction to the definition above: as soon
>> as you modify the ring size you change the interface. You'd have to
>> modify all guests and the host at the same time.
> 
> Yeah, I meant at compile time (which I understand is not useful for
> anything other than embedded use cases). But you are right that it would
> be nice to be able to choose the ring size at runtime.
> 
> 
>> The flexibility should be kept, so I suggest ring size negotiation via
>> xenstore: the backend should indicate the maximum supported size and
>> the frontend should tell which size it is using. In the beginning I'd
>> see no problem with accepting connection only if both values are
>> XENSOCK_DATARING_PAGES.
> 
> I'll look into it.
> 
> 
>> The connect and accept calls should reference only one page (possibly
>> with an offset into that page) holding the grant_ref_t array of the
>> needed size.
> 
> It would be nice to send the refs as part of the request as done here,
> but I imagine that it would be an issue with a variable number of refs
> because everything in the request struct needs to be sized up at compile
> time. That's the reason why you are suggesting to send them separately,
> right?

Correct.

>>> The data ring format will be described in the following section.
>>>
>>> Fields:
>>>
>>> - **cmd** value: 0
>>> - additional fields:
>>>   - **addr**: address to connect to, in struct sockaddr format
>>
>> So you expect only Linux guests with the current sockaddr layout?
>> Please specify the structure in the interface.
> 
> I meant sockaddr as defined by POSIX (the Open Group standard):
> 
> http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html

Neither the size of sa_family_t nor the numeric values are defined
there.

>> Which value? I've found systems with: 57, 76, 107, 134 or 235 (just to
>> make clear that even an errno name isn't optimal).
> 
> I naively assumed that the error codes were also defined by POSIX, but
> it doesn't seem to be the case. If they are not standard, I'll have to
> include a numeric representation of those error names and possibly do
> conversions. I'll get to it in the next version. I think it makes sense
> to use the existing xen/include/public/errno.h (credits to Roger for the
> suggestion on IRC).

Sure, xen/include/public/errno.h is fine.


Juergen


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 14:16   ` Stefano Stabellini
  2016-07-08 14:27     ` Juergen Gross
@ 2016-07-08 15:57     ` David Vrabel
  2016-07-08 16:52       ` Stefano Stabellini
  1 sibling, 1 reply; 14+ messages in thread
From: David Vrabel @ 2016-07-08 15:57 UTC (permalink / raw)
  To: Stefano Stabellini, Juergen Gross
  Cc: lars.kurth, wei.liu2, david.vrabel, xen-devel, boris.ostrovsky,
	joao.m.martins, roger.pau

On 08/07/16 15:16, Stefano Stabellini wrote:
> 
> http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html

Are you really guaranteeing full POSIX semantics for all these calls?
And not say, POSIX-like except where Linux decides to differ because
POSIX is dumb?

How is the guest (which expects the semantics of its own OS) going to
know that connect(2) to an external IP is going to behave differently to
say connect(2) to localhost?

Given:

a) The difficulties in reconciling the differences in  behaviour and
features between remoted system calls and local ones.

b) The difficulty in fully specifying (and thus fully implementing) the
PV interface.

c) My belief that most of the advantages of this proposal can be
achieved with smarts in the backend.

I'm not sure there is much merit in discussing the finer points of the
protocol until these bigger architectural issues have been addressed.

David



* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 15:57     ` David Vrabel
@ 2016-07-08 16:52       ` Stefano Stabellini
  2016-07-08 17:10         ` David Vrabel
  0 siblings, 1 reply; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-08 16:52 UTC (permalink / raw)
  To: David Vrabel
  Cc: Juergen Gross, lars.kurth, wei.liu2, Stefano Stabellini,
	xen-devel, joao.m.martins, boris.ostrovsky, roger.pau

On Fri, 8 Jul 2016, David Vrabel wrote:
> On 08/07/16 15:16, Stefano Stabellini wrote:
> > 
> > http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
> 
> Are you really guaranteeing full POSIX semantics for all these calls?
> And not say, POSIX-like except where Linux decides to differ because
> POSIX is dumb?
> 
> How is the guest (which expects the semantics of its own OS) going to
> know that connect(2) to an external IP is going to behave differently to
> say connect(2) to localhost?
>
> a) The difficulties in reconciling the differences in  behaviour and
> features between remoted system calls and local ones.
> 
> b) The difficulty in fully specifying (and thus fully implementing) the
> PV interface.

I'll refrain from replying to these points because they are about the
implementation, rather than the protocol, which will be discussed
separately when I manage to publish the design document of the drivers.
I am confident we can solve these issues if we work together
constructively. I noticed you are not making any suggestions on how to
solve these issues, which would be good form when doing reviews.

I want to address the following point first:


> c) My belief that most of the advantages of this proposal can be
> achieved with smarts in the backend.

By backend do you mean netfront/netback? If so, I have already pointed
out why that is not the case in previous emails as well as in this
design document.

If you remain unconvinced of the usefulness of this work, that's OK, we
can agree to disagree. Many people work on things I don't believe
particularly useful myself. I am not asking you to spend any time on
this if you don't believe it serves a purpose. But please let the rest
of the community work constructively together.


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 16:52       ` Stefano Stabellini
@ 2016-07-08 17:10         ` David Vrabel
  2016-07-08 17:36           ` Stefano Stabellini
  0 siblings, 1 reply; 14+ messages in thread
From: David Vrabel @ 2016-07-08 17:10 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Juergen Gross, lars.kurth, wei.liu2, xen-devel, joao.m.martins,
	boris.ostrovsky, roger.pau

On 08/07/16 17:52, Stefano Stabellini wrote:
> 
>> c) My belief that most of the advantages of this proposal can be
>> achieved with smarts in the backend.
> 
> By backend do you mean netfront/netback? If so, I have already pointed
> out why that is not the case in previous emails as well as in this
> design document.
> 
> If you remain unconvinced of the usefulness of this work, that's OK, we
> can agree to disagree. Many people work on things I don't believe
> particularly useful myself. I am not asking you to spend any time on
> this if you don't believe it serves a purpose. But please let the rest
> of the community work constructively together.

If you're only going to implement userspace frontends and backends, then
sure I can step out of this discussion.  However, your initial prototype
drivers suggest you're aiming at in-kernel drivers.

David


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 11:23 [DRAFT 1] XenSock protocol design document Stefano Stabellini
  2016-07-08 12:14 ` Juergen Gross
@ 2016-07-08 17:11 ` David Vrabel
  2016-07-11 10:59   ` Stefano Stabellini
  2016-07-11 12:47 ` Paul Durrant
  2016-07-11 14:51 ` Joao Martins
  3 siblings, 1 reply; 14+ messages in thread
From: David Vrabel @ 2016-07-08 17:11 UTC (permalink / raw)
  To: Stefano Stabellini, xen-devel
  Cc: jgross, lars.kurth, wei.liu2, joao.m.martins, boris.ostrovsky, roger.pau

On 08/07/16 12:23, Stefano Stabellini wrote:
> 
> XenSocks provides the following benefits:
> * guest networking works out of the box with VPNs, wireless networks and
>   any other complex configurations on the host

Only in the trivial case where the host only has one external network.
Otherwise, you are going to have to have some sort of configuration to
keep guest traffic isolated from the management or storage network (for
example).

> * guest services listen on ports bound directly to the backend domain IP
>   addresses

I think this could be done with SDN but I'm no expert on this area.

> * localhost becomes a secure namespace for intra-VMs communications

I presume you mean "inter-VM" communication here?  This is already
achievable with a private bridged network for VMs on a host.

> * full visibility of the guest behavior on the backend domain, allowing
>   for inexpensive filtering and manipulation of any guest calls

There's many existing solutions in this space for networking.

> * excellent performance

netback/netfront is pretty good now and further improvements to them
would have wider benefits.

David


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 17:10         ` David Vrabel
@ 2016-07-08 17:36           ` Stefano Stabellini
  0 siblings, 0 replies; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-08 17:36 UTC (permalink / raw)
  To: David Vrabel
  Cc: Juergen Gross, lars.kurth, wei.liu2, Stefano Stabellini,
	xen-devel, joao.m.martins, boris.ostrovsky, roger.pau

On Fri, 8 Jul 2016, David Vrabel wrote:
> On 08/07/16 17:52, Stefano Stabellini wrote:
> > 
> >> c) My belief that most of the advantages of this proposal can be
> >> achieved with smarts in the backend.
> > 
> > By backend do you mean netfront/netback? If so, I have already pointed
> > out why that is not the case in previous emails as well as in this
> > design document.
> > 
> > If you remain unconvinced of the usefulness of this work, that's OK, we
> > can agree to disagree. Many people work on things I don't believe
> > particularly useful myself. I am not asking you to spend any time on
> > this if you don't believe it serves a purpose. But please let the rest
> > of the community work constructively together.
> 
> If you're only going to implement userspace frontends and backends, then
> sure I can step out of this discussion.  However, your initial prototype
> drivers suggest you're aiming at in-kernel drivers.
 
We are discussing the protocol here. There could be only userspace
drivers. Reservations about Linux drivers should be expressed elsewhere.

Even if there were kernel drivers, don't worry about maintenance. I
wouldn't ask you to carry the burden. If/when they are ready to be
merged, I would be quite happy to take it on (similarly to what happened
with Juergen and his good work on the PV SCSI frontend and backend). Due
to their nature, they'll be quite self-contained. And fortunately we
have many other Linux maintainers and Linux experts in the Xen
community, who can provide constructive feedback. We are a healthy
community after all and we can rely on each other.


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 17:11 ` David Vrabel
@ 2016-07-11 10:59   ` Stefano Stabellini
  0 siblings, 0 replies; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-11 10:59 UTC (permalink / raw)
  To: David Vrabel
  Cc: jgross, lars.kurth, wei.liu2, Stefano Stabellini, xen-devel,
	joao.m.martins, boris.ostrovsky, roger.pau

On Fri, 8 Jul 2016, David Vrabel wrote:
> On 08/07/16 12:23, Stefano Stabellini wrote:
> > 
> > XenSocks provides the following benefits:
> > * guest networking works out of the box with VPNs, wireless networks and
> >   any other complex configurations on the host
> 
> Only in the trivial case where the host only has one external network.

Which is the most common case and the one we care about the most.


> Otherwise, you are going to have to have some sort of configuration to
> keep guest traffic isolated from the management or storage network (for
> example).

I admit I don't think I understand your example, please add more
details.

In any case how would you achieve this benefit with netfront/netback?


> > * guest services listen on ports bound directly to the backend domain IP
> >   addresses
> 
> I think this could be done with SDN but I'm no expert on this area.

Maybe. But a simple Google search didn't turn up anything useful on
this. The solution used by Docker to achieve this is very expensive in
terms of resources.

In fact even if you are right, these are complex and expensive solutions
you are talking about. It would likely require some sort of address and
port translation. XenSock is a simple solution and the best way to solve
this problem. I don't want to configure an SDN, iptables and whatnot
just to have guest ports bound on the host. The more complexity one
introduces, the more difficult security and maintenance become.
Performance suffers too.


> > * localhost becomes a secure namespace for intra-VMs communications
> 
> I presume you mean "inter-VM" communication here?

Yes, I meant inter-VM, sorry for the confusion.


> This is already achievable with a private bridged network for VMs on a
> host.

Wouldn't that require one more virtual interface per VM?


> > * full visibility of the guest behavior on the backend domain, allowing
> >   for inexpensive filtering and manipulation of any guest calls
> 
> There's many existing solutions in this space for networking.

One of the most important points of this work is that users don't need to
use those "existing solutions in this space for networking". They are
expensive (both in terms of money and performance) and suboptimal. They
are never going to have the level of visibility and control that we
could have with XenSock.


> > * excellent performance
> 
> netback/netfront is pretty good now and further improvements to them
> would have wider benefits.
 
You are saying that one could achieve the same benefits as XenSock with:

netfront/netback + some zero configuration tool for netfront/netback +
SDN + a network based application firewall + one more virtual interface
per VM

I feel comfortable stating that XenSock is far better than a
combination of 4 or 5 complex moving pieces.

I admit that Citrix XenServer won't directly benefit from XenSock, at
least not in the short term. But the Xen Community is much wider than
XenServer. People have already pointed out to me why this would be
useful for them.


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 11:23 [DRAFT 1] XenSock protocol design document Stefano Stabellini
  2016-07-08 12:14 ` Juergen Gross
  2016-07-08 17:11 ` David Vrabel
@ 2016-07-11 12:47 ` Paul Durrant
  2016-07-12 17:39   ` Stefano Stabellini
  2016-07-11 14:51 ` Joao Martins
  3 siblings, 1 reply; 14+ messages in thread
From: Paul Durrant @ 2016-07-11 12:47 UTC (permalink / raw)
  To: Stefano Stabellini, xen-devel
  Cc: jgross, Lars Kurth, Wei Liu, David Vrabel, joao.m.martins,
	boris.ostrovsky, Roger Pau Monne

> -----Original Message-----
[snip]
> 
> # XenSocks Protocol v1
> 
> ## Rationale
> 
> XenSocks is a paravirtualized protocol for the POSIX socket API.
> 
> The purpose of XenSocks is to allow the implementation of a specific set
> of POSIX calls to be done in a domain other than your own. It allows
> connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
> implemented in another domain.

Does the other domain have privilege over the domain issuing the POSIX calls?

[snip]
> #### State Machine
> 
>     **Front**                             **Back**
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.                          |
>     - Allocate and initialize the                        |
>       request ring.                                      V
>     - Publish transport parameters                XenbusStateInitWait
>       that will be in effect during
>       this connection.
>                  |
>                  |
>                  V
>        XenbusStateInitialised
> 
>                                           - Query frontend transport parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
> 
>      - Query backend device properties.
>      - Finalize OS virtual device
>        instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
> 
> Once frontend and backend are connected, they have a shared page, which
> is used to exchange messages over a ring, and an event channel,
> which is used to send notifications.
> 

What about XenbusStateClosing and XenbusStateClosed? We're missing half the state model here. Specifically how do individual connections get terminated if either end moves to closing? Does either end have to wait for the other?

> 
> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which will be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES`
> macro
> (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES <<
> PAGE_SHIFT)
> 

Why a fixed size? Also, I assume DATARING should be CMDRING or somesuch here. Plus a fixed size of *six* pages seems like a lot.

>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
> 
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };
> 

Perhaps some layout diagrams for the above to avoid ABI assumptions?

> The first three fields are common for every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 

That's a start at least :-)

> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7
> - **sockid** is generated by the frontend and identifies the socket to
> connect,
>   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and
> `XENSOCK_BIND`
>   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the
> new
>   socket.
> 

[snip]
> #### Connect
> 
> The **connect** operation corresponds to the connect system call. It
> connects a
> socket to the specified address. **sockid** is freely chosen by the frontend
> and
> references this specific socket from this point forward.
> 
> The connect operation creates a new shared ring, which we'll call **data
> ring**.
> The new ring is used to send and receive data over the connected socket.
> Information necessary to setup the new ring, such as grant table references
> and
> event channel ports, are passed from the frontend to the backend as part of
> this request. A **data ring** is unmapped and freed upon issuing a
> **release**
> command on the active socket identified by **sockid**.
> 
> When the frontend issues a **connect** command, the backend:
> - creates a new socket and connects it to **addr**
> - creates an internal mapping from **sockid** to its own socket
> - maps all the grant references and uses them as shared memory for the new
> data
>   ring
> - bind the **evtchn**
> - replies to the frontend
> 
> The data ring format will be described in the following section.
> 
> Fields:
> 
> - **cmd** value: 0
> - additional fields:
>   - **addr**: address to connect to, in struct sockaddr format
>   - **len**: address length
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant references of the data ring
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[63]|evtchn |
>         +-------+-------+
> 

So you really do want to bake a 64 page ring into the protocol then?

> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the socket system call
> 

The socket system call on which OS?

> #### Bind
> 
> The **bind** operation assigns the address passed as parameter to the
> socket.
> It corresponds to the bind system call.

Is a domain allowed to bind to a privileged port in the backend domain?

> **sockid** is freely chosen by the
> frontend and references this specific socket from this point forward.
> **Bind**,
> **listen** and **accept** are the three operations required to have fully
> working passive sockets and should be issued in this order.
> 
> Fields:
> 
> - **cmd** value: 4
> - additional fields:
>   - **addr**: address to bind to, in struct sockaddr format
>   - **len**: address length
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the bind system call
> 
> 
> #### Listen
> 
> The **listen** operation marks the socket as a passive socket. It
> corresponds to
> the listen system call.

...which also takes a 'backlog' parameter, which doesn't seem to be specified here.

> 
> Fields:
> 
> - **cmd** value: 5
> - additional fields: none
> 
> Return value:
>   - 0 on success
>   - less than 0 on failure, see the error codes of the listen system call
> 
> 

[snip]
> ### Data ring
> 
> Data rings are used for sending and receiving data over a connected socket.
> They
> are created upon a successful **accept** or **connect** command. The
> ring works
> in a similar way to the existing Xen console ring.
> 
> #### Format
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES <<
> PAGE_SHIFT)
>     typedef uint32_t XENSOCK_RING_IDX;
> 
>     struct xensock_ring_intf {
>     	char in[XENSOCK_DATARING_SIZE/4];
>     	char out[XENSOCK_DATARING_SIZE/2];

Why have differing sizes for the rings?

>     	XENSOCK_RING_IDX in_cons, in_prod;
>     	XENSOCK_RING_IDX out_cons, out_prod;
>     	int32_t in_error, out_error;
>     };
> 
> The design is flexible and can support different ring sizes (at compile time).
> The following description is based on order 6 rings, chosen because they
> provide
> excellent performance.
> 

What about datagram sockets? Raw sockets? Setting socket options? Etc.

  Paul


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-08 11:23 [DRAFT 1] XenSock protocol design document Stefano Stabellini
                   ` (2 preceding siblings ...)
  2016-07-11 12:47 ` Paul Durrant
@ 2016-07-11 14:51 ` Joao Martins
  2016-07-13 11:06   ` Stefano Stabellini
  3 siblings, 1 reply; 14+ messages in thread
From: Joao Martins @ 2016-07-11 14:51 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: jgross, lars.kurth, wei.liu2, david.vrabel, xen-devel,
	boris.ostrovsky, roger.pau

On 07/08/2016 12:23 PM, Stefano Stabellini wrote:
> Hi all,
> 
Hey!

[...]

> 
> ## Design
> 
> ### Xenstore
> 
> The frontend and the backend connect to each other exchanging information via
> xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. There can only be one XenSock frontend per domain.
> 
> #### Frontend XenBus Nodes
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.

Would it make sense to export the minimum, default and maximum socket sizes over
xenstore entries? These normally follow a convention depending on the type of socket
(and OS) you have, or are otherwise settable through socket options.
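
[Sketch only; the node names are purely illustrative.] For instance, following
the existing node convention, the backend could publish something like:

    max-sndbuf / max-rcvbuf (backend nodes)
         Values:         <uint32_t>

         Maximum send/receive buffer sizes, in bytes, supported by the
         backend for a single socket.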


> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which will be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     
>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
>     
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };
> 
> The first three fields are common for every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 
> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7
> - **sockid** is generated by the frontend and identifies the socket to connect,
>   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
>   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
>   socket.
>   
Interesting - have you considered making setsockopt and getsockopt part of this? There
are some common options (POSIX-defined) and then some more exotic, Linux- or
FreeBSD-specific flavors: say SO_REUSEPORT, used by nginx and good for load balancing
across a set of workers, or Linux's SO_BUSY_POLL for low-latency sockets. I am not
sure how sensible it is to start exposing all of these socket options rather than
limiting them to a specific subset - or maybe it doesn't make sense for your case at
all; see my further suggestion regarding the data ring below.

> All three fields are echoed back by the backend.
> 
> As for the other Xen ring based protocols, after writing a request to the ring,
> the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> channel notification when a notification is required.
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
> The format is the following:
> 
>     struct xen_xensock_response {
>         uint32_t id;
>         uint32_t cmd;
>         uint64_t sockid;
>         int32_t ret;
>     };
>    
>     0       4       8       12      16      20
>     +-------+-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |  ret  |
>     +-------+-------+-------+-------+-------+
> 
> - **id**: echoed back from request
> - **cmd**: echoed back from request
> - **sockid**: echoed back from request
> - **ret**: return value, identifies success or failure
> 
Are these fields taken from a specific OS (I assume Linux)? The id, cmd and ret
fields could perhaps be smaller overall, or maybe not - either way it would be
useful for the spec to state whether it follows a specific OS.

[...]

> The design is flexible and can support different ring sizes (at compile time).
> The following description is based on order 6 rings, chosen because they provide
> excellent performance.
> 
> - **in** is an array of 65536 bytes, used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array of 131072 bytes, used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
Could this size be a tunable, intercepting RCVBUF and SNDBUF sockopt adjustments
(these two are POSIX-defined), assuming of course that in this proposal you want to
replicate the local and remote socket? In other words, dynamically allocate how much
the socket will use for sending/receiving, which would translate into the number of
grants in use. Even doing it with xenstore entries in the backend would be better,
given that the user may want to adjust the send/receive buffers to whatever the
application needs. Ideally this would be dynamic per socket, instead of defined at
compile time, and would allow more sockets on the same VM without overshooting the
grant table limits.

Joao


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-11 12:47 ` Paul Durrant
@ 2016-07-12 17:39   ` Stefano Stabellini
  0 siblings, 0 replies; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-12 17:39 UTC (permalink / raw)
  To: Paul Durrant
  Cc: jgross, Lars Kurth, Wei Liu, Stefano Stabellini, David Vrabel,
	xen-devel, joao.m.martins, boris.ostrovsky, Roger Pau Monne

On Mon, 11 Jul 2016, Paul Durrant wrote:
> > -----Original Message-----
> [snip]
> > 
> > # XenSocks Protocol v1
> > 
> > ## Rationale
> > 
> > XenSocks is a paravirtualized protocol for the POSIX socket API.
> > 
> > The purpose of XenSocks is to allow the implementation of a specific set
> > of POSIX calls to be done in a domain other than your own. It allows
> > connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
> > implemented in another domain.
> 
> Does the other domain have privilege over the domain issuing the POSIX calls?

I don't have a strong opinion on this. In my scenario the backend is in
fact always dom0, but so far nothing in the protocol would prevent
XenSock from being used with driver domains AFAICT. Maybe writing down
that the backend needs to be privileged would allow us to take some
shortcuts in the future, but as there are none at the moment, I don't
think we should make this a requirement. What do you think?


> [snip]
> > #### State Machine
> > 
> >     **Front**                             **Back**
> >     XenbusStateInitialising               XenbusStateInitialising
> >     - Query virtual device                - Query backend device
> >       properties.                           identification data.
> >     - Setup OS device instance.                          |
> >     - Allocate and initialize the                        |
> >       request ring.                                      V
> >     - Publish transport parameters                XenbusStateInitWait
> >       that will be in effect during
> >       this connection.
> >                  |
> >                  |
> >                  V
> >        XenbusStateInitialised
> > 
> >                                           - Query frontend transport parameters.
> >                                           - Connect to the request ring and
> >                                             event channel.
> >                                                          |
> >                                                          |
> >                                                          V
> >                                                  XenbusStateConnected
> > 
> >      - Query backend device properties.
> >      - Finalize OS virtual device
> >        instance.
> >                  |
> >                  |
> >                  V
> >         XenbusStateConnected
> > 
> > Once frontend and backend are connected, they have a shared page, which
> > is used to exchange messages over a ring, and an event channel,
> > which is used to send notifications.
> > 
> 
> What about XenbusStateClosing and XenbusStateClosed? We're missing half the state model here. Specifically how do individual connections get terminated if either end moves to closing? Does either end have to wait for the other?

I admit I "took inspiration" from xen/include/public/io/blkif.h, which
is also missing the closing steps. I'll try to add them. (If you know of
any existing descriptions of a XenBus closing protocol please let me
know.)
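
For what it is worth, one possible closing sequence, sketched in the same style as
the state machine above (this is not part of the draft, just a starting point):

    **Front**                             **Back**
    XenbusStateConnected                  XenbusStateConnected
    - Stop issuing new requests,                         |
      wait for outstanding responses.                    |
                 |                                       |
                 V                                       |
        XenbusStateClosing   ----------->  - Drain the rings, unmap the
                                             shared pages, close the
                                             event channels.
                                                         |
                                                         V
                                                 XenbusStateClosing
    - Free the rings and grants.                         |
                 |                                       |
                 V                                       V
        XenbusStateClosed                        XenbusStateClosed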


> > 
> > ### Commands Ring
> > 
> > The shared ring is used by the frontend to forward socket API calls to the
> > backend. I'll refer to this ring as **commands ring** to distinguish it from
> > other rings which will be created later in the lifecycle of the protocol (data
> > rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> > (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> > using the `RING_GET_REQUEST` macro.
> > 
> > The format is defined as follows:
> > 
> >     #define XENSOCK_DATARING_ORDER 6
> >     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> >     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> > 
> 
> Why a fixed size? Also, I assume DATARING should be CMDRING or somesuch here. Plus a fixed order of *six* (64 pages) seems like a lot.

This is going to be changed and significantly improved following
Juergen's suggestion.

 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the socket system call
> > 
> 
> The socket system call on which OS?

I'll add more info on this. I'll try to stick to POSIX as much as I can,
defining explicitly anything which is not specified by it (such as error
numbers).
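
For example, one way to keep "ret" OS-independent would be for the protocol to
define its own error values and have each backend map its native errno onto them.
The numbers and the mapping below are invented, just to show the idea:

    #include <errno.h>
    #include <stdint.h>

    #define XENSOCK_EACCES        (-1)
    #define XENSOCK_ECONNREFUSED  (-2)
    #define XENSOCK_EINVAL        (-3)  /* catch-all */

    static int32_t xensock_map_errno(int native_err)
    {
        switch (native_err) {
        case EACCES:        return XENSOCK_EACCES;
        case ECONNREFUSED:  return XENSOCK_ECONNREFUSED;
        default:            return XENSOCK_EINVAL;
        }
    }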


> > #### Bind
> > 
> > The **bind** operation assigns the address passed as parameter to the
> > socket.
> > It corresponds to the bind system call.
> 
> Is a domain allowed to bind to a privileged port in the backend domain?

I would let the backend decide: the backend can return -EACCES if it
doesn't want to allow access to a given port.
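
Something like the following backend-side check, purely to show where such a policy
would live (the 1024 cut-off is an example, not a requirement of the protocol):

    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdint.h>

    static int backend_check_bind(const struct sockaddr_in *addr)
    {
        uint16_t port = ntohs(addr->sin_port);

        /* example policy: refuse guest binds to host-privileged ports */
        if (port != 0 && port < 1024)
            return -EACCES;
        return 0;
    }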


> > **sockid** is freely chosen by the
> > frontend and references this specific socket from this point forward.
> > **Bind**,
> > **listen** and **accept** are the three operations required to have fully
> > working passive sockets and should be issued in this order.
> > 
> > Fields:
> > 
> > - **cmd** value: 4
> > - additional fields:
> >   - **addr**: address to bind to, in struct sockaddr format
> >   - **len**: address length
> > 
> > Binary layout:
> > 
> >         16      20      24      28      32      36      40      44     48
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> >         |                            addr                       |  len  |
> >         +-------+-------+-------+-------+-------+-------+-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - less than 0 on failure, see the error codes of the bind system call
> > 
> > 
> > #### Listen
> > 
> > The **listen** operation marks the socket as a passive socket. It corresponds to
> > the listen system call.
> 
> ...which also takes a 'backlog' parameter, which doesn't seem to be specified here.

Fixed, thanks!


> >     	XENSOCK_RING_IDX in_cons, in_prod;
> >     	XENSOCK_RING_IDX out_cons, out_prod;
> >     	int32_t in_error, out_error;
> >     };
> > 
> > The design is flexible and can support different ring sizes (at compile time).
> > The following description is based on order 6 rings, chosen because they provide
> > excellent performance.
> > 
> 
> What about datagram sockets? Raw sockets? Setting socket options? Etc.

All currently unimplemented. Probably they are not going to be part of
the initial version of the protocol, but it would be nice if the
protocol was flexible enough to allow somebody in the future to jump in
and add them without too much trouble.
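
Purely as an illustration of that kind of flexibility, a future revision could for
instance introduce an explicit socket-creation command carrying the socket type, so
that datagram sockets can reuse the same commands ring. Everything below is invented:

    #define XENSOCK_SOCKET         8   /* hypothetical new command */

    struct xen_xensock_socket {
        uint32_t domain;    /* e.g. AF_INET, AF_INET6 */
        uint32_t type;      /* e.g. SOCK_STREAM, SOCK_DGRAM */
        uint32_t protocol;
    };
    /* added to the union in struct xen_xensock_request; existing commands
     * would keep their current semantics for stream sockets */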


* Re: [DRAFT 1] XenSock protocol design document
  2016-07-11 14:51 ` Joao Martins
@ 2016-07-13 11:06   ` Stefano Stabellini
  0 siblings, 0 replies; 14+ messages in thread
From: Stefano Stabellini @ 2016-07-13 11:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: jgross, lars.kurth, wei.liu2, Stefano Stabellini, david.vrabel,
	xen-devel, boris.ostrovsky, roger.pau

On Mon, 11 Jul 2016, Joao Martins wrote:
> On 07/08/2016 12:23 PM, Stefano Stabellini wrote:
> > Hi all,
> > 
> Hey!
> 
> [...]
> 
> > 
> > ## Design
> > 
> > ### Xenstore
> > 
> > The frontend and the backend connect to each other exchanging information via
> > xenstore. The toolstack creates front and back nodes with state
> > XenbusStateInitialising. There can only be one XenSock frontend per domain.
> > 
> > #### Frontend XenBus Nodes
> > 
> > port
> >      Values:         <uint32_t>
> > 
> >      The identifier of the Xen event channel used to signal activity
> >      in the ring buffer.
> > 
> > ring-ref
> >      Values:         <uint32_t>
> > 
> >      The Xen grant reference granting permission for the backend to map
> >      the sole page in a single page sized ring buffer.
> 
> Would it make sense to export the minimum, default and maximum socket buffer sizes
> over xenstore entries? These normally follow a convention depending on the type of
> socket (and OS), or are otherwise settable through socket options.

It makes sense, Juergen suggested something similar. I am thinking of
passing the maximum order of the data ring.
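
A backend node along these lines would fit the existing XenBus node descriptions;
the name and wording are only illustrative, not part of the draft:

    max-dataring-order
         Values:         <uint32_t>

         The maximum order (log2 of the number of pages) of a per-socket
         data ring that the backend is willing to map.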

 
> > ### Commands Ring
> > 
> > The shared ring is used by the frontend to forward socket API calls to the
> > backend. I'll refer to this ring as **commands ring** to distinguish it from
> > other rings which will be created later in the lifecycle of the protocol (data
> > rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> > (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> > using the `RING_GET_REQUEST` macro.
> > 
> > The format is defined as follows:
> > 
> >     #define XENSOCK_DATARING_ORDER 6
> >     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> >     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> >     
> >     #define XENSOCK_CONNECT        0
> >     #define XENSOCK_RELEASE        3
> >     #define XENSOCK_BIND           4
> >     #define XENSOCK_LISTEN         5
> >     #define XENSOCK_ACCEPT         6
> >     #define XENSOCK_POLL           7
> >     
> >     struct xen_xensock_request {
> >         uint32_t id;     /* private to guest, echoed in response */
> >         uint32_t cmd;    /* command to execute */
> >         uint64_t sockid; /* id of the socket */
> >         union {
> >             struct xen_xensock_connect {
> >                 uint8_t addr[28];
> >                 uint32_t len;
> >                 uint32_t flags;
> >                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
> >                 uint32_t evtchn;
> >             } connect;
> >             struct xen_xensock_bind {
> >                 uint8_t addr[28]; /* ipv6 ready */
> >                 uint32_t len;
> >             } bind;
> >             struct xen_xensock_accept {
> >                 uint64_t sockid;
> >                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
> >                 uint32_t evtchn;
> >             } accept;
> >         } u;
> >     };
> > 
> > The first three fields are common for every command. Their binary layout
> > is:
> > 
> >     0       4       8       12      16
> >     +-------+-------+-------+-------+
> >     |  id   |  cmd  |     sockid    |
> >     +-------+-------+-------+-------+
> > 
> > - **id** is generated by the frontend and identifies one specific request
> > - **cmd** is the command requested by the frontend:
> >     - `XENSOCK_CONNECT`: 0
> >     - `XENSOCK_RELEASE`: 3
> >     - `XENSOCK_BIND`:    4
> >     - `XENSOCK_LISTEN`:  5
> >     - `XENSOCK_ACCEPT`:  6
> >     - `XENSOCK_POLL`:    7
> > - **sockid** is generated by the frontend and identifies the socket to connect,
> >   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
> >   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
> >   socket.
> >   
> Interesting - have you considered making setsockopt and getsockopt part of this? There
> are some common options (POSIX-defined) and then some more exotic, Linux- or
> FreeBSD-specific flavors: say SO_REUSEPORT, used by nginx and good for load balancing
> across a set of workers, or Linux's SO_BUSY_POLL for low-latency sockets. I am not
> sure how sensible it is to start exposing all of these socket options rather than
> limiting them to a specific subset - or maybe it doesn't make sense for your case at
> all; see my further suggestion regarding the data ring below.

I have considered it, but I thought that they might be better suited for
a v2 version of the spec. This protocol needs to be extensible and
adding two new operations such as setsockopt and getsockopt should be
the simplest thing to do. Old backends should return ENOTSUPP. I'll
mention this explicitly in the next draft.
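
Just to illustrate the kind of extension meant here, a hypothetical v2 command could
look something like the sketch below; the command number, field sizes and the option
subset are all invented:

    #define XENSOCK_SETSOCKOPT     9   /* hypothetical new command */

    struct xen_xensock_setsockopt {
        uint32_t level;      /* e.g. SOL_SOCKET */
        uint32_t optname;    /* restricted to a POSIX-defined subset */
        uint32_t optlen;
        uint8_t  optval[16]; /* small fixed buffer keeps the union compact */
    };
    /* added to the union in struct xen_xensock_request; a backend that does
     * not recognise the cmd simply replies with ret = -ENOTSUPP */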


> > All three fields are echoed back by the backend.
> > 
> > As for the other Xen ring based protocols, after writing a request to the ring,
> > the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> > channel notification when a notification is required.
> > 
> > Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
> > The format is the following:
> > 
> >     struct xen_xensock_response {
> >         uint32_t id;
> >         uint32_t cmd;
> >         uint64_t sockid;
> >         int32_t ret;
> >     };
> >    
> >     0       4       8       12      16      20
> >     +-------+-------+-------+-------+-------+
> >     |  id   |  cmd  |     sockid    |  ret  |
> >     +-------+-------+-------+-------+-------+
> > 
> > - **id**: echoed back from request
> > - **cmd**: echoed back from request
> > - **sockid**: echoed back from request
> > - **ret**: return value, identifies success or failure
> > 
> Are these fields taken from a specific OS (I assume Linux)? The id, cmd and ret
> fields could perhaps be smaller overall, or maybe not - either way it would be
> useful for the spec to state whether it follows a specific OS.

Will do.


> [...]
> 
> > The design is flexible and can support different ring sizes (at compile time).
> > The following description is based on order 6 rings, chosen because they provide
> > excellent performance.
> > 
> > - **in** is an array of 65536 bytes, used as circular buffer
> >   It contains data read from the socket. The producer is the backend, the
> >   consumer is the frontend.
> > - **out** is an array of 131072 bytes, used as circular buffer
> >   It contains data to be written to the socket. The producer is the frontend,
> >   the consumer is the backend.
> Could this size be a tunable, intercepting RCVBUF and SNDBUF sockopt adjustments
> (these two are POSIX-defined), assuming of course that in this proposal you want to
> replicate the local and remote socket? In other words, dynamically allocate how much
> the socket will use for sending/receiving, which would translate into the number of
> grants in use. Even doing it with xenstore entries in the backend would be better,
> given that the user may want to adjust the send/receive buffers to whatever the
> application needs. Ideally this would be dynamic per socket, instead of defined at
> compile time, and would allow more sockets on the same VM without overshooting the
> grant table limits.

I am working on changing the spec to make the size of the data ring
configurable per socket. Each socket will be able to have a ring of a
different size (I am adding a per-socket ring_order parameter). Hooking
it all up with RCVBUF and SNDBUF should be possible, but I'll leave it
for the future.
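
One possible shape for that change, sketched purely for illustration (the field
names are made up and the indirect page is just one option):

    struct xen_xensock_connect {
        uint8_t  addr[28];
        uint32_t len;
        uint32_t flags;
        uint32_t ring_order;  /* data ring is (1 << ring_order) pages */
        grant_ref_t ref;      /* grants one page that in turn lists the
                                 grant references of the data ring pages */
        uint32_t evtchn;
    };

This would keep the request size fixed regardless of the per-socket ring order, at
the cost of one extra page to map per data ring.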


Thread overview: 14+ messages
2016-07-08 11:23 [DRAFT 1] XenSock protocol design document Stefano Stabellini
2016-07-08 12:14 ` Juergen Gross
2016-07-08 14:16   ` Stefano Stabellini
2016-07-08 14:27     ` Juergen Gross
2016-07-08 15:57     ` David Vrabel
2016-07-08 16:52       ` Stefano Stabellini
2016-07-08 17:10         ` David Vrabel
2016-07-08 17:36           ` Stefano Stabellini
2016-07-08 17:11 ` David Vrabel
2016-07-11 10:59   ` Stefano Stabellini
2016-07-11 12:47 ` Paul Durrant
2016-07-12 17:39   ` Stefano Stabellini
2016-07-11 14:51 ` Joao Martins
2016-07-13 11:06   ` Stefano Stabellini
