[DOC v1] Xen transport for 9pfs

* [DOC v1] Xen transport for 9pfs
@ 2016-11-29 23:34 Stefano Stabellini
  2016-11-30 10:59 ` David Vrabel
  2016-12-01 10:56 ` Dario Faggioli
  0 siblings, 2 replies; 10+ messages in thread
From: Stefano Stabellini @ 2016-11-29 23:34 UTC (permalink / raw)
  To: xen-devel; +Cc: stefano, wei.liu2

# Xen transport for 9pfs version 1 

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent from the communication channels, in fact many
clients and servers support multiple channels, usually called
"transports". For example the Linux client supports tcp and unix
sockets, fds, virtio and rdma.

### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However it is useful to know that each 9pfs request and response has the
following header:

    struct header {
    	uint32_t size;
    	uint8_t id;
    	uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
The size of the request or response.

- *id*
The 9pfs request or response operation.

- *tag*
Unique id that identifies a specific request/response pair. It is used
to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.

## Rationale

This document describes a Xen based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and send back responses.

Each 9pfs connection corresponds to a single filesystem share.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.

## Xenstore

The frontend and the backend connect via xenstore to exchange
information. The toolstack creates front and back nodes with state
[XenbusStateInitialising]. The protocol node name is **9pfs**.

Multiple rings are supported for each frontend and backend connection.

### Frontend XenBus Nodes

    num-rings
         Values:         <uint32_t>

         Number of rings. It needs to be lower or equal to max-rings.

    port-<num> (port-0, port-1, etc)
         Values:         <uint32_t>

         The identifier of the Xen event channel used to signal activity
         in the ring buffer. One for each ring.

    ring-ref-<num> (ring-ref-0, ring-ref-1, etc)
         Values:         <uint32_t>

         The Xen grant reference granting permission for the backend to
         map a page with information to setup a share ring. One for each
         ring.

### Backend XenBus Nodes

Backend specific properties, written by the backend, read by the
frontend:

    version
         Values:         <uint32_t>

         Protocol version supported by the backend. Currently the value is
         1.

    max-rings
         Values:         <uint32_t>

         The maximum supported number of rings.

    max-ring-page-order
         Values:         <uint32_t>

         The maximum supported size of a memory allocation in units of
         lb(machine pages), e.g. 0 == 1 page,  1 = 2 pages, 2 == 4 pages,
         etc.

Backend configuration nodes, written by the toolstack, read by the
backend:

    path
         Values:         <string>

         Host filesystem path to share.

    tag
         Values:         <string>

         Alphanumeric tag that identifies the 9pfs share. The client needs
         to know the tag to mount it.

    security_model
         Values:         "none"

         *none*: files are stored using the same credentials as they are
                 created on the guest
         Only "none" is supported in this version of the protocol.

### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                      |
    - Publish transport parameters                       |
      that will be in effect during                      V
      this connection.                            XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page per
ring, which are used to setup the rings, and an event channel per ring,
which are used to send notifications.

Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
       XenbusStateClosing

                                       - Unmap grants
                                       - Unbind evtchns
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosing

    - Unbind evtchns
    - Free rings
    - Free data structures
               |
               |
               V
       XenbusStateClosed

                                       - Free remaining data structures
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosed

## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
    	XEN_9PFS_RING_IDX in_cons, in_prod;
    	XEN_9PFS_RING_IDX out_cons, out_prod;

    	uint32_t ring_order;
    	grant_ref_t ref[];
    };

    /* not actually C compliant (ring_order changes from socket to socket) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguosly in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The array must
  have a power of two number of elements.
- **out** is an array used as circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons**
  is increased by the backend, after reading data from **out**.
- **in_cons** and **in_prod**
  Consumer and producer indices for responses. They keep track of how
  much data has already been consumed by the frontend from the **in**
  array. **in_prod** is increased by the backend, after writing data to
  **in**.  **in_cons** is increased by the frontend, after reading data
  from **in**.

The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8         12        16        20
    +---------+---------+---------+---------+---------+
    | in_cons | in_prod |out_cons |out_prod |ring_orde|
    +---------+---------+---------+---------+---------+

    20        24        26      4092      4096
    +---------+---------+----//---+---------+
    |  ref[0] |  ref[1] |         |  ref[N] |
    +---------+---------+----//---+---------+

**N.B** For one page, N is maximum 1019 ((4096-20)/4), but given that N
needs to be a power of two, actually max N is 512.
The binary layout of the ring buffers follow:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+

## Ring Usage

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following functions are provided to read and write to an array:

    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    static inline void xen_9pfs_read(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *h, size_t len) {
    	if (*masked_cons < *masked_prod) {
    		memcpy(h, buf + *masked_cons, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
    			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
    			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
    		} else {
    			memcpy(h, buf + *masked_cons, len);
    		}
    	}
    	*masked_cons = _MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    static inline void xen_9pfs_write(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *opaque, size_t len) {
    	if (*masked_prod < *masked_cons) {
    		memcpy(buf + *masked_prod, opaque, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
    			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
    			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod)); 
    		} else {
    			memcpy(buf + *masked_prod, opaque, len); 
    		}
    	}
    	*masked_prod = _MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *cons*, *prod* from shared memory
- memory barrier
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- memory barrier
- increase *prod*
- notify the other end via evtchn

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *prod*, *cons* from shared memory
- memory barrier
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- memory barrier
- increase *cons*
- notify the other end via evtchn

The producer takes care of writing only as many bytes as available in the buffer
up to *cons*. The consumer takes care of reading only as many bytes as available
in the buffer up to *prod*.

## Request/Response Workflow

The client chooses one of the available rings, then it sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.

[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 10+ messages in thread