Re: [DOC v9] PV Calls protocol design

From: Stefano Stabellini <stefano@aporeto.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: jgross@suse.com, lars.kurth@citrix.com, wei.liu2@citrix.com,
	andrew.cooper3@citrix.com,
	Stefano Stabellini <stefano@aporeto.com>,
	Paul.Durrant@citrix.com, xen-devel@lists.xenproject.org,
	joao.m.martins@oracle.com, boris.ostrovsky@oracle.com,
	roger.pau@citrix.com
Subject: Re: [DOC v9] PV Calls protocol design
Date: Tue, 14 Feb 2017 13:34:52 -0800 (PST)
Message-ID: <alpine.DEB.2.10.1702141333230.6418@sstabellini-ThinkPad-X260>
In-Reply-To: <20170214191938.GB16227@char.us.ORACLE.com>

On Tue, 14 Feb 2017, Konrad Rzeszutek Wilk wrote:
> On Mon, Feb 13, 2017 at 11:46:40AM -0800, Stefano Stabellini wrote:
> > Changes in v9:
> > - specify max-page-order must be >= 1
> > - clarifications
> > - add "Expanding the protocol"
> > - add padding after out_error
> > - add "Why ring.h is not needed"
> 
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thanks! For your convenience:

---

docs: add PV Calls Protocol

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff --git a/docs/misc/pvcalls.markdown b/docs/misc/pvcalls.markdown
new file mode 100644
index 0000000..d3f7f20
--- /dev/null
+++ b/docs/misc/pvcalls.markdown
@@ -0,0 +1,1092 @@
+# PV Calls Protocol version 1
+
+## Glossary
+
+The following is a list of terms and definitions used in the Xen
+community. If you are a Xen contributor you can skip this section.
+
+* PV
+
+  Short for paravirtualized.
+
+* Dom0
+
+  First virtual machine that boots. In most configurations Dom0 is
+  privileged and has control over hardware devices, such as network
+  cards, graphic cards, etc.
+
+* DomU
+
+  Regular unprivileged Xen virtual machine.
+
+* Domain
+
+  A Xen virtual machine. Dom0 and all DomUs are all separate Xen
+  domains.
+
+* Guest
+
+  Same as domain: a Xen virtual machine.
+
+* Frontend
+
+  Each DomU has one or more paravirtualized frontend drivers to access
+  disks, network, console, graphics, etc. The presence of PV devices is
+  advertised on XenStore, a cross-domain key-value database. Frontends
+  are similar in intent to the virtio drivers in Linux.
+
+* Backend
+
+  A Xen paravirtualized backend typically runs in Dom0 and it is used to
+  export disks, network, console, graphics, etc., to DomUs. Backends can
+  live both in kernel space and in userspace. For example xen-blkback
+  lives under drivers/block in the Linux kernel and xen_disk lives under
+  hw/block in QEMU. Paravirtualized backends are similar in intent to
+  virtio device emulators.
+
+* VMX and SVM
+  
+  On Intel processors, VMX is the CPU flag for VT-x, hardware
+  virtualization support. It corresponds to SVM on AMD processors.
+
+
+
+## Rationale
+
+PV Calls is a paravirtualized protocol that allows the implementation of
+a set of POSIX functions in a different domain. The PV Calls frontend
+sends POSIX function calls to the backend, which executes them and
+returns the result to the frontend.
+
+This version of the document covers networking function calls, such as
+connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
+the protocol is meant to be easily extended to cover different sets of
+calls. Unimplemented commands return ENOTSUP.
+
+PV Calls provide the following benefits:
+* full visibility of the guest behavior on the backend domain, allowing
+  for inexpensive filtering and manipulation of any guest calls
+* excellent performance
+
+Specifically, PV Calls for networking offer these advantages:
+* guest networking works out of the box with VPNs, wireless networks and
+  any other complex configurations on the host
+* guest services listen on ports bound directly to the backend domain IP
+  addresses
+* localhost becomes a secure host-wide network for inter-VM
+  communication
+
+
+## Design
+
+### Why Xen?
+
+PV Calls are part of an effort to create a secure runtime environment
+for containers (Open Containers Initiative images to be precise). PV
+Calls are based on Xen, although porting them to other hypervisors is
+possible. Xen was chosen because of its security and isolation
+properties and because it supports PV guests, a type of virtual machine
+that does not require hardware virtualization extensions (VMX on Intel
+processors and SVM on AMD processors). This is important because PV
+Calls is meant for containers and containers are often run on top of
+public cloud instances, which do not support nested VMX (or SVM) as of
+today (early 2017). Xen PV guests are lightweight, minimalist, and do
+not require machine emulation: all properties that make them a good fit
+for this project.
+
+### Xenstore
+
+The frontend and the backend connect via [xenstore] to
+exchange information. The toolstack creates front and back nodes with
+state of [XenbusStateInitialising]. The protocol node name
+is **pvcalls**.  There can only be one PV Calls frontend per domain.
+
+#### Frontend XenBus Nodes
+
+version
+     Values:         <string>
+
+     Protocol version, chosen among the ones supported by the backend
+     (see **versions** under [Backend XenBus Nodes]). Currently the
+     value must be "1".
+
+port
+     Values:         <uint32_t>
+
+     The identifier of the Xen event channel used to signal activity
+     in the command ring.
+
+ring-ref
+     Values:         <uint32_t>
+
+     The Xen grant reference granting permission for the backend to map
+     the sole page in a single page sized command ring.
+
+#### Backend XenBus Nodes
+
+versions
+     Values:         <string>
+
+     List of comma separated protocol versions supported by the backend.
+     For example "1,2,3". Currently the value is just "1", as there is
+     only one version.
+
+max-page-order
+     Values:         <uint32_t>
+
+     The maximum supported size of a memory allocation, expressed as
+     log2(number of machine pages), e.g. 1 = 2 pages, 2 = 4 pages, etc.
+     It must be 1 or more.
+
+function-calls
+     Values:         <uint32_t>
+
+     Value "0" means that no calls are supported.
+     Value "1" means that socket, connect, release, bind, listen, accept
+     and poll are supported.
+
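As an illustration of how a frontend might choose the value it writes to its own **version** node, the sketch below picks the highest version present both in the backend's **versions** string and in a table of versions the frontend implements. The helper name `pvcalls_pick_version` and the policy of preferring the highest common version are assumptions made for this example; the text above only requires that the chosen version be one the backend supports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the protocol): return the highest
 * version that appears both in the backend's comma separated "versions"
 * string and in the frontend's own table, or 0 if there is no match.
 */
static unsigned int pvcalls_pick_version(const char *backend_versions,
                                         const unsigned int *supported,
                                         size_t nr_supported)
{
    unsigned int best = 0;
    const char *p = backend_versions;

    assert(backend_versions != NULL);

    while (*p != '\0') {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)                   /* not a number: stop parsing */
            break;

        for (size_t i = 0; i < nr_supported; i++)
            if (supported[i] == v && v > best)
                best = (unsigned int)v;

        if (*end != ',' && *end != '\0')
            break;                      /* unexpected separator */
        p = (*end == ',') ? end + 1 : end;
    }
    return best;
}
```

With a backend advertising "1" and a frontend that implements versions 1 and 2, the helper returns 1. A return value of 0 means the two sides have no version in common.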
+#### State Machine
+
+Initialization:
+
+    *Front*                               *Back*
+    XenbusStateInitialising               XenbusStateInitialising
+    - Query virtual device                - Query backend device
+      properties.                           identification data.
+    - Setup OS device instance.           - Publish backend features
+    - Allocate and initialize the           and transport parameters
+      request ring.                                      |
+    - Publish transport parameters                       |
+      that will be in effect during                      V
+      this connection.                            XenbusStateInitWait
+                 |
+                 |
+                 V
+       XenbusStateInitialised
+
+                                          - Query frontend transport parameters.
+                                          - Connect to the request ring and
+                                            event channel.
+                                                         |
+                                                         |
+                                                         V
+                                                 XenbusStateConnected
+
+     - Query backend device properties.
+     - Finalize OS virtual device
+       instance.
+                 |
+                 |
+                 V
+        XenbusStateConnected
+
+Once frontend and backend are connected, they have a shared page, which
+is used to exchange messages over a ring, and an event channel,
+which is used to send notifications.
+
+Shutdown:
+
+    *Front*                            *Back*
+    XenbusStateConnected               XenbusStateConnected
+                |
+                |
+                V
+       XenbusStateClosing
+
+                                       - Unmap grants
+                                       - Unbind event channels
+                                                 |
+                                                 |
+                                                 V
+                                         XenbusStateClosing
+
+    - Unbind event channels
+    - Free rings
+    - Free data structures
+               |
+               |
+               V
+       XenbusStateClosed
+
+                                       - Free remaining data structures
+                                                 |
+                                                 |
+                                                 V
+                                         XenbusStateClosed
+
+
+### Commands Ring
+
+The shared ring is used by the frontend to forward POSIX function calls
+to the backend. We shall refer to this ring as **commands ring** to
+distinguish it from other rings which can be created later in the
+lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
+reference of the shared page for this ring is published on xenstore (see
+[Frontend XenBus Nodes]). The ring format is defined using the familiar
+`DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`).  Frontend
+requests are allocated on the ring using the `RING_GET_REQUEST` macro.
+The list of commands below is in calling order.
+
+The format is defined as follows:
+    
+    #define PVCALLS_SOCKET         0
+    #define PVCALLS_CONNECT        1
+    #define PVCALLS_RELEASE        2
+    #define PVCALLS_BIND           3
+    #define PVCALLS_LISTEN         4
+    #define PVCALLS_ACCEPT         5
+    #define PVCALLS_POLL           6
+
+    struct xen_pvcalls_request {
+    	uint32_t req_id; /* private to guest, echoed in response */
+    	uint32_t cmd;    /* command to execute */
+    	union {
+    		struct xen_pvcalls_socket {
+    			uint64_t id;
+    			uint32_t domain;
+    			uint32_t type;
+    			uint32_t protocol;
+    			#ifdef CONFIG_X86_32
+    			uint8_t pad[4];
+    			#endif
+    		} socket;
+    		struct xen_pvcalls_connect {
+    			uint64_t id;
+    			uint8_t addr[28];
+    			uint32_t len;
+    			uint32_t flags;
+    			grant_ref_t ref;
+    			uint32_t evtchn;
+    			#ifdef CONFIG_X86_32
+    			uint8_t pad[4];
+    			#endif
+    		} connect;
+    		struct xen_pvcalls_release {
+    			uint64_t id;
+    			uint8_t reuse;
+    			#ifdef CONFIG_X86_32
+    			uint8_t pad[7];
+    			#endif
+    		} release;
+    		struct xen_pvcalls_bind {
+    			uint64_t id;
+    			uint8_t addr[28];
+    			uint32_t len;
+    		} bind;
+    		struct xen_pvcalls_listen {
+    			uint64_t id;
+    			uint32_t backlog;
+    			#ifdef CONFIG_X86_32
+    			uint8_t pad[4];
+    			#endif
+    		} listen;
+    		struct xen_pvcalls_accept {
+    			uint64_t id;
+    			uint64_t id_new;
+    			grant_ref_t ref;
+    			uint32_t evtchn;
+    		} accept;
+    		struct xen_pvcalls_poll {
+    			uint64_t id;
+    		} poll;
+    		/* dummy member to force sizeof(struct xen_pvcalls_request) to match across archs */
+    		struct xen_pvcalls_dummy {
+    			uint8_t dummy[56];
+    		} dummy;
+    	} u;
+    };
+
+The first two fields are common for every command. Their binary layout
+is:
+
+    0       4       8
+    +-------+-------+
+    |req_id |  cmd  |
+    +-------+-------+
+
+- **req_id** is generated by the frontend and is a cookie used to
+  identify one specific request/response pair of commands. Not to be
+  confused with any command **id**, which is used to identify a socket
+  across multiple commands, see [Socket].
+- **cmd** is the command requested by the frontend:
+
+    - `PVCALLS_SOCKET`:  0
+    - `PVCALLS_CONNECT`: 1
+    - `PVCALLS_RELEASE`: 2
+    - `PVCALLS_BIND`:    3
+    - `PVCALLS_LISTEN`:  4
+    - `PVCALLS_ACCEPT`:  5
+    - `PVCALLS_POLL`:    6
+
+Both fields are echoed back by the backend. See [Socket families and
+address format] for the format of the **addr** field of connect and
+bind. The maximum size of command specific arguments is 56 bytes. Any
+future command that requires more than that will need to bump the
+**version** of the protocol.
+
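The sizes quoted above can be checked at compile time. The following sketch reproduces only the **connect** and **dummy** members of the request (the other members are smaller) and asserts that the command specific arguments occupy 56 bytes after the 8 byte header. It assumes that `grant_ref_t` is a 32-bit integer and leaves out the `CONFIG_X86_32` padding; `pvcalls_connect_ref_offset` is a hypothetical name used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: grant_ref_t is a 32-bit integer. */
typedef uint32_t grant_ref_t;

/* Reduced copy of the request: only the connect and dummy members. */
struct xen_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct xen_pvcalls_connect {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct xen_pvcalls_dummy {
            uint8_t dummy[56];
        } dummy;
    } u;
};

/* The header is 8 bytes and the union 56, for 64 bytes per request. */
_Static_assert(offsetof(struct xen_pvcalls_request, u) == 8,
               "command arguments start right after req_id and cmd");
_Static_assert(sizeof(((struct xen_pvcalls_request *)0)->u) == 56,
               "command arguments are limited to 56 bytes");
_Static_assert(sizeof(struct xen_pvcalls_request) == 64,
               "request size must match across architectures");

/* Offset of connect's ref field from the start of the request. */
static size_t pvcalls_connect_ref_offset(void)
{
    return offsetof(struct xen_pvcalls_request, u.connect.ref);
}
```

If a new member pushed the union past 56 bytes, the second assertion would fail at build time, which is the point at which the text above calls for a version bump.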
+Similarly to other Xen ring based protocols, after writing a request to
+the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
+issues an event channel notification when a notification is required.
+
+Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
+The format is the following:
+
+    struct xen_pvcalls_response {
+        uint32_t req_id;
+        uint32_t cmd;
+        int32_t ret;
+        uint32_t pad;
+        union {
+    		struct _xen_pvcalls_socket {
+    			uint64_t id;
+    		} socket;
+    		struct _xen_pvcalls_connect {
+    			uint64_t id;
+    		} connect;
+    		struct _xen_pvcalls_release {
+    			uint64_t id;
+    		} release;
+    		struct _xen_pvcalls_bind {
+    			uint64_t id;
+    		} bind;
+    		struct _xen_pvcalls_listen {
+    			uint64_t id;
+    		} listen;
+    		struct _xen_pvcalls_accept {
+    			uint64_t id;
+    		} accept;
+    		struct _xen_pvcalls_poll {
+    			uint64_t id;
+    		} poll;
+    		struct _xen_pvcalls_dummy {
+    			uint8_t dummy[8];
+    		} dummy;
+    	} u;
+    };
+
+The first four fields are common for every response. Their binary layout
+is:
+
+    0       4       8       12      16
+    +-------+-------+-------+-------+
+    |req_id |  cmd  |  ret  |  pad  |
+    +-------+-------+-------+-------+
+
+- **req_id**: echoed back from request
+- **cmd**: echoed back from request
+- **ret**: return value, identifies success (0) or failure (see [Error
+  numbers] in further sections). If the **cmd** is not supported by the
+  backend, ret is ENOTSUP.
+- **pad**: padding
+
+After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
+it needs to notify the frontend and does so via event channel.
+
+A description of each command and of its additional request and response
+fields follows.
+
+
+#### Socket
+
+The **socket** operation corresponds to the POSIX [socket][socket]
+function. It creates a new socket of the specified family, type and
+protocol. **id** is freely chosen by the frontend and references this
+specific socket from this point forward. See [Socket families and
+address format] to see which ones are supported by different versions of
+the protocol.
+
+Request fields:
+
+- **cmd** value: 0
+- additional fields:
+  - **id**: generated by the frontend, it identifies the new socket
+  - **domain**: the communication domain
+  - **type**: the socket type
+  - **protocol**: the particular protocol to be used with the socket, usually 0
+
+Request binary layout:
+
+    8       12      16      20      24      28
+    +-------+-------+-------+-------+-------+
+    |       id      |domain | type  |protoco|
+    +-------+-------+-------+-------+-------+
+
+Response additional fields:
+
+- **id**: echoed back from request
+
+Response binary layout:
+
+    16      20      24
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX socket function][socket] for error names; see
+    [Error numbers] in further sections.
+
+#### Connect
+
+The **connect** operation corresponds to the POSIX [connect][connect]
+function. It connects a previously created socket (identified by **id**)
+to the specified address.
+
+The connect operation creates a new shared ring, which we'll call **data
+ring**. The data ring is used to send and receive data from the
+socket. The connect operation passes two additional parameters:
+**evtchn** and **ref**. **evtchn** is the port number of a new event
+channel which will be used for notifications of activity on the data
+ring. **ref** is the grant reference of the **indexes page**: a page
+which contains shared indexes that point to the write and read locations
+in the **data ring**. The **indexes page** also contains the full array
+of grant references for the **data ring**. When the frontend issues a
+**connect** command, the backend:
+
+- finds its own internal socket corresponding to **id**
+- connects the socket to **addr**
+- maps the grant reference **ref**, the indexes page, see struct
+  pvcalls_data_intf
+- maps all the grant references listed in `struct pvcalls_data_intf` and
+  uses them as shared memory for the **data ring**
+- binds the **evtchn**
+- replies to the frontend
+
+The [Indexes Page and Data ring] format will be described in the
+following section. The **data ring** is unmapped and freed upon issuing
+a **release** command on the active socket identified by **id**. A
+frontend state change can also cause data rings to be unmapped.
+
+Request fields:
+
+- **cmd** value: 1
+- additional fields:
+  - **id**: identifies the socket
+  - **addr**: address to connect to, see [Socket families and address format]
+  - **len**: address length up to 28 octets
+  - **flags**: flags for the connection, reserved for future usage
+  - **ref**: grant reference of the indexes page
+  - **evtchn**: port number of the evtchn to signal activity on the **data ring**
+
+Request binary layout:
+
+    8       12      16      20      24      28      32      36      40      44
+    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
+    |       id      |                            addr                       |
+    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
+    | len   | flags |  ref  |evtchn |
+    +-------+-------+-------+-------+
+
+Response additional fields:
+
+- **id**: echoed back from request
+
+Response binary layout:
+
+    16      20      24
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX connect function][connect] for error names; see
+    [Error numbers] in further sections.
+
+#### Release
+
+The **release** operation closes an existing active or passive socket.
+
+When a release command is issued on a passive socket, the backend
+releases it and frees its internal mappings. When a release command is
+issued for an active socket, the data ring and indexes page are also
+unmapped and freed:
+
+- frontend sends release command for an active socket
+- backend releases the socket
+- backend unmaps the data ring
+- backend unmaps the indexes page
+- backend unbinds the event channel
+- backend replies to frontend with a **ret** value
+- frontend frees data ring, indexes page and unbinds event channel
+
+Request fields:
+
+- **cmd** value: 2
+- additional fields:
+  - **id**: identifies the socket
+  - **reuse**: an optimization hint for the backend. The field is
+    ignored for passive sockets. When set to 1, the frontend lets the
+    backend know that it will reuse exactly the same set of grant pages
+    (indexes page and data ring) and event channel when creating one of
+    the next active sockets. The backend can take advantage of it by
+    delaying unmapping grants and unbinding the event channel. The
+    backend is free to ignore the hint. Reused data rings are found by
+    **ref**, the grant reference of the page containing the indexes.
+
+Request binary layout:
+
+    8       12      16    17
+    +-------+-------+-----+
+    |       id      |reuse|
+    +-------+-------+-----+
+
+Response additional fields:
+
+- **id**: echoed back from request
+
+Response binary layout:
+
+    16      20      24
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX shutdown function][shutdown] for error names; see
+    [Error numbers] in further sections.
+
+#### Bind
+
+The **bind** operation corresponds to the POSIX [bind][bind] function.
+It assigns the address passed as parameter to a previously created
+socket, identified by **id**. **Bind**, **listen** and **accept** are
+the three operations required to have fully working passive sockets and
+should be issued in that order.
+
+Request fields:
+
+- **cmd** value: 3
+- additional fields:
+  - **id**: identifies the socket
+  - **addr**: address to bind to, see [Socket families and address
+    format]
+  - **len**: address length up to 28 octets
+
+Request binary layout:
+
+    8       12      16      20      24      28      32      36      40      44
+    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
+    |       id      |                            addr                       |
+    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
+    |  len  |
+    +-------+
+
+Response additional fields:
+
+- **id**: echoed back from request
+
+Response binary layout:
+
+    16      20      24
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX bind function][bind] for error names; see
+    [Error numbers] in further sections.
+
+
+#### Listen
+
+The **listen** operation marks the socket as a passive socket. It corresponds to
+the [POSIX listen function][listen].
+
+Request fields:
+
+- **cmd** value: 4
+- additional fields:
+  - **id**: identifies the socket
+  - **backlog**: the maximum length to which the queue of pending
+    connections may grow in number of elements
+
+Request binary layout:
+
+    8       12      16      20
+    +-------+-------+-------+
+    |       id      |backlog|
+    +-------+-------+-------+
+
+Response additional fields:
+
+- **id**: echoed back from request
+
+Response binary layout:
+
+    16      20      24
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX listen function][listen] for error names; see
+    [Error numbers] in further sections.
+
+
+#### Accept
+
+The **accept** operation extracts the first connection request on the
+queue of pending connections for the listening socket identified by
+**id** and creates a new connected socket. The id of the new socket is
+also chosen by the frontend and passed as an additional field of the
+accept request struct (**id_new**). See the [POSIX accept function][accept]
+as reference.
+
+Similarly to the **connect** operation, **accept** creates new [Indexes
+Page and Data ring]. The **data ring** is used to send and receive data from
+the socket. The **accept** operation passes two additional parameters:
+**evtchn** and **ref**. **evtchn** is the port number of a new event
+channel which will be used for notifications of activity on the data
+ring. **ref** is the grant reference of the **indexes page**: a page
+which contains shared indexes that point to the write and read locations
+in the **data ring**. The **indexes page** also contains the full array of
+grant references for the **data ring**.
+
+The backend will reply to the request only when a new connection is
+successfully accepted, i.e. the backend does not return EAGAIN or
+EWOULDBLOCK.
+
+Example workflow:
+
+- frontend issues an **accept** request
+- backend waits for a connection to be available on the socket
+- a new connection becomes available
+- backend accepts the new connection
+- backend creates an internal mapping from **id_new** to the new socket
+- backend maps the grant reference **ref**, the indexes page, see struct
+  pvcalls_data_intf
+- backend maps all the grant references listed in `struct
+  pvcalls_data_intf` and uses them as shared memory for the new data
+  ring **in** and **out** arrays
+- backend binds to the **evtchn**
+- backend replies to the frontend with a **ret** value
+
+Request fields:
+
+- **cmd** value: 5
+- additional fields:
+  - **id**: id of listening socket
+  - **id_new**: id of the new socket
+  - **ref**: grant reference of the indexes page
+  - **evtchn**: port number of the evtchn to signal activity on the data ring
+
+Request binary layout:
+
+    8       12      16      20      24      28      32
+    +-------+-------+-------+-------+-------+-------+
+    |       id      |    id_new     |  ref  |evtchn |
+    +-------+-------+-------+-------+-------+-------+
+
+Response additional fields:
+
+- **id**: id of the listening socket, echoed back from request
+
+Response binary layout:
+
+    16      20      24
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX accept function][accept] for error names; see
+    [Error numbers] in further sections.
+
+
+#### Poll
+
+In this version of the protocol, the **poll** operation is only valid
+for passive sockets. For active sockets, the frontend should look at the
+indexes on the **indexes page**. When a new connection is available in
+the queue of the passive socket, the backend generates a response and
+notifies the frontend.
+
+Request fields:
+
+- **cmd** value: 6
+- additional fields:
+  - **id**: identifies the listening socket
+
+Request binary layout:
+
+    8       12      16
+    +-------+-------+
+    |       id      |
+    +-------+-------+
+
+
+Response additional fields:
+
+- **id**: echoed back from request
+
+Response binary layout:
+
+    16       20       24
+    +--------+--------+
+    |        id       |
+    +--------+--------+
+
+Return value:
+
+  - 0 on success
+  - See the [POSIX poll function][poll] for error names; see
+    [Error numbers] in further sections.
+
+#### Expanding the protocol
+
+It is possible to introduce new commands without changing the protocol
+ABI. Naturally, a feature flag among the backend xenstore nodes should
+advertise the availability of a new set of commands.
+
+If a new command requires parameters in struct xen_pvcalls_request
+larger than 56 bytes, which is the current size of the request, then the
+protocol version should be increased. One way to implement the large
+request structure without disrupting the current ABI is to introduce a
+new command, such as PVCALLS_CONNECT_EXTENDED, and a flag to specify
+that the request uses two request slots, for a total of 112 bytes.
+
+#### Error numbers
+
+The numbers corresponding to the error names specified by POSIX are:
+
+    [EPERM]         -1
+    [ENOENT]        -2
+    [ESRCH]         -3
+    [EINTR]         -4
+    [EIO]           -5
+    [ENXIO]         -6
+    [E2BIG]         -7
+    [ENOEXEC]       -8
+    [EBADF]         -9
+    [ECHILD]        -10
+    [EAGAIN]        -11
+    [EWOULDBLOCK]   -11
+    [ENOMEM]        -12
+    [EACCES]        -13
+    [EFAULT]        -14
+    [EBUSY]         -16
+    [EEXIST]        -17
+    [EXDEV]         -18
+    [ENODEV]        -19
+    [EISDIR]        -21
+    [EINVAL]        -22
+    [ENFILE]        -23
+    [EMFILE]        -24
+    [ENOSPC]        -28
+    [EROFS]         -30
+    [EMLINK]        -31
+    [EDOM]          -33
+    [ERANGE]        -34
+    [EDEADLK]       -35
+    [EDEADLOCK]     -35
+    [ENAMETOOLONG]  -36
+    [ENOLCK]        -37
+    [ENOSYS]        -38
+    [ENOTEMPTY]     -39
+    [ENODATA]       -61
+    [ETIME]         -62
+    [EBADMSG]       -74
+    [EOVERFLOW]     -75
+    [EILSEQ]        -84
+    [ERESTART]      -85
+    [ENOTSOCK]      -88
+    [EOPNOTSUPP]    -95
+    [EAFNOSUPPORT]  -97
+    [EADDRINUSE]    -98
+    [EADDRNOTAVAIL] -99
+    [ENOBUFS]       -105
+    [EISCONN]       -106
+    [ENOTCONN]      -107
+    [ETIMEDOUT]     -110
+    [ENOTSUP]       -524
+
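The values in this table are fixed by the protocol, so a frontend cannot assume that they coincide with the error constants of its own C library. A minimal sketch of a translation step follows, assuming a hypothetical helper `pvcalls_ret_to_errno` and covering only a handful of entries. Mapping unknown values to `EIO` is a choice made for this example, not something the protocol specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: translate a ret value from the table above into
 * the host's own errno constant. Returns 0 for success and EIO for
 * values this sketch does not know about.
 */
static int pvcalls_ret_to_errno(int32_t ret)
{
    switch (ret) {
    case 0:    return 0;
    case -11:  return EAGAIN;
    case -12:  return ENOMEM;
    case -22:  return EINVAL;
    case -88:  return ENOTSOCK;
    case -97:  return EAFNOSUPPORT;
    case -98:  return EADDRINUSE;
    case -107: return ENOTCONN;
    case -110: return ETIMEDOUT;
    case -524: return ENOTSUP;
    default:   return EIO;
    }
}
```

A frontend would call this on the **ret** field of a response before handing the error to its caller.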
+#### Socket families and address format
+
+The following definitions and explicit sizes, together with POSIX
+[sys/socket.h][address] and [netinet/in.h][in] define socket families and
+address format. Please be aware that only the **domain** `AF_INET`, **type**
+`SOCK_STREAM` and **protocol** `0` are supported by this version of the
+specification; others return ENOTSUP.
+
+    #define AF_UNSPEC   0
+    #define AF_UNIX     1   /* Unix domain sockets      */
+    #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
+    #define AF_INET     2   /* Internet IP Protocol     */
+    #define AF_INET6    10  /* IP version 6         */
+
+    #define SOCK_STREAM 1
+    #define SOCK_DGRAM  2
+    #define SOCK_RAW    3
+
+    /* generic address format */
+    struct sockaddr {
+        uint16_t sa_family_t;
+        char sa_data[26];
+    };
+
+    struct in_addr {
+        uint32_t s_addr;
+    };
+
+    /* AF_INET address format */
+    struct sockaddr_in {
+        uint16_t         sa_family_t;
+        uint16_t         sin_port;
+        struct in_addr   sin_addr;
+        char             sin_zero[20];
+    };
+
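To make the `AF_INET` layout concrete, the sketch below serialises an IPv4 address and port into the 28 byte **addr** field used by connect and bind. It assumes, following POSIX `sockaddr_in`, that the port and address are stored in network byte order while the family is in host byte order, and that **len** is the full 28 bytes of the structure defined above. `pvcalls_fill_addr_in` and the `PVCALLS_` constants are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PVCALLS_AF_INET   2   /* value of AF_INET listed above */
#define PVCALLS_ADDR_LEN  28  /* size of the addr field in connect and bind */

/*
 * Hypothetical helper: fill addr with an AF_INET address. The IP address
 * and port are given in host byte order and written in network byte
 * order. Returns the value for the accompanying len field.
 */
static uint32_t pvcalls_fill_addr_in(uint8_t addr[PVCALLS_ADDR_LEN],
                                     uint32_t ip_host_order,
                                     uint16_t port_host_order)
{
    const uint16_t family = PVCALLS_AF_INET;

    memset(addr, 0, PVCALLS_ADDR_LEN);          /* also clears sin_zero */
    memcpy(addr, &family, sizeof(family));      /* offset 0: family */
    addr[2] = (uint8_t)(port_host_order >> 8);  /* offset 2: sin_port */
    addr[3] = (uint8_t)(port_host_order & 0xff);
    addr[4] = (uint8_t)(ip_host_order >> 24);   /* offset 4: sin_addr */
    addr[5] = (uint8_t)(ip_host_order >> 16);
    addr[6] = (uint8_t)(ip_host_order >> 8);
    addr[7] = (uint8_t)(ip_host_order);
    return PVCALLS_ADDR_LEN;
}
```

For 127.0.0.1 port 8080, bytes 2 and 3 become 0x1f 0x90 and bytes 4 to 7 become 127, 0, 0, 1; the remaining 20 bytes stay zero.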
+
+### Indexes Page and Data ring
+
+Data rings are used for sending and receiving data over a connected socket. They
+are created upon a successful **accept** or **connect** command.
+The **sendmsg** and **recvmsg** calls are implemented by sending data and
+receiving data from a data ring, and updating the corresponding indexes
+on the **indexes page**.
+
+Firstly, the **indexes page** is shared by a **connect** or **accept**
+command, see **ref** parameter in their sections. The content of the
+**indexes page** is represented by `struct pvcalls_data_intf`, see
+below. The structure contains the list of grant references which
+constitute the **in** and **out** buffers of the data ring, see ref[]
+below. The backend maps the grant references contiguously. Of the
+resulting shared memory, the first half is dedicated to the **in** array
+and the second half to the **out** array. They are used as circular
+buffers for transferring data, and, together, they are the data ring.
+
+
+  +---------------------------+                 Indexes page
+  | Command ring:             |                 +----------------------+
+  | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
+  | @44: ref  +-------------------------------->+@128: ring_order = 1  |
+  |                           |                 |@132: ref[0]+         |
+  +---------------------------+                 |@136: ref[1]+         |
+                                                |            |         |
+                                                |            |         |
+                                                +----------------------+
+                                                             |
+                                                             v (data ring)
+                                                    +--------+----------+
+                                                    |  @0->4095: in     |
+                                                    |  ref[0]           |
+                                                    |-------------------|
+                                                    |  @4096->8191: out |
+                                                    |  ref[1]           |
+                                                    +-------------------+
+ 
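The split of the mapped pages into the two arrays can be worked out from the ring order alone. The sketch below does so for a given `ring_order`, assuming 4 KiB pages; the structure and function names are hypothetical, and the upper bound of 16 on the order is an arbitrary sanity cap for this example rather than a protocol limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PVCALLS_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define PVCALLS_ORDER_CAP  16   /* arbitrary sanity cap for this sketch */

struct pvcalls_ring_geometry {
    size_t nr_refs;     /* number of entries in ref[] */
    size_t array_size;  /* size of in, and of out, in bytes */
    size_t in_offset;   /* offset of in within the contiguous mapping */
    size_t out_offset;  /* offset of out within the contiguous mapping */
};

/* Returns 0 on success, -1 if ring_order is outside [1, max_page_order]. */
static int pvcalls_ring_layout(uint32_t ring_order, uint32_t max_page_order,
                               struct pvcalls_ring_geometry *g)
{
    size_t total;

    if (ring_order < 1 || ring_order > max_page_order ||
        ring_order > PVCALLS_ORDER_CAP)
        return -1;

    total = ((size_t)1 << ring_order) << PVCALLS_PAGE_SHIFT;
    g->nr_refs = (size_t)1 << ring_order;
    g->array_size = total / 2;      /* first half in, second half out */
    g->in_offset = 0;
    g->out_offset = total / 2;
    return 0;
}
```

For `ring_order = 1` this gives two grant references, with **in** covering bytes 0 to 4095 and **out** covering bytes 4096 to 8191 of the mapping.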
+
+#### Indexes Page Structure
+
+    typedef uint32_t PVCALLS_RING_IDX;
+
+    struct pvcalls_data_intf {
+    	PVCALLS_RING_IDX in_cons, in_prod;
+    	int32_t in_error;
+
+    	uint8_t pad1[52];
+
+    	PVCALLS_RING_IDX out_cons, out_prod;
+    	int32_t out_error;
+
+    	uint8_t pad2[52];
+
+    	uint32_t ring_order;
+    	grant_ref_t ref[];
+    };
+
+    /* not actually C compliant (ring_order changes from socket to socket) */
+    struct pvcalls_data {
+        char in[((1<<ring_order)<<PAGE_SHIFT)/2];
+        char out[((1<<ring_order)<<PAGE_SHIFT)/2];
+    };
+
+- **ring_order**
+  It represents the order of the data ring. The following list of grant
+  references is of `(1 << ring_order)` elements. It cannot be greater than
+  **max-page-order**, as specified by the backend on XenBus. It has to
+  be one at minimum.
+- **ref[]**
+  The list of grant references which will contain the actual data. They are
+  mapped contiguously in virtual memory. The first half of the pages is the
+  **in** array, the second half is the **out** array. The arrays must
+  have a power of two size. Together, their size is `(1 << ring_order) *
+  PAGE_SIZE`.
+- **in** is an array used as circular buffer
+  It contains data read from the socket. The producer is the backend, the
+  consumer is the frontend.
+- **out** is an array used as circular buffer
+  It contains data to be written to the socket. The producer is the frontend,
+  the consumer is the backend.
+- **in_cons** and **in_prod**
+  Consumer and producer indexes for data read from the socket. They keep track
+  of how much data has already been consumed by the frontend from the **in**
+  array. **in_prod** is increased by the backend, after writing data to **in**.
+  **in_cons** is increased by the frontend, after reading data from **in**.
+- **out_cons**, **out_prod**
+  Consumer and producer indexes for the data to be written to the socket. They
+  keep track of how much data has been written by the frontend to **out** and
+  how much data has already been consumed by the backend. **out_prod** is
+  increased by the frontend, after writing data to **out**. **out_cons** is
+  increased by the backend, after reading data from **out**.
+- **in_error** and **out_error**
+  They signal errors when reading from the socket (**in_error**) or when
+  writing to the socket (**out_error**). 0 means no errors. When an error
+  occurs, no further read or write operations are performed on the socket.
+  In the case of an orderly socket shutdown (i.e. read returns 0)
+  **in_error** is set to ENOTCONN. **in_error** and **out_error** are never
+  set to EAGAIN or EWOULDBLOCK (the data is written to the ring as soon as
+  it is available).
+
+The binary layout of `struct pvcalls_data_intf` follows:
+
+    0         4         8         12           64        68        72        76 
+    +---------+---------+---------+-----//-----+---------+---------+---------+
+    | in_cons | in_prod |in_error |  padding   |out_cons |out_prod |out_error|
+    +---------+---------+---------+-----//-----+---------+---------+---------+
+
+    76           128       132       136       140       4092      4096
+    +-----//-----+---------+---------+---------+----//---+---------+
+    |  padding   |ring_orde|  ref[0] |  ref[1] |         |  ref[N] |
+    +-----//-----+---------+---------+---------+----//---+---------+
+
+**N.B.** A single 4096-byte page has room for at most 991 grant references
+((4096-132)/4). Because the number of references must be a power of two,
+the effective maximum is 512 (ring_order = 9).
+
+#### Data Ring Structure
+
+The binary layout of the data ring follows:
+
+    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
+    +------------//-------------+------------//-------------+
+    |            in             |           out             |
+    +------------//-------------+------------//-------------+
+
+#### Why ring.h is not needed
+
+Many Xen PV protocols use the macros provided by [ring.h] to manage
+their shared ring for communication. PVCalls does not, because the [Data
+Ring Structure] actually comes with two rings: the **in** ring and the
+**out** ring. Each of them is unidirectional, and there is no static
+request size: the producer writes opaque data to the ring. In [ring.h],
+on the other hand, the two directions are combined in one ring, and the
+request size is static and well-known. In PVCalls:
+
+    in  -> backend to frontend only
+    out -> frontend to backend only
+
+In the case of the **in** ring, the frontend is the consumer, and the
+backend is the producer. Everything is the same but mirrored for the
+**out** ring.
+
+The producer, the backend in this case, never reads from the **in**
+ring. In fact, the producer doesn't need any notifications unless the
+ring is full. This version of the protocol doesn't take advantage of it,
+leaving room for optimizations.
+
+On the other hand, the consumer always requires notifications, unless it
+is already actively reading from the ring. The producer can figure it
+out, without any additional fields in the protocol, by comparing the
+indexes at the beginning and the end of the function that writes to the
+ring. This is similar to what [ring.h] does.
+
+#### Workflow
+
+The **in** and **out** arrays are used as circular buffers:
+    
+    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
+    +-----------------------------------+
+    |to consume|    free    |to consume |
+    +-----------------------------------+
+               ^            ^
+               prod         cons
+
+    0                               sizeof(array)
+    +-----------------------------------+
+    |  free    | to consume |   free    |
+    +-----------------------------------+
+               ^            ^
+               cons         prod
+
+The following function is provided to calculate how many bytes are currently
+left unconsumed in an array:
+
+    #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))
+
+    static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX prod,
+    		PVCALLS_RING_IDX cons,
+    		PVCALLS_RING_IDX ring_size)
+    {
+    	PVCALLS_RING_IDX size;
+    
+    	if (prod == cons)
+    		return 0;
+    
+    	prod = _MASK_PVCALLS_IDX(prod, ring_size);
+    	cons = _MASK_PVCALLS_IDX(cons, ring_size);
+    
+    	if (prod == cons)
+    		return ring_size;
+    
+    	if (prod > cons)
+    		size = prod - cons;
+    	else {
+    		size = ring_size - cons;
+    		size += prod;
+    	}
+    	return size;
+    }
+
+The producer (the backend for **in**, the frontend for **out**) writes to the
+array in the following way:
+
+- read *[in|out]_cons*, *[in|out]_prod*, *[in|out]_error* from shared memory
+- general memory barrier
+- return on *[in|out]_error*
+- write to array at position *[in|out]_prod* up to *[in|out]_cons*,
+  wrapping around the circular buffer when necessary
+- write memory barrier
+- increase *[in|out]_prod*
+- notify the other end via evtchn
+
+The consumer (the backend for **out**, the frontend for **in**) reads from the
+array in the following way:
+
+- read *[in|out]_prod*, *[in|out]_cons*, *[in|out]_error* from shared memory
+- read memory barrier
+- return on *[in|out]_error*
+- read from array at position *[in|out]_cons* up to *[in|out]_prod*,
+  wrapping around the circular buffer when necessary
+- general memory barrier
+- increase *[in|out]_cons*
+- notify the other end via evtchn
+
+The producer takes care of writing only as many bytes as available in
+the buffer up to *[in|out]_cons*. The consumer takes care of reading
+only as many bytes as available in the buffer up to *[in|out]_prod*.
+*[in|out]_error* is set by the backend when an error occurs writing to or
+reading from the socket.
+
+
+[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
+[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
+[address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
+[in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
+[socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
+[connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
+[shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
+[bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
+[listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
+[accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
+[poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
+[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD
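
A note on the "Indexes Page Structure" section of the patch: the offsets implied by the struct can be checked mechanically. The following is only a sketch and not part of the patch. It assumes `grant_ref_t` is a 32-bit integer, gives the two padding arrays the hypothetical names `pad1` and `pad2` so that the struct compiles, and introduces a hypothetical helper, `pvcalls_max_ring_order`, that redoes the N.B. arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;
typedef uint32_t grant_ref_t;      /* assumption: 32-bit grant reference */

struct pvcalls_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    int32_t in_error;
    uint8_t pad1[52];              /* hypothetical member name */
    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t out_error;
    uint8_t pad2[52];              /* hypothetical member name */
    uint32_t ring_order;
    grant_ref_t ref[];
};

/* Compile-time checks of the offsets implied by the struct definition. */
_Static_assert(offsetof(struct pvcalls_data_intf, in_error) == 8, "in_error");
_Static_assert(offsetof(struct pvcalls_data_intf, out_cons) == 64, "out_cons");
_Static_assert(offsetof(struct pvcalls_data_intf, out_error) == 72, "out_error");
_Static_assert(offsetof(struct pvcalls_data_intf, ring_order) == 128, "ring_order");
_Static_assert(offsetof(struct pvcalls_data_intf, ref) == 132, "ref");

/*
 * Hypothetical helper: largest ring_order such that (1 << ring_order)
 * grant references still fit in one indexes page of page_size bytes.
 */
static inline uint32_t pvcalls_max_ring_order(size_t page_size)
{
    size_t hdr = offsetof(struct pvcalls_data_intf, ref);
    size_t slots;
    uint32_t order = 0;

    /* The page must hold the header plus at least one reference. */
    assert(page_size >= hdr + sizeof(grant_ref_t));

    slots = (page_size - hdr) / sizeof(grant_ref_t);
    while (((size_t)2 << order) <= slots)
        order++;
    return order;
}
```

For a 4096-byte page there are 991 slots after the 132-byte header, so the helper returns 9 (512 references), which agrees with the N.B. in the patch.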

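The producer and consumer steps in the "Workflow" section of the patch can be sketched in plain C as follows. This is a single-threaded illustration only: `ring_write` and `ring_read` are hypothetical names, the memory barriers and the evtchn notification are reduced to comments, and the error fields are left out. `ring_unconsumed` mirrors `pvcalls_ring_unconsumed` from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Map a free-running index to a position in the array (size is a power of two). */
static PVCALLS_RING_IDX ring_mask(PVCALLS_RING_IDX idx, PVCALLS_RING_IDX ring_size)
{
    return idx & (ring_size - 1);
}

/* Same logic as pvcalls_ring_unconsumed in the patch. */
static PVCALLS_RING_IDX ring_unconsumed(PVCALLS_RING_IDX prod,
                                        PVCALLS_RING_IDX cons,
                                        PVCALLS_RING_IDX ring_size)
{
    if (prod == cons)
        return 0;
    prod = ring_mask(prod, ring_size);
    cons = ring_mask(cons, ring_size);
    if (prod == cons)
        return ring_size;
    return prod > cons ? prod - cons : ring_size - cons + prod;
}

/* Producer: copy at most len bytes into the array; returns bytes written. */
static size_t ring_write(char *array, PVCALLS_RING_IDX ring_size,
                         PVCALLS_RING_IDX *prod, PVCALLS_RING_IDX cons,
                         const char *src, size_t len)
{
    PVCALLS_RING_IDX free_bytes, pos;
    size_t first;

    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
    free_bytes = ring_size - ring_unconsumed(*prod, cons, ring_size);
    pos = ring_mask(*prod, ring_size);

    if (len > free_bytes)
        len = free_bytes;
    first = ring_size - pos;            /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(array + pos, src, first);
    memcpy(array, src + first, len - first);
    /* a real producer issues a write memory barrier here */
    *prod += (PVCALLS_RING_IDX)len;
    /* ... and then notifies the other end via evtchn */
    return len;
}

/* Consumer: copy at most len unconsumed bytes out; returns bytes read. */
static size_t ring_read(const char *array, PVCALLS_RING_IDX ring_size,
                        PVCALLS_RING_IDX prod, PVCALLS_RING_IDX *cons,
                        char *dst, size_t len)
{
    PVCALLS_RING_IDX avail, pos;
    size_t first;

    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
    avail = ring_unconsumed(prod, *cons, ring_size);
    pos = ring_mask(*cons, ring_size);

    if (len > avail)
        len = avail;
    first = ring_size - pos;            /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(dst, array + pos, first);
    memcpy(dst + first, array, len - first);
    /* a real consumer issues a general memory barrier here */
    *cons += (PVCALLS_RING_IDX)len;
    /* ... and then notifies the other end via evtchn */
    return len;
}
```

The indexes are free-running and only masked when the array is accessed, which is why a full ring (masked indexes equal, raw indexes different) can be told apart from an empty one.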