Inter-domain Communication using Virtual Sockets (high-level design)

* Inter-domain Communication using Virtual Sockets (high-level design)
@ 2013-06-11 18:07 David Vrabel
  2013-06-11 18:54 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: David Vrabel @ 2013-06-11 18:07 UTC (permalink / raw)
  To: Xen-devel; +Cc: Vincent Hanquez, Ross Philipson

All,

This is a high-level design document for an inter-domain communication
system under the virtual sockets API (AF_VSOCK) recently added to Linux.

Two low-level transports are discussed: a shared ring based one
requiring no additional hypervisor support and v4v.

The PDF (including the diagrams) is available here:

http://xenbits.xen.org/people/dvrabel/inter-domain-comms-C.pdf

% Inter-domain Communication using Virtual Sockets
% David Vrabel <<david.vrabel@citrix.com>
% Draft C

Introduction
============

Revision History
----------------

--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft C  11 Jun 2013  Minor clarifications.

Draft B  10 Jun 2013  Added a section on the low-level shared ring
transport.

                      Added a section on using v4v as the low-level
transport.

Draft A  28 May 2013  Initial draft.
--------------------------------------------------------------------

Purpose
-------

In the Windsor architecture for XenServer, dom0 is disaggregated into
several _service domains_.  Examples of service domains include
network and storage driver domains, and qemu (stub) domains.

To allow the toolstack to manage service domains there needs to be a
communication mechanism between the toolstack running in one domain and
all the service domains.

The principle focus of this new transport is control-plane traffic
(low latency and low data rates) but consideration is given to future
uses requiring higher data rates.

Linux 3.9 support virtual sockets which is a new type of socket (the
new AF_VSOCK address family) for inter-domain communication.  This was
originally implemented for VMWare's VMCI transport but has hooks for
other transports.  This will be used to provide the interface to
applications.

System Overview
---------------

![\label{fig_overview}System Overview](overview.pdf)

Design Map
----------

The linux kernel requires a Xen-specific virtual socket transport and
front and back drivers.

The connection manager is a new user space daemon running in the
backend domain.

Toolstacks will require changes to allow them to set the policy used
by the connection manager.  The design of these changes is out of
scope of this document.

Definitions and Acronyms
------------------------

_AF\_VSOCK_
  ~ The address family for virtual sockets.

_CID (Context ID)_

  ~ The domain ID portion of the AF_VSOCK address format.

_Port_

  ~ The part of the AF_VSOCK address format identifying a specific
    service. Similar to the port number used in TCP connection.

_Virtual Socket_

  ~ A socket using the AF_VSOCK protocol.

References
----------

[Windsor Architecture slides from XenSummit
2012](http://www.slideshare.net/xen_com_mgr/windsor-domain-0-disaggregation-for-xenserver-and-xcp)

Design Considerations
=====================

Assumptions
-----------

* There exists a low-level peer-to-peer, datagram based transport
  mechanism using shared rings (as in libvchan).

Constraints
-----------

* The AF_VSOCK address format is limited to a 32-bit CID and a 32-bit
  port number.  This is sufficient as Xen only has 16-bit domain IDs.

Risks and Volatile Areas
------------------------

* The transport may be used between untrusted peers.  A domain may be
  subject to malicious activity or denial of service attacks.

Architecture
============

Overview
--------

![\label{fig_architecture}Architecture Overview](architecture.pdf)

Linux's virtual sockets are used as the interface to applications.
Virtual sockets were introduced in Linux 3.9 and provides a hypervisor
independent[^1] interface to user space applications for inter-domain
communication.

[^1]: The API and address format is hypervisor independent but the
address values are not.

An internal API is provided to implement a low-level virtual socket
transport.  This will be implemented within a pair of front and back
drivers.  The use of the standard front/back driver method allows the
toolstack to handle the suspend, resume and migration in a similar way
to the existing drivers.

The front/back pair provides a point-to-point link between the two
domains.  This is used to communicate between applications on those
hosts and between the frontend domain and the _connection manager_
running on the backend.

The connection manager allows domUs to request direct connections to
peer domains.  Without the connection manager, peers have no mechanism
to exchange the information ncessary for setting up the direct
connections. The toolstack sets the policy in the connection manager
to allow connection requests.  The default policy is to deny
connection requests.

High Level Design
=================

Virtual Sockets
---------------

The AF_VSOCK socket address family in the Linux kernel has a two part
address format: a uint32_t _context ID_ (_CID_) identifying the domain
and a uint32_t port for the specific service in that domain.

The CID shall be the domain ID and some CIDs have a specific meaning.

CID                     Purpose
-------------------     -------
0x7FF0 (DOMID_SELF)     The local domain.
0x7FF1                  The backend domain (where the connection manager
is).

Some port numbers are reserved.

Port    Purpose
----    -------
0       Reserved
1       Connection Manager
2-1023  Reserved for well-known services (such as a service discovery
service).

Front / Back Drivers
--------------------

Using a front or back driver to provide the virtual socket transport
allows the toolstack to only make the inter-domain communication
facility available to selected domains.

The "standard" xenbus connection state machine shall be used. See
figures \ref{fig_front-sm} and \ref{fig_back-sm} on pages
\pageref{fig_front-sm} and \pageref{fig_back-sm}.

![\label{fig_front-sm}Frontend Connection State Machine](front-sm.pdf)

![\label{fig_back-sm}Backend Connection State Machine](back-sm.pdf)

Connection Manager
------------------

The connection manager has two main purposes.

1. Checking that two domains are permitted to connect.

2. Providing a mechanism for two domains to exchange the grant
   references and event channels needed for them to setup a shared
   ring transport.

Domains commnicate with the connection manager over the front-back
transport link.  The connection manager must be in the same domain as
the virtual socket backend driver.

The connection manager opens a virtual socket and listens on a well
defined port (port 1).

The following messages are defined.

Message          Purpose
-------          -------
CONNECT_req      Request connection to another peer.
CONNECT_rsp      Response to a connection request.
CONNECT_ind      Indicate that a peer is trying to connect.
CONNECT_ack      Acknowledge a connection request.

![\label{fig_conn-msc}Connect Message Sequence Chart](conn.pdf)

Before forwarding a connection request to a peer, the connection
manager checks that the connection is permitted.  The toolstack sets
these permissions.

Disconnecting transport links to an uncooperative (or dead) domain is
required.  Therefore there are no messages for disconnecting transport
links (as these may be ignore or delayed). Instead a transport link is
disconnected by tearing down the local end. The peer will notice the
remote end going away and then teardown its end.

Low-level transport
===================

[ This exact details are yet to be determined but this section should
  provide a reasonably summary of the mechanisms used. ]

Frontend and backend domains
----------------------------

As is typical for frontend and backend drivers, the frontend will
grant copy-only access to two rings -- one for from-front messages and
one for to-front messages.  Each ring shall have an event channel for
notifying when requests and responses are placed on the ring.

Peer domains
------------

The initiator grants copy-only access to a from-initiator (transmit)
ring and provides an event channel for notifications for this ring.
This information is included in the CONNECT_req and CONNECT_ind
messages.

The responder grants copy-only access to a from-responder (transmit)
ring and provides an event channel for notifications for this ring.
The information is included in the CONNECT_ack and CONNECT_rsp
messages.

After the initial connection, the two domains operate as identical
peers.  Disconnection is signalled by a domain ungranting its transmit
ring, notifying the peer via the associated event channel.  The event
channel is then unbound.

Appendix
========

V4V
---

An alternative low-level transport (V4V) has been proposed.  The
hypervisor copies messages from the source domain into a destination
ring provided by the destination domain.

Because peers are untrusted, in order to prevent them from being able
to denial-of-service the processing of messages from other peers, each
receiver must have a per-peer receive ring.  A listening service does
not know in advance which peers may connect so it cannot create these
rings in advance.

The connection manager service running in a trusted domain (as in the
shared ring transport described above) may be used.  The CONNECT_ind
message is used to trigger the creation of receive ring for that
specific sender.

A peer must be able to find the connection manager service both at
start of day and if the connection manager service is restarted in a
new domain.  This can be done in two possible ways:

1. Watch a Xenstore key which contains the connection manager service
   domain ID.

2. Use a frontend/backend driver pair.

### Advantages

* Does not use grant table resource.  If shared rings are used then a
  busy guest with hundreds of peers will require more grant table
  entries than the current default.

### Disadvantages

* Any changes or extentions to the protocol or ring format would
  require a hypervisor change.  This is more difficult than making
  changes to guests.

* The connection-less, "shared-bus" model of v4v is unsuitable for
  untrusted peers.  This requires layering a connection model on top
  and much of the simplicity of the v4v ABI is lost.

* The mechanism for handling full destination rings will not scale up
  on busy domains.  The event channel only indicates that some ring
  may have space -- it does not identify which ring has space.

^ permalink raw reply	[flat|nested] 10+ messages in thread