* Inter-domain Communication using Virtual Sockets (high-level design)
@ 2013-06-11 18:07 David Vrabel
  2013-06-11 18:54 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: David Vrabel @ 2013-06-11 18:07 UTC (permalink / raw)
  To: Xen-devel; +Cc: Vincent Hanquez, Ross Philipson

All,

This is a high-level design document for an inter-domain communication
system under the virtual sockets API (AF_VSOCK) recently added to Linux.

Two low-level transports are discussed: a shared ring based one
requiring no additional hypervisor support and v4v.

The PDF (including the diagrams) is available here:

http://xenbits.xen.org/people/dvrabel/inter-domain-comms-C.pdf

% Inter-domain Communication using Virtual Sockets
% David Vrabel <david.vrabel@citrix.com>
% Draft C

Introduction
============

Revision History
----------------

--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft C  11 Jun 2013  Minor clarifications.

Draft B  10 Jun 2013  Added a section on the low-level shared ring
                      transport.

                      Added a section on using v4v as the low-level
                      transport.

Draft A  28 May 2013  Initial draft.
--------------------------------------------------------------------

Purpose
-------

In the Windsor architecture for XenServer, dom0 is disaggregated into
several _service domains_.  Examples of service domains include
network and storage driver domains, and qemu (stub) domains.

To allow the toolstack to manage service domains, there needs to be a
communication mechanism between the toolstack running in one domain and
all the service domains.

The principal focus of this new transport is control-plane traffic
(low latency and low data rates), but consideration is given to future
uses requiring higher data rates.

Linux 3.9 supports virtual sockets, a new type of socket (the new
AF_VSOCK address family) for inter-domain communication.  This was
originally implemented for VMware's VMCI transport but has hooks for
other transports.  This will be used to provide the interface to
applications.


System Overview
---------------

![\label{fig_overview}System Overview](overview.pdf)


Design Map
----------

The Linux kernel requires a Xen-specific virtual socket transport and
front and back drivers.

The connection manager is a new user space daemon running in the
backend domain.

Toolstacks will require changes to allow them to set the policy used
by the connection manager.  The design of these changes is out of
scope of this document.

Definitions and Acronyms
------------------------

_AF\_VSOCK_
  ~ The address family for virtual sockets.

_CID (Context ID)_

  ~ The domain ID portion of the AF_VSOCK address format.

_Port_

  ~ The part of the AF_VSOCK address format identifying a specific
    service.  Similar to the port number used in a TCP connection.

_Virtual Socket_

  ~ A socket using the AF_VSOCK protocol.

References
----------

[Windsor Architecture slides from XenSummit
2012](http://www.slideshare.net/xen_com_mgr/windsor-domain-0-disaggregation-for-xenserver-and-xcp)


Design Considerations
=====================

Assumptions
-----------

* There exists a low-level peer-to-peer, datagram based transport
  mechanism using shared rings (as in libvchan).

Constraints
-----------

* The AF_VSOCK address format is limited to a 32-bit CID and a 32-bit
  port number.  This is sufficient as Xen only has 16-bit domain IDs.

Risks and Volatile Areas
------------------------

* The transport may be used between untrusted peers.  A domain may be
  subject to malicious activity or denial of service attacks.

Architecture
============

Overview
--------

![\label{fig_architecture}Architecture Overview](architecture.pdf)

Linux's virtual sockets are used as the interface to applications.
Virtual sockets were introduced in Linux 3.9 and provide a
hypervisor-independent[^1] interface to user space applications for
inter-domain communication.

[^1]: The API and address format are hypervisor independent but the
address values are not.

An internal API is provided to implement a low-level virtual socket
transport.  This will be implemented within a pair of front and back
drivers.  The use of the standard front/back driver method allows the
toolstack to handle suspend, resume and migration in a similar way to
the existing drivers.

The front/back pair provides a point-to-point link between the two
domains.  This is used to communicate between applications in those
domains and between the frontend domain and the _connection manager_
running on the backend.

The connection manager allows domUs to request direct connections to
peer domains.  Without the connection manager, peers have no mechanism
to exchange the information necessary for setting up the direct
connections. The toolstack sets the policy in the connection manager
to allow connection requests.  The default policy is to deny
connection requests.


High Level Design
=================

Virtual Sockets
---------------

The AF_VSOCK socket address family in the Linux kernel has a two-part
address format: a uint32_t _context ID_ (_CID_) identifying the domain
and a uint32_t port for the specific service in that domain.

The CID shall be the domain ID and some CIDs have a specific meaning.

CID                     Purpose
-------------------     -------
0x7FF0 (DOMID_SELF)     The local domain.
0x7FF1                  The backend domain (where the connection manager is).

Some port numbers are reserved.

Port    Purpose
----    -------
0       Reserved
1       Connection Manager
2-1023  Reserved for well-known services (such as a service discovery service).
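
As a brief illustration of the address format, the sketch below shows a
client connecting a virtual socket to a peer domain.  It assumes the
standard Linux `<linux/vm_sockets.h>` definitions and a libc that
defines `AF_VSOCK`; the domain ID (5) and service port (1024) are
hypothetical examples, not values defined by this design.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
    struct sockaddr_vm addr;
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid    = 5;     /* CID: domain ID of the peer (example) */
    addr.svm_port   = 1024;  /* service port in that domain (example) */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    /* ... send()/recv() as with any other stream socket ... */
    close(fd);
    return 0;
}
```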

Front / Back Drivers
--------------------

Using a front or back driver to provide the virtual socket transport
allows the toolstack to only make the inter-domain communication
facility available to selected domains.

The "standard" xenbus connection state machine shall be used. See
figures \ref{fig_front-sm} and \ref{fig_back-sm} on pages
\pageref{fig_front-sm} and \pageref{fig_back-sm}.

![\label{fig_front-sm}Frontend Connection State Machine](front-sm.pdf)

![\label{fig_back-sm}Backend Connection State Machine](back-sm.pdf)


Connection Manager
------------------

The connection manager has two main purposes.

1. Checking that two domains are permitted to connect.

2. Providing a mechanism for two domains to exchange the grant
   references and event channels needed for them to setup a shared
   ring transport.

Domains communicate with the connection manager over the front-back
transport link.  The connection manager must be in the same domain as
the virtual socket backend driver.

The connection manager opens a virtual socket and listens on a
well-defined port (port 1).
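
A minimal sketch of such a listener is shown below (same assumptions as
the earlier client example; error handling and the message processing
loop are omitted):

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/* Open the connection manager's listening socket on port 1. */
static int cm_listen(void)
{
    struct sockaddr_vm addr;
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid    = VMADDR_CID_ANY;  /* accept connections from any domain */
    addr.svm_port   = 1;               /* the well-defined port */

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 8);

    /* Accept a connection from a domain's vsock transport. */
    return accept(fd, NULL, NULL);
}
```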

The following messages are defined.

Message          Purpose
-------          -------
CONNECT_req      Request connection to another peer.
CONNECT_rsp      Response to a connection request.
CONNECT_ind      Indicate that a peer is trying to connect.
CONNECT_ack      Acknowledge a connection request.
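
The wire format of these messages is not specified in this draft.
Purely as an illustration, a message layout carrying the information
described in the low-level transport section below might look something
like this (all names and field sizes are hypothetical):

```c
#include <stdint.h>

/* Hypothetical message types; the values are illustrative only. */
enum connect_msg_type {
    CONNECT_req = 1,
    CONNECT_rsp = 2,
    CONNECT_ind = 3,
    CONNECT_ack = 4,
};

/* Hypothetical fixed-size message used to broker a direct connection. */
struct connect_msg {
    uint32_t type;        /* one of enum connect_msg_type */
    uint32_t peer_cid;    /* domain ID of the other peer */
    uint32_t peer_port;   /* service port being connected to */
    uint32_t grant_ref;   /* copy-only grant of the sender's transmit ring */
    uint32_t evtchn_port; /* event channel for notifications on that ring */
    uint32_t status;      /* result code in CONNECT_rsp and CONNECT_ack */
};
```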

![\label{fig_conn-msc}Connect Message Sequence Chart](conn.pdf)

Before forwarding a connection request to a peer, the connection
manager checks that the connection is permitted.  The toolstack sets
these permissions.

It must be possible to disconnect transport links to an uncooperative
(or dead) domain.  Therefore there are no messages for disconnecting
transport links (as these could be ignored or delayed).  Instead, a
transport link is disconnected by tearing down the local end.  The peer
will notice the remote end going away and then tear down its own end.

Low-level transport
===================

[ The exact details are yet to be determined but this section should
  provide a reasonable summary of the mechanisms used. ]

Frontend and backend domains
----------------------------

As is typical for frontend and backend drivers, the frontend will
grant copy-only access to two rings -- one for from-front messages and
one for to-front messages.  Each ring shall have an event channel for
notifying when requests and responses are placed on the ring.
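
The ring format itself is not defined here.  As a sketch only, each
ring might consist of a small header with producer and consumer indices
followed by the message data (the structure and field names below are
illustrative, not part of this design):

```c
#include <stdint.h>

/*
 * Illustrative layout of one direction of the transport.  The sender
 * places datagrams into 'data' and advances 'prod'; the receiver
 * consumes them and advances 'cons'.  Whichever side did not grant
 * the pages accesses them using GNTTABOP_copy.  Each index update is
 * signalled over the ring's associated event channel.
 */
struct vsock_ring_header {
    uint32_t prod;   /* producer index, written by the sending domain */
    uint32_t cons;   /* consumer index, written by the receiving domain */
    uint32_t size;   /* size of the data area in bytes */
    uint8_t  data[]; /* datagram payloads follow the header */
};
```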

Peer domains
------------

The initiator grants copy-only access to a from-initiator (transmit)
ring and provides an event channel for notifications for this ring.
This information is included in the CONNECT_req and CONNECT_ind
messages.

The responder grants copy-only access to a from-responder (transmit)
ring and provides an event channel for notifications for this ring.
This information is included in the CONNECT_ack and CONNECT_rsp
messages.

After the initial connection, the two domains operate as identical
peers.  Disconnection is signalled by a domain ungranting its transmit
ring, notifying the peer via the associated event channel.  The event
channel is then unbound.

Appendix
========

V4V
---

An alternative low-level transport (V4V) has been proposed.  The
hypervisor copies messages from the source domain into a destination
ring provided by the destination domain.

Because peers are untrusted, each receiver must have a per-peer receive
ring to prevent one peer from denying service to the processing of
messages from other peers.  A listening service does not know in
advance which peers may connect, so it cannot create these rings in
advance.

The connection manager service running in a trusted domain (as in the
shared ring transport described above) may be used.  The CONNECT_ind
message is used to trigger the creation of a receive ring for that
specific sender.

A peer must be able to find the connection manager service both at
start of day and if the connection manager service is restarted in a
new domain.  This can be done in two possible ways:

1. Watch a Xenstore key which contains the connection manager service
   domain ID.

2. Use a frontend/backend driver pair.

### Advantages

* Does not use grant table resources.  If shared rings are used then a
  busy guest with hundreds of peers will require more grant table
  entries than the current default.

### Disadvantages

* Any changes or extensions to the protocol or ring format would
  require a hypervisor change.  This is more difficult than making
  changes to guests.

* The connection-less, "shared-bus" model of v4v is unsuitable for
  untrusted peers.  This requires layering a connection model on top
  and much of the simplicity of the v4v ABI is lost.

* The mechanism for handling full destination rings will not scale up
  on busy domains.  The event channel only indicates that some ring
  may have space -- it does not identify which ring has space.


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-11 18:07 Inter-domain Communication using Virtual Sockets (high-level design) David Vrabel
@ 2013-06-11 18:54 ` Andrew Cooper
  2013-06-13 16:27 ` Tim Deegan
  2013-10-30 14:51 ` David Vrabel
  2 siblings, 0 replies; 10+ messages in thread
From: Andrew Cooper @ 2013-06-11 18:54 UTC (permalink / raw)
  To: David Vrabel; +Cc: Vincent Hanquez, Ross Philipson, Xen-devel

On 11/06/13 19:07, David Vrabel wrote:
> All,
>
> This is a high-level design document for an inter-domain communication
> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
>
> Two low-level transports are discussed: a shared ring based one
> requiring no additional hypervisor support and v4v.
>
> The PDF (including the diagrams) is available here:
>
> http://xenbits.xen.org/people/dvrabel/inter-domain-comms-C.pdf
>
> % Inter-domain Communication using Virtual Sockets
> % David Vrabel <<david.vrabel@citrix.com>

Mismatched angles.

> % Draft C
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft C  11 Jun 2013  Minor clarifications.
>
> Draft B  10 Jun 2013  Added a section on the low-level shared ring
> transport.
>
>                       Added a section on using v4v as the low-level
> transport.
>
> Draft A  28 May 2013  Initial draft.
> --------------------------------------------------------------------
>
> Purpose
> -------
>
> In the Windsor architecture for XenServer, dom0 is disaggregated into
> several _service domains_.  Examples of service domains include
> network and storage driver domains, and qemu (stub) domains.
>
> To allow the toolstack to manage service domains there needs to be a
> communication mechanism between the toolstack running in one domain and
> all the service domains.
>
> The principle focus of this new transport is control-plane traffic
> (low latency and low data rates) but consideration is given to future
> uses requiring higher data rates.
>
> Linux 3.9 support virtual sockets which is a new type of socket (the
> new AF_VSOCK address family) for inter-domain communication.  This was
> originally implemented for VMWare's VMCI transport but has hooks for
> other transports.  This will be used to provide the interface to
> applications.
>
>
> System Overview
> ---------------
>
> ![\label{fig_overview}System Overview](overview.pdf)
>
>
> Design Map
> ----------
>
> The linux kernel requires a Xen-specific virtual socket transport and
> front and back drivers.
>
> The connection manager is a new user space daemon running in the
> backend domain.
>
> Toolstacks will require changes to allow them to set the policy used
> by the connection manager.  The design of these changes is out of
> scope of this document.
>
> Definitions and Acronyms
> ------------------------
>
> _AF\_VSOCK_
>   ~ The address family for virtual sockets.
>
> _CID (Context ID)_
>
>   ~ The domain ID portion of the AF_VSOCK address format.
>
> _Port_
>
>   ~ The part of the AF_VSOCK address format identifying a specific
>     service. Similar to the port number used in TCP connection.
>
> _Virtual Socket_
>
>   ~ A socket using the AF_VSOCK protocol.
>
> References
> ----------
>
> [Windsor Architecture slides from XenSummit
> 2012](http://www.slideshare.net/xen_com_mgr/windsor-domain-0-disaggregation-for-xenserver-and-xcp)
>
>
> Design Considerations
> =====================
>
> Assumptions
> -----------
>
> * There exists a low-level peer-to-peer, datagram based transport
>   mechanism using shared rings (as in libvchan).
>
> Constraints
> -----------
>
> * The AF_VSOCK address format is limited to a 32-bit CID and a 32-bit
>   port number.  This is sufficient as Xen only has 16-bit domain IDs.
>
> Risks and Volatile Areas
> ------------------------
>
> * The transport may be used between untrusted peers.  A domain may be
>   subject to malicious activity or denial of service attacks.
>
> Architecture
> ============
>
> Overview
> --------
>
> ![\label{fig_architecture}Architecture Overview](architecture.pdf)
>
> Linux's virtual sockets are used as the interface to applications.
> Virtual sockets were introduced in Linux 3.9 and provides a hypervisor
> independent[^1] interface to user space applications for inter-domain
> communication.
>
> [^1]: The API and address format is hypervisor independent but the
> address values are not.
>
> An internal API is provided to implement a low-level virtual socket
> transport.  This will be implemented within a pair of front and back
> drivers.  The use of the standard front/back driver method allows the
> toolstack to handle the suspend, resume and migration in a similar way
> to the existing drivers.
>
> The front/back pair provides a point-to-point link between the two
> domains.  This is used to communicate between applications on those
> hosts and between the frontend domain and the _connection manager_
> running on the backend.
>
> The connection manager allows domUs to request direct connections to
> peer domains.  Without the connection manager, peers have no mechanism
> to exchange the information ncessary for setting up the direct
> connections. The toolstack sets the policy in the connection manager
> to allow connection requests.  The default policy is to deny
> connection requests.
>
>
> High Level Design
> =================
>
> Virtual Sockets
> ---------------
>
> The AF_VSOCK socket address family in the Linux kernel has a two part
> address format: a uint32_t _context ID_ (_CID_) identifying the domain
> and a uint32_t port for the specific service in that domain.
>
> The CID shall be the domain ID and some CIDs have a specific meaning.
>
> CID                     Purpose
> -------------------     -------
> 0x7FF0 (DOMID_SELF)     The local domain.
> 0x7FF1                  The backend domain (where the connection manager
> is).

0x7FF1 is DOMID_IO which has a separate definition as far as Xen is
concerned.

Is it not possible for this information to be in xenstore?

>
> Some port numbers are reserved.
>
> Port    Purpose
> ----    -------
> 0       Reserved
> 1       Connection Manager
> 2-1023  Reserved for well-known services (such as a service discovery
> service).

If you are making use of DOMID_SELF, probably also make use of
DOMID_FIRST_RESERVED, which has the same numeric value.

>
> Front / Back Drivers
> --------------------
>
> Using a front or back driver to provide the virtual socket transport
> allows the toolstack to only make the inter-domain communication
> facility available to selected domains.
>
> The "standard" xenbus connection state machine shall be used. See
> figures \ref{fig_front-sm} and \ref{fig_back-sm} on pages
> \pageref{fig_front-sm} and \pageref{fig_back-sm}.
>
> ![\label{fig_front-sm}Frontend Connection State Machine](front-sm.pdf)
>
> ![\label{fig_back-sm}Backend Connection State Machine](back-sm.pdf)
>
>
> Connection Manager
> ------------------
>
> The connection manager has two main purposes.
>
> 1. Checking that two domains are permitted to connect.
>
> 2. Providing a mechanism for two domains to exchange the grant
>    references and event channels needed for them to setup a shared
>    ring transport.
>
> Domains commnicate with the connection manager over the front-back
> transport link.  The connection manager must be in the same domain as
> the virtual socket backend driver.
>
> The connection manager opens a virtual socket and listens on a well
> defined port (port 1).
>
> The following messages are defined.
>
> Message          Purpose
> -------          -------
> CONNECT_req      Request connection to another peer.
> CONNECT_rsp      Response to a connection request.
> CONNECT_ind      Indicate that a peer is trying to connect.
> CONNECT_ack      Acknowledge a connection request.
>
> ![\label{fig_conn-msc}Connect Message Sequence Chart](conn.pdf)
>
> Before forwarding a connection request to a peer, the connection
> manager checks that the connection is permitted.  The toolstack sets
> these permissions.
>
> Disconnecting transport links to an uncooperative (or dead) domain is
> required.  Therefore there are no messages for disconnecting transport
> links (as these may be ignore or delayed). Instead a transport link is
> disconnected by tearing down the local end. The peer will notice the
> remote end going away and then teardown its end.
>
> Low-level transport
> ===================
>
> [ This exact details are yet to be determined but this section should
>   provide a reasonably summary of the mechanisms used. ]
>
> Frontend and backend domains
> ----------------------------
>
> As is typical for frontend and backend drivers, the frontend will
> grant copy-only access to two rings -- one for from-front messages and
> one for to-front messages.  Each ring shall have an event channel for
> notifying when requests and responses are placed on the ring.

The term "grant copy-only" is very confusing to read in context. 
However I cant offhand think of a better way of describing it.

~Andrew

>
> Peer domains
> ------------
>
> The initiator grants copy-only access to a from-initiator (transmit)
> ring and provides an event channel for notifications for this ring.
> This information is included in the CONNECT_req and CONNECT_ind
> messages.
>
> The responder grants copy-only access to a from-responder (transmit)
> ring and provides an event channel for notifications for this ring.
> The information is included in the CONNECT_ack and CONNECT_rsp
> messages.
>
> After the initial connection, the two domains operate as identical
> peers.  Disconnection is signalled by a domain ungranting its transmit
> ring, notifying the peer via the associated event channel.  The event
> channel is then unbound.
>
> Appendix
> ========
>
> V4V
> ---
>
> An alternative low-level transport (V4V) has been proposed.  The
> hypervisor copies messages from the source domain into a destination
> ring provided by the destination domain.
>
> Because peers are untrusted, in order to prevent them from being able
> to denial-of-service the processing of messages from other peers, each
> receiver must have a per-peer receive ring.  A listening service does
> not know in advance which peers may connect so it cannot create these
> rings in advance.
>
> The connection manager service running in a trusted domain (as in the
> shared ring transport described above) may be used.  The CONNECT_ind
> message is used to trigger the creation of receive ring for that
> specific sender.
>
> A peer must be able to find the connection manager service both at
> start of day and if the connection manager service is restarted in a
> new domain.  This can be done in two possible ways:
>
> 1. Watch a Xenstore key which contains the connection manager service
>    domain ID.
>
> 2. Use a frontend/backend driver pair.
>
> ### Advantages
>
> * Does not use grant table resource.  If shared rings are used then a
>   busy guest with hundreds of peers will require more grant table
>   entries than the current default.
>
> ### Disadvantages
>
> * Any changes or extentions to the protocol or ring format would
>   require a hypervisor change.  This is more difficult than making
>   changes to guests.
>
> * The connection-less, "shared-bus" model of v4v is unsuitable for
>   untrusted peers.  This requires layering a connection model on top
>   and much of the simplicity of the v4v ABI is lost.
>
> * The mechanism for handling full destination rings will not scale up
>   on busy domains.  The event channel only indicates that some ring
>   may have space -- it does not identify which ring has space.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-11 18:07 Inter-domain Communication using Virtual Sockets (high-level design) David Vrabel
  2013-06-11 18:54 ` Andrew Cooper
@ 2013-06-13 16:27 ` Tim Deegan
  2013-06-17 16:19   ` David Vrabel
  2013-06-17 18:28   ` Ross Philipson
  2013-10-30 14:51 ` David Vrabel
  2 siblings, 2 replies; 10+ messages in thread
From: Tim Deegan @ 2013-06-13 16:27 UTC (permalink / raw)
  To: David Vrabel; +Cc: Vincent Hanquez, Ross Philipson, Xen-devel

Hi,

At 19:07 +0100 on 11 Jun (1370977636), David Vrabel wrote:
> This is a high-level design document for an inter-domain communication
> system under the virtual sockets API (AF_VSOCK) recently added to Linux.

This document covers a lot of ground (transport, namespace &c), and I'm
not sure where the AF_VSOCK interface comes in that.  E.g., are
communications with the 'connection manager' done by the application
(like DNS lookups) or by the kernel (like routing)?

> Purpose
> -------
> 
> In the Windsor architecture for XenServer, dom0 is disaggregated into
> several _service domains_.  Examples of service domains include
> network and storage driver domains, and qemu (stub) domains.
> 
> To allow the toolstack to manage service domains there needs to be a
> communication mechanism between the toolstack running in one domain and
> all the service domains.
> 
> The principle focus of this new transport is control-plane traffic

<nit>principal</nit>

> (low latency and low data rates) but consideration is given to future
> uses requiring higher data rates.
[...]
> Design Map
> ----------
> 
> The linux kernel requires a Xen-specific virtual socket transport and
> front and back drivers.
> 
> The connection manager is a new user space daemon running in the
> backend domain.

One in every domain that runs backends, or one for the whole system?

[...]
> Linux's virtual sockets are used as the interface to applications.
> Virtual sockets were introduced in Linux 3.9 and provides a hypervisor
> independent[^1] interface to user space applications for inter-domain
> communication.
> 
> [^1]: The API and address format is hypervisor independent but the
> address values are not.
> 
> An internal API is provided to implement a low-level virtual socket
> transport.  This will be implemented within a pair of front and back
> drivers.  The use of the standard front/back driver method allows the
> toolstack to handle the suspend, resume and migration in a similar way
> to the existing drivers.

What does that look like at the socket interface?  Would an AF_VSOCK
socket transparently stay open across migrate but connect to a different
backend?  Or would it be torn down and the application need to DTRT
about re-connecting?

> The front/back pair provides a point-to-point link between the two
> domains.  This is used to communicate between applications on those
> hosts and between the frontend domain and the _connection manager_
> running on the backend.
> 
> The connection manager allows domUs to request direct connections to
> peer domains.  Without the connection manager, peers have no mechanism
> to exchange the information ncessary for setting up the direct
> connections.

Sure they do -- they can use any existing shared namespace.  Xenstore
is the obvious candidate, but there's always DNS, or twitter. :P

> The toolstack sets the policy in the connection manager
> to allow connection requests.  The default policy is to deny
> connection requests.

Hmmm.  Since the underlying transports use their own ACLs (e.g. grant
tables), the connection manager can't actually stop two domains from
communicating.  You'd need to use XSM for that.

> High Level Design
> =================
> 
> Virtual Sockets
> ---------------
> 
> The AF_VSOCK socket address family in the Linux kernel has a two part
> address format: a uint32_t _context ID_ (_CID_) identifying the domain
> and a uint32_t port for the specific service in that domain.
> 
> The CID shall be the domain ID and some CIDs have a specific meaning.
> 
> CID                     Purpose
> -------------------     -------
> 0x7FF0 (DOMID_SELF)     The local domain.
> 0x7FF1                  The backend domain (where the connection manager
> is).

OK, so there's only one connection manager.  And the connection manager
has an address at the socket interface -- does that mean application
code should connect to it and send it requests?  But the information in
those requests is only useful to the code below the socket interface.

> Connection Manager
> ------------------
> 
> The connection manager has two main purposes.
> 
> 1. Checking that two domains are permitted to connect.

As I said, I don't think that can work.

> 2. Providing a mechanism for two domains to exchange the grant
>    references and event channels needed for them to setup a shared
>    ring transport.

If they already want to talk to each other, they can communicate all
that in a single grant ref (which is the same size as an AF_VSOCK port).

So I guess the purpose is multiplexing connection requests: some sort of
listener in the 'backend' must already be talking to the manager (and
because you need the manager to broker new connections, so must the
frontend).

Wait, is this connection manager just xenstore in a funny hat?  Or could
it be implemented by adding a few new node/permission types to xenstore?

> Domains commnicate with the connection manager over the front-back
> transport link.  The connection manager must be in the same domain as
> the virtual socket backend driver.
> 
> The connection manager opens a virtual socket and listens on a well
> defined port (port 1).
> 
> The following messages are defined.
> 
> Message          Purpose
> -------          -------
> CONNECT_req      Request connection to another peer.
> CONNECT_rsp      Response to a connection request.
> CONNECT_ind      Indicate that a peer is trying to connect.
> CONNECT_ack      Acknowledge a connection request.

Again, are these messages carried in a socket connection, or done under
the hood on a non-socket channel?  Or some mix of the two?  I think I
must be missing some key part of the picture. :)

> V4V
> ---
> ### Advantages
> 
> * Does not use grant table resource.  If shared rings are used then a
>   busy guest with hundreds of peers will require more grant table
>   entries than the current default.
> 
> ### Disadvantages
> 
> * Any changes or extentions to the protocol or ring format would
>   require a hypervisor change.  This is more difficult than making
>   changes to guests.

In practice, it's often easier to upgrade the hypervisor than the guest
kernels, but I agree that it's bad to have mechanism in the hypervisor.

> * The connection-less, "shared-bus" model of v4v is unsuitable for
>   untrusted peers.  This requires layering a connection model on top
>   and much of the simplicity of the v4v ABI is lost.

I think that if v4v can't manage a listen/connect model, then that's a
bug in v4v rather than a design-level drawback.  My understanding was
that the shared-receiver ring was intended to serve this purpose, and
that v4vtables would be used to silence over-loud peers (much like the
ACL you propose for the connection manager).  Ross?

> * The mechanism for handling full destination rings will not scale up
>   on busy domains.  The event channel only indicates that some ring
>   may have space -- it does not identify which ring has space.

That's a fair point, which you raised on the v4v thread, and one that I
expect Ross to address.

I'd be very interested to hear the v4v authors' opinions on this VSOCK
draft, btw -- in particular if it (or something similar) can provide all
v4v's features without new hypervisor code, I'd very much prefer it.

Cheers,

Tim.


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-13 16:27 ` Tim Deegan
@ 2013-06-17 16:19   ` David Vrabel
  2013-06-20 11:15     ` Tim Deegan
  2013-06-17 18:28   ` Ross Philipson
  1 sibling, 1 reply; 10+ messages in thread
From: David Vrabel @ 2013-06-17 16:19 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Vincent Hanquez, Ross Philipson, Xen-devel

On 13/06/13 17:27, Tim Deegan wrote:
> Hi,
> 
> At 19:07 +0100 on 11 Jun (1370977636), David Vrabel wrote:
>> This is a high-level design document for an inter-domain communication
>> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
> 
> This document covers a lot of ground (transport, namespace &c), and I'm
> not sure where the AF_VSOCK interface comes in that.  E.g., are
> communications with the 'connection manager' done by the application
> (like DNS lookups) or by the kernel (like routing)?

The doc doesn't really explain this.

The connection manager is a user space process that opens an AF_VSOCK
listening socket on port 1.  The vsock transport of the frontend
effectively connects to this port (but since it's in kernel code it
doesn't use the socket API).

>> Design Map
>> ----------
>>
>> The linux kernel requires a Xen-specific virtual socket transport and
>> front and back drivers.
>>
>> The connection manager is a new user space daemon running in the
>> backend domain.
> 
> One in every domain that runs backends, or one for the whole system?

One per backend, but I would anticipate there being only one backend for
most hosts.

>> Linux's virtual sockets are used as the interface to applications.
>> Virtual sockets were introduced in Linux 3.9 and provides a hypervisor
>> independent[^1] interface to user space applications for inter-domain
>> communication.
>>
>> [^1]: The API and address format is hypervisor independent but the
>> address values are not.
>>
>> An internal API is provided to implement a low-level virtual socket
>> transport.  This will be implemented within a pair of front and back
>> drivers.  The use of the standard front/back driver method allows the
>> toolstack to handle the suspend, resume and migration in a similar way
>> to the existing drivers.
> 
> What does that look like at the socket interface?  Would an AF_VSOCK
> socket transparently stay open across migrate but connect to a different
> backend?  Or would it be torn down and the application need to DTRT
> about re-connecting?

All connections are disconnected on migration.  The applications will
need to be able to handle this.

The initial use case for this (in XenServer) is for service domains
which would not be migrated anyway.

>> The front/back pair provides a point-to-point link between the two
>> domains.  This is used to communicate between applications on those
>> hosts and between the frontend domain and the _connection manager_
>> running on the backend.
>>
>> The connection manager allows domUs to request direct connections to
>> peer domains.  Without the connection manager, peers have no mechanism
>> to exchange the information ncessary for setting up the direct
>> connections.
> 
> Sure they do -- they can use any existing shared namespace.  Xenstore
> is the obvious candidate, but there's always DNS, or twitter. :P

I meant we need to /define/ a mechanism.  Using twitter might be fun but
it does need to be something within the host ;).

>> The toolstack sets the policy in the connection manager
>> to allow connection requests.  The default policy is to deny
>> connection requests.
> 
> Hmmm.  Since the underlying transports use their own ACLs (e.g. grant
> tables), the connection manager can't actually stop two domains from
> communicating.  You'd need to use XSM for that.

I think there are two security concerns here.

1. Preventing two co-operating domains from setting up a communication
channel.

And,

2. Preventing a domain from connecting to vsock services listening in
another domain.

As you say, the connection manager does not address the first and XSM
would be needed.  This isn't something introduced by this design though.

For the second, I think the connection manager does work here and I
think it is useful to have this level of security without having a
requirement to use XSM.

>> High Level Design
>> =================
>>
>> Virtual Sockets
>> ---------------
>>
>> The AF_VSOCK socket address family in the Linux kernel has a two part
>> address format: a uint32_t _context ID_ (_CID_) identifying the domain
>> and a uint32_t port for the specific service in that domain.
>>
>> The CID shall be the domain ID and some CIDs have a specific meaning.
>>
>> CID                     Purpose
>> -------------------     -------
>> 0x7FF0 (DOMID_SELF)     The local domain.
>> 0x7FF1                  The backend domain (where the connection manager
>> is).
> 
> OK, so there's only one connection manager.  And the connection manager
> has an address at the socket interface -- does that mean application
> code should connect to it and send it requests?  But the information in
> those requests is only useful to the code below the socket interface.

I think I addressed this above.

>> Connection Manager
>> ------------------
>>
>> The connection manager has two main purposes.
>>
>> 1. Checking that two domains are permitted to connect.
> 
> As I said, I don't think that can work.
> 
>> 2. Providing a mechanism for two domains to exchange the grant
>>    references and event channels needed for them to setup a shared
>>    ring transport.
> 
> If they already want to talk to each other, they can communicate all
> that in a single grant ref (which is the same size as an AF_VSOCK port).

The shared rings are per-peer not per-listener.  If a peer becomes
compromised and starts trying a DoS attack (for example), the ring can
be shutdown without impacting other guests.

> So I guess the purpose is multiplexing connection requests: some sort of
> listener in the 'backend' must already be talking to the manager (and
> because you need the manager to broker new connections, so must the
> frontend).
> 
> Wait, is this connection manager just xenstore in a funny hat?  Or could
> it be implemented by adding a few new node/permission types to xenstore?

Er yes, I think this is just xenstore in a funny hat.  Reusing xenstore
would seem preferable to implementing a new daemon.

>> Domains commnicate with the connection manager over the front-back
>> transport link.  The connection manager must be in the same domain as
>> the virtual socket backend driver.
>>
>> The connection manager opens a virtual socket and listens on a well
>> defined port (port 1).
>>
>> The following messages are defined.
>>
>> Message          Purpose
>> -------          -------
>> CONNECT_req      Request connection to another peer.
>> CONNECT_rsp      Response to a connection request.
>> CONNECT_ind      Indicate that a peer is trying to connect.
>> CONNECT_ack      Acknowledge a connection request.
> 
> Again, are these messages carried in a socket connection, or done under
> the hood on a non-socket channel?  Or some mix of the two?  I think I
> must be missing some key part of the picture. :)
> 
>> V4V
>> ---
>> ### Advantages
>>
>> * Does not use grant table resource.  If shared rings are used then a
>>   busy guest with hundreds of peers will require more grant table
>>   entries than the current default.
>>
>> ### Disadvantages
>>
>> * Any changes or extentions to the protocol or ring format would
>>   require a hypervisor change.  This is more difficult than making
>>   changes to guests.
> 
> In practice, it's often easier to upgrade the hypervisor than the guest
> kernels, but I agree that it's bad to have mechanism in the hypervisor.

If this mechanism needs to be extended, the backend domain can be
restarted with a new kernel with minimal impact to already running guests.

>> * The connection-less, "shared-bus" model of v4v is unsuitable for
>>   untrusted peers.  This requires layering a connection model on top
>>   and much of the simplicity of the v4v ABI is lost.
> 
> I think that if v4v can't manage a listen/connect model, then that's a
> bug in v4v rather than a design-level drawback.  My understanding was
> that the shared-receiver ring was intended to serve this purpose, and
> that v4vtables would be used to silence over-loud peers (much like the
> ACL you propose for the connection manager).  Ross?

The v4vtable rules can only be modified by a privileged domain.  Other
guests would need some way to request new rules or the ability to set
some per-receive-ring rules.

>> * The mechanism for handling full destination rings will not scale up
>>   on busy domains.  The event channel only indicates that some ring
>>   may have space -- it does not identify which ring has space.
> 
> That's a fair point, which you raised on the v4v thread, and one that I
> expect Ross to address.
>
> I'd be very interested to hear the v4v authors' opinions on this VSOCK
> draft, btw -- in particular if it (or something similar) can provide all
> v4v's features without new hypervisor code, I'd very much prefer it.

David


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-13 16:27 ` Tim Deegan
  2013-06-17 16:19   ` David Vrabel
@ 2013-06-17 18:28   ` Ross Philipson
  2013-06-20 11:05     ` David Vrabel
  2013-06-20 11:30     ` Tim Deegan
  1 sibling, 2 replies; 10+ messages in thread
From: Ross Philipson @ 2013-06-17 18:28 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Vincent Hanquez, David Vrabel, Xen-devel

On 06/13/2013 12:27 PM, Tim Deegan wrote:
> Hi,
>
> At 19:07 +0100 on 11 Jun (1370977636), David Vrabel wrote:
>> This is a high-level design document for an inter-domain communication
>> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
>
> This document covers a lot of ground (transport, namespace&c), and I'm
> not sure where the AF_VSOCK interface comes in that.  E.g., are
> communications with the 'connection manager' done by the application
> (like DNS lookups) or by the kernel (like routing)?
>
>> Purpose
>> -------
>>
>> In the Windsor architecture for XenServer, dom0 is disaggregated into
>> several _service domains_.  Examples of service domains include
>> network and storage driver domains, and qemu (stub) domains.
>>
>> To allow the toolstack to manage service domains there needs to be a
>> communication mechanism between the toolstack running in one domain and
>> all the service domains.
>>
>> The principle focus of this new transport is control-plane traffic
>
> <nit>principal</nit>
>
>> (low latency and low data rates) but consideration is given to future
>> uses requiring higher data rates.
> [...]
>> Design Map
>> ----------
>>
>> The linux kernel requires a Xen-specific virtual socket transport and
>> front and back drivers.
>>
>> The connection manager is a new user space daemon running in the
>> backend domain.
>
> One in every domain that runs backends, or one for the whole system?
>
> [...]
>> Linux's virtual sockets are used as the interface to applications.
>> Virtual sockets were introduced in Linux 3.9 and provides a hypervisor
>> independent[^1] interface to user space applications for inter-domain
>> communication.
>>
>> [^1]: The API and address format is hypervisor independent but the
>> address values are not.
>>
>> An internal API is provided to implement a low-level virtual socket
>> transport.  This will be implemented within a pair of front and back
>> drivers.  The use of the standard front/back driver method allows the
>> toolstack to handle the suspend, resume and migration in a similar way
>> to the existing drivers.
>
> What does that look like at the socket interface?  Would an AF_VSOCK
> socket transparently stay open across migrate but connect to a different
> backend?  Or would it be torn down and the application need to DTRT
> about re-connecting?
>
>> The front/back pair provides a point-to-point link between the two
>> domains.  This is used to communicate between applications on those
>> hosts and between the frontend domain and the _connection manager_
>> running on the backend.
>>
>> The connection manager allows domUs to request direct connections to
>> peer domains.  Without the connection manager, peers have no mechanism
>> to exchange the information ncessary for setting up the direct
>> connections.
>
> Sure they do -- they can use any existing shared namespace.  Xenstore
> is the obvious candidate, but there's always DNS, or twitter. :P
>
>> The toolstack sets the policy in the connection manager
>> to allow connection requests.  The default policy is to deny
>> connection requests.
>
> Hmmm.  Since the underlying transports use their own ACLs (e.g. grant
> tables), the connection manager can't actually stop two domains from
> communicating.  You'd need to use XSM for that.
>
>> High Level Design
>> =================
>>
>> Virtual Sockets
>> ---------------
>>
>> The AF_VSOCK socket address family in the Linux kernel has a two part
>> address format: a uint32_t _context ID_ (_CID_) identifying the domain
>> and a uint32_t port for the specific service in that domain.
>>
>> The CID shall be the domain ID and some CIDs have a specific meaning.
>>
>> CID                     Purpose
>> -------------------     -------
>> 0x7FF0 (DOMID_SELF)     The local domain.
>> 0x7FF1                  The backend domain (where the connection manager
>> is).
>
> OK, so there's only one connection manager.  And the connection manager
> has an address at the socket interface -- does that mean application
> code should connect to it and send it requests?  But the information in
> those requests is only useful to the code below the socket interface.
>
>> Connection Manager
>> ------------------
>>
>> The connection manager has two main purposes.
>>
>> 1. Checking that two domains are permitted to connect.
>
> As I said, I don't think that can work.
>
>> 2. Providing a mechanism for two domains to exchange the grant
>>     references and event channels needed for them to setup a shared
>>     ring transport.
>
> If they already want to talk to each other, they can communicate all
> that in a single grant ref (which is the same size as an AF_VSOCK port).
>
> So I guess the purpose is multiplexing connection requests: some sort of
> listener in the 'backend' must already be talking to the manager (and
> because you need the manager to broker new connections, so must the
> frontend).
>
> Wait, is this connection manager just xenstore in a funny hat?  Or could
> it be implemented by adding a few new node/permission types to xenstore?
>
>> Domains commnicate with the connection manager over the front-back
>> transport link.  The connection manager must be in the same domain as
>> the virtual socket backend driver.
>>
>> The connection manager opens a virtual socket and listens on a well
>> defined port (port 1).
>>
>> The following messages are defined.
>>
>> Message          Purpose
>> -------          -------
>> CONNECT_req      Request connection to another peer.
>> CONNECT_rsp      Response to a connection request.
>> CONNECT_ind      Indicate that a peer is trying to connect.
>> CONNECT_ack      Acknowledge a connection request.
>
> Again, are these messages carried in a socket connection, or done under
> the hood on a non-socket channel?  Or some mix of the two?  I think I
> must be missing some key part of the picture. :)
>
>> V4V
>> ---
>> ### Advantages
>>
>> * Does not use grant table resource.  If shared rings are used then a
>>    busy guest with hundreds of peers will require more grant table
>>    entries than the current default.
>>
>> ### Disadvantages
>>
>> * Any changes or extentions to the protocol or ring format would
>>    require a hypervisor change.  This is more difficult than making
>>    changes to guests.
>
> In practice, it's often easier to upgrade the hypervisor than the guest
> kernels, but I agree that it's bad to have mechanism in the hypervisor.
>
>> * The connection-less, "shared-bus" model of v4v is unsuitable for
>>    untrusted peers.  This requires layering a connection model on top
>>    and much of the simplicity of the v4v ABI is lost.
>
> I think that if v4v can't manage a listen/connect model, then that's a
> bug in v4v rather than a design-level drawback.  My understanding was
> that the shared-receiver ring was intended to serve this purpose, and
> that v4vtables would be used to silence over-loud peers (much like the
> ACL you propose for the connection manager).  Ross?

We are looking into enhancing this. For one thing, we need some level of 
control over connection management in the core code for it to work 
cleanly with AF_VSOCK. We also have plans to allow the v4vtables to be 
managed by guests too. We are planning a significant overhaul of the 
v4vtables to improve them.

>
>> * The mechanism for handling full destination rings will not scale up
>>    on busy domains.  The event channel only indicates that some ring
>>    may have space -- it does not identify which ring has space.
>
> That's a fair point, which you raised on the v4v thread, and one that I
> expect Ross to address.

We are investigating ways to improve this - ways to relieve the guests 
of the burden of scanning all rings to find what changed.

>
> I'd be very interested to hear the v4v authors' opinions on this VSOCK
> draft, btw -- in particular if it (or something similar) can provide all
> v4v's features without new hypervisor code, I'd very much prefer it.

I guess I cannot be 100% sure just by reading the part of the spec on
the low level transport mechanism. We originally tried to use a grant
based model and ran into issues. Two of the most pronounced were:

  - Failure of grantees to release grants would cause hung domains under 
certain situations. This was discussed early in the V4V RFC work that 
Jean G. did. I am not sure if this has been fixed and if so, how. There 
was a suggestion about a fix in a reply from Daniel a while back.

  - Synchronization between guests was very complicated without a 
central arbitrator like the hypervisor.

Also this solution may have some scaling issues. If I understand the 
model being proposed here, each ring which I guess is a connection 
consumes an event channel. In the large number of connections scenario 
is this not a scaling problem? I may not fully understand the proposed 
low level transport spec.

>
> Cheers,
>
> Tim.


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-17 18:28   ` Ross Philipson
@ 2013-06-20 11:05     ` David Vrabel
  2013-06-20 11:30     ` Tim Deegan
  1 sibling, 0 replies; 10+ messages in thread
From: David Vrabel @ 2013-06-20 11:05 UTC (permalink / raw)
  To: Ross Philipson; +Cc: Vincent Hanquez, Tim Deegan, Xen-devel

On 17/06/13 19:28, Ross Philipson wrote:
> On 06/13/2013 12:27 PM, Tim Deegan wrote:
>> Hi,
>>
>> At 19:07 +0100 on 11 Jun (1370977636), David Vrabel wrote:
>>> This is a high-level design document for an inter-domain communication
>>> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
>>
>> I'd be very interested to hear the v4v authors' opinions on this VSOCK
>> draft, btw -- in particular if it (or something similar) can provide all
>> v4v's features without new hypervisor code, I'd very much prefer it.
> 
> I guess I cannot be 100% just by reading the part of the spec on the low
> level transport mechanism. We originally tried to use a grant based
> model and ran into issue. Two of the most pronounced were:
> 
>  - Failure of grantees to release grants would cause hung domains under
> certain situations. This was discussed early in the V4V RFC work that
> Jean G. did. I am not sure if this has been fixed and if so, how. There
> was a suggestion about a fix in a reply from Daniel a while back.

The use of grants that only permit copying (i.e., no map/unmap) should
avoid any issues like these.  The granter can revoke a copy-only grant
at any time.

>  - Synchronization between guests was very complicated without a central
> arbitrator like the hypervisor.

I'm not sure what you mean here.  What are you synchronizing?

> Also this solution may have some scaling issues. If I understand the
> model being proposed here, each ring which I guess is a connection
> consumes an event channel. In the large number of connections scenario
> is this not a scaling problem? I may not fully understand the proposed
> low level transport spec.

If there are N bits of work to do, N messages to resend for example,
then it doesn't matter if we have N notifications via event channels or
1 notification and some other data structure listing the N peers that
need work -- it's the same amount of work.

The number of event channels being a hard scalability limit will be
removed in Xen 4.4 (using one of the two proposals for an extended event
channel ABI).

David


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-17 16:19   ` David Vrabel
@ 2013-06-20 11:15     ` Tim Deegan
  0 siblings, 0 replies; 10+ messages in thread
From: Tim Deegan @ 2013-06-20 11:15 UTC (permalink / raw)
  To: David Vrabel; +Cc: Vincent Hanquez, Ross Philipson, Xen-devel

At 17:19 +0100 on 17 Jun (1371489597), David Vrabel wrote:
> The connection manager is a user space process that opens a AF_VSOCK
> listening socket on port 1. 

OK; and the kernel transport in the backend plumbs that over a
pre-arranged shared ring to the kernel transport in the frontend, which
terminates it there (i.e. all traffic on that link is connection-setup
chatter and frontend userspace can't actually talk to the manager)?

In that case I think that giving the manager a socket-level name
(i.e. '0x7ff1:1') is just confusing (at least it confused me!), since
it's not really a socket connection, at least at that end.

> The vsock transport of the frontend
> effectively connects to this port (but since its in kernel code it
> doesn't use the socket API).

yep.

> > What does that look like at the socket interface?  Would an AF_VSOCK
> > socket transparently stay open across migrate but connect to a different
> > backend?  Or would it be torn down and the application need to DTRT
> > about re-connecting?
> 
> All connections are disconnected on migration.  The applications will
> need to be able to handle this.

yep.

> >> The toolstack sets the policy in the connection manager
> >> to allow connection requests.  The default policy is to deny
> >> connection requests.
> > 
> > Hmmm.  Since the underlying transports use their own ACLs (e.g. grant
> > tables), the connection manager can't actually stop two domains from
> > communicating.  You'd need to use XSM for that.
> 
> I think there are two security concerns here.
> 
> 1. Preventing two co-operating domains from setting up a communication
> channel.
> 
> And,
> 
> 2. Preventing a domain from connecting to vsock services listening in
> another domain.
> 
> As you say, the connection manager does not address the first and XSM
> would be needed.  This isn't something introduced by this design though.

Agreed.

> For the second, I think the connection manager does work here and I
> think it is useful to have this level of security without having a
> requirement to use XSM.

Fair enough.  Maybe it just needs a big warning in the docs saying
"don't think you can use this to isolate VMs; there are other channels
besides VSOCK".

> >> 2. Providing a mechanism for two domains to exchange the grant
> >>    references and event channels needed for them to setup a shared
> >>    ring transport.
> > 
> > If they already want to talk to each other, they can communicate all
> > that in a single grant ref (which is the same size as an AF_VSOCK port).
> 
> The shared rings are per-peer not per-listener.  If a peer becomes
> compromised and starts trying a DoS attack (for example), the ring can
> be shutdown without impacting other guests.

What I meant to say was: if the frontend has a 64-bit address, and the
backend is expecting the connection, you could just make the address be
domid::grantid and stuff the event-channel info into the shared page. 

But I see now that the actual interesting part is in brokering
connection requests from as-yet-unknown peers.  That leads to the next
point...

> > So I guess the purpose is multiplexing connection requests: some sort of
> > listener in the 'backend' must already be talking to the manager (and
> > because you need the manager to broker new connections, so must the
> > frontend).
> > 
> > Wait, is this connection manager just xenstore in a funny hat?  Or could
> > it be implemented by adding a few new node/permission types to xenstore?
> 
> Er yes, I think this is just xenstore in a funny hat.  Reusing xenstore
> would seem preferable to implementing a new daemon.

That sounds good to me.  I think that some equivalent of the unix sticky
bit could make this brokering fit into the xenstore model.  Maybe we
could have a type of node where other VMs could make subnodes as long as
those subnodes were named with the creator's domid/uuid.  Or something
along those lines.

Cheers,

Tim.


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-17 18:28   ` Ross Philipson
  2013-06-20 11:05     ` David Vrabel
@ 2013-06-20 11:30     ` Tim Deegan
  2013-06-20 14:11       ` Ross Philipson
  1 sibling, 1 reply; 10+ messages in thread
From: Tim Deegan @ 2013-06-20 11:30 UTC (permalink / raw)
  To: Ross Philipson; +Cc: Vincent Hanquez, David Vrabel, Xen-devel

Hi,

At 14:28 -0400 on 17 Jun (1371479326), Ross Philipson wrote:
> >I'd be very interested to hear the v4v authors' opinions on this VSOCK
> >draft, btw -- in particular if it (or something similar) can provide all
> >v4v's features without new hypervisor code, I'd very much prefer it.
> 
> I guess I cannot be 100% sure just by reading the part of the spec on
> the low-level transport mechanism. We originally tried to use a
> grant-based model and ran into issues. Two of the most pronounced were:
> 
>  - Failure of grantees to release grants would cause hung domains under 
> certain situations. This was discussed early in the V4V RFC work that 
> Jean G. did. I am not sure if this has been fixed and if so, how. There 
> was a suggestion about a fix in a reply from Daniel a while back.

I think that using grant-copy can sort this out.  I believe that with v2
grant tables a grant can be marked as 'copy-only'.
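
A minimal sketch of what a copy-only receive could look like from a
guest kernel, using the public grant-table interface (how the local
frame and the peer's grant reference are obtained is assumed, not taken
from the draft).  The point is that nothing is ever mapped: the
hypervisor copies the data, so a buggy or malicious peer has no mapping
of ours to fail to release:

    #include <linux/errno.h>
    #include <xen/interface/xen.h>
    #include <xen/interface/grant_table.h>
    #include <asm/xen/hypercall.h>

    static int idc_copy_from_peer(domid_t peer, grant_ref_t gref,
                                  unsigned long local_gfn, unsigned int len)
    {
        struct gnttab_copy op = {
            .source.u.ref = gref,       /* peer's grant of its ring page */
            .source.domid = peer,
            .dest.u.gmfn  = local_gfn,  /* our own, never-granted page */
            .dest.domid   = DOMID_SELF,
            .len          = len,
            .flags        = GNTCOPY_source_gref,
        };

        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);

        return op.status == GNTST_okay ? 0 : -EIO;
    }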

>  - Synchronization between guests was very complicated without a 
> central arbitrator like the hypervisor.

I think that the VSOCK backend is intended to be that arbitrator, but
with the nice properties of allowing multiple arbitrators in a
partitioned system (with independent administrators) and of moving all
the arbitration code out of the hypervisor.

The down-side is that rather than allowing a generic many-to-one
multiplexed channel, VSOCK would provide such a channel _only_ for
connection requests (or at least, adding other uses might require
changing the manager).  That seems OK to me, but you might have other
use cases?

Another down-side is that having to bounce requests off an intermediate
VM will add some latency, but again, if it's only at connection-setup
time, that seems OK.

> Also this solution may have some scaling issues. If I understand the
> model being proposed here, each ring (which I guess is a connection)
> consumes an event channel. With a large number of connections, is this
> not a scaling problem?

I think it relies on the proposed changes to extend the number of event
channels; other than that I suspect it will scale better than the
current v4v 'select' model, where the client must scan every ring
looking for the one that's changed.
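
For illustration (assuming the Linux event-channel API of the time and a
hypothetical per-connection structure), each ring would bind its own
interdomain event channel to an IRQ handler, so a notification
identifies exactly the connection that changed and no scanning is
needed:

    #include <linux/interrupt.h>
    #include <xen/events.h>

    struct idc_conn {
        unsigned int ring_has_data;   /* hypothetical per-ring state */
    };

    static irqreturn_t idc_ring_interrupt(int irq, void *dev_id)
    {
        struct idc_conn *conn = dev_id;

        /* Only this connection's ring needs to be looked at. */
        conn->ring_has_data = 1;
        return IRQ_HANDLED;
    }

    static int idc_bind_ring_evtchn(struct idc_conn *conn,
                                    unsigned int peer_domid,
                                    unsigned int remote_port)
    {
        /* Returns the Linux IRQ number, or a negative error. */
        return bind_interdomain_evtchn_to_irqhandler(peer_domid, remote_port,
                                                     idc_ring_interrupt, 0,
                                                     "idc-ring", conn);
    }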

Cheers,

Tim.


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-20 11:30     ` Tim Deegan
@ 2013-06-20 14:11       ` Ross Philipson
  0 siblings, 0 replies; 10+ messages in thread
From: Ross Philipson @ 2013-06-20 14:11 UTC (permalink / raw)
  To: Tim (Xen.org); +Cc: Vincent Hanquez, David Vrabel, Xen-devel

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, June 20, 2013 7:30 AM
> To: Ross Philipson
> Cc: David Vrabel; Xen-devel@lists.xen.org; Vincent Hanquez
> Subject: Re: [Xen-devel] Inter-domain Communication using Virtual
> Sockets (high-level design)
> 
> Hi,
> 
> At 14:28 -0400 on 17 Jun (1371479326), Ross Philipson wrote:
> > >I'd be very interested to hear the v4v authors' opinions on this VSOCK
> > >draft, btw -- in particular if it (or something similar) can provide all
> > >v4v's features without new hypervisor code, I'd very much prefer it.
> >
> > I guess I cannot be 100% sure just by reading the part of the spec on
> > the low-level transport mechanism. We originally tried to use a
> > grant-based model and ran into issues. Two of the most pronounced were:
> >
> >  - Failure of grantees to release grants would cause hung domains
> > under certain situations. This was discussed early in the V4V RFC work
> > that Jean G. did. I am not sure if this has been fixed and if so, how.
> > There was a suggestion about a fix in a reply from Daniel a while back.
> 
> I think that using grant-copy can sort this out.  I believe that with v2
> grant tables a grant can be marked as 'copy-only'.
> 
> >  - Synchronization between guests was very complicated without a
> > central arbitrator like the hypervisor.
> 
> I think that the VSOCK backend is intended to be that arbitrator, but
> with the nice properties of allowing multiple arbitrators in a
> partitioned system (with independent administrators) and of moving all
> the arbitration code out of the hypervisor.
> 
> The down-side is that rather than allowing a generic many-to-one
> multiplexed channel, VSOCK would provide such a channel _only_ for
> connection requests (or at least, adding other uses might require
> changing the manager).  That seems OK to me, but you might have other
> use cases?
> 
> Another down-side is having to bounce requests off an intermediate VM
> will add some latency, but again if it's only at connection-setup time
> that seems OK.
> 
> > Also this solution may have some scaling issues. If I understand the
> > model being proposed here, each ring (which I guess is a connection)
> > consumes an event channel. With a large number of connections, is this
> > not a scaling problem?
> 
> I think it relies on the proposed changes to extend the number of event
> channels; other than that I suspect it will scale better than the
> current v4v 'select' model, where the client must scan every ring
> looking for the one that's changed.

I agree that it scales better as things stand now. We are exploring
solutions to remove this limitation and to provide a guest with
information on what has changed.

> 
> Cheers,
> 
> Tim.


* Re: Inter-domain Communication using Virtual Sockets (high-level design)
  2013-06-11 18:07 Inter-domain Communication using Virtual Sockets (high-level design) David Vrabel
  2013-06-11 18:54 ` Andrew Cooper
  2013-06-13 16:27 ` Tim Deegan
@ 2013-10-30 14:51 ` David Vrabel
  2 siblings, 0 replies; 10+ messages in thread
From: David Vrabel @ 2013-10-30 14:51 UTC (permalink / raw)
  To: Xen-devel; +Cc: Philip Tricca, Vincent Hanquez, Ross Philipson

On 11/06/13 19:07, David Vrabel wrote:
> All,
> 
> This is a high-level design document for an inter-domain communication
> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
> 
> Two low-level transports are discussed: a shared ring based one
> requiring no additional hypervisor support and v4v.
> 
> The PDF (including the diagrams) is available here:
> 
> http://xenbits.xen.org/people/dvrabel/inter-domain-comms-C.pdf

This design was mentioned in a Xen Dev. Summit presentation and I was
reminded of the prototype I wrote a while back.  I haven't yet had the
time to update the design document to reflect the outcome of the prototype.

The prototype is available in this git repo:

git://xenbits.xen.org/people/dvrabel/idc-prototype.git

The prototype is entirely in userspace. A daemon (one per domain) takes
the role of the kernel, providing a system-call-like interface to other
programs (via XML-RPC).

The calls provided (and their POSIX equivalents) are listed below,
followed by a sketch of typical client usage:

    idc_connect() (socket + bind + connect)
    idc_disconnect() (close/shutdown)
    idc_listen() (socket + bind + listen)
    idc_accept() (accept)
    idc_unlisten() (close)
    idc_send() (send)
    idc_recv() (recv)
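
A hypothetical client-side usage sketch (the real calls are XML-RPC to
the per-domain daemon; the C signatures below are invented for
illustration, the actual interface is in the git repo):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed signatures -- not taken from the prototype. */
    int idc_connect(uint16_t domid, uint32_t port); /* returns a connection id */
    int idc_send(int conn, const void *buf, size_t len);
    int idc_recv(int conn, void *buf, size_t len);
    int idc_disconnect(int conn);

    int main(void)
    {
        int conn = idc_connect(3, 80);  /* connect to port 80 in domain 3 */
        if (conn < 0)
            return 1;

        const char msg[] = "hello";
        idc_send(conn, msg, sizeof(msg));

        char reply[128];
        int n = idc_recv(conn, reply, sizeof(reply));
        if (n > 0)
            printf("got %d bytes\n", n);

        idc_disconnect(conn);
        return 0;
    }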

All connections between two domains are multiplexed over the same link.
The setup of the link is done via Xenstore (see link_mgr.c for the
sequence of operations) and connections are then requested using a
CONNECT_req/CONNECT_rsp pair over this link. The data link itself uses
libxenvchan.

Data is encapsulated in DATA_ind messages.

Connections are disconnected with a DISCONNECT_ind message. If a link
has no further connections using it, it is disconnected. Disconnecting a
link requires no co-operation from the other peer (the DISCONNECT_ind is
advisory and has no response), so links can be disconnected at any time
if the remote end is misbehaving.
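
Purely as an illustration of the framing described above (the actual
wire format is whatever the prototype repository defines; the fields and
sizes here are assumptions):

    #include <stdint.h>

    enum idc_msg_type {
        IDC_CONNECT_REQ,
        IDC_CONNECT_RSP,
        IDC_DATA_IND,
        IDC_DISCONNECT_IND,
    };

    struct idc_msg_hdr {
        uint8_t  type;      /* enum idc_msg_type */
        uint8_t  pad;
        uint16_t conn_id;   /* which multiplexed connection on this link */
        uint32_t len;       /* payload length following the header */
    };

    struct idc_connect_req {
        struct idc_msg_hdr hdr;
        uint32_t port;      /* service port being connected to */
    };

    struct idc_connect_rsp {
        struct idc_msg_hdr hdr;
        int32_t  status;    /* 0 on success, negative on refusal */
    };

    /* DATA_ind carries hdr + payload; DISCONNECT_ind carries hdr only. */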

The prototype has some key shortcomings:

    It uses grant map/unmap and not grant copies. A replacement link
layer using only grant copies should be a drop-in replacement for the
existing use of libvchan.

    The way the XML-RPC library is used means it serializes all RPCs.
Simultaneous send/recv/accept/etc. is not possible as these calls may
block. This is only a limitation of the RPC implementation.

To run the demo/test programs:

    Install libxenctrl, libxenstore, libxenvchan into each domain (DUT).

    Run ./idc-setup domid... with all DUTs. This will set up the xenstore
keys to allow these domains to connect to each other.

    Run ./link_mgr in each DUT.

    In one domain run ./test_accept 80 to listen for a connection on
port 80.

    In another domain run ./test_conn domid to connect.

David


Thread overview: 10+ messages
2013-06-11 18:07 Inter-domain Communication using Virtual Sockets (high-level design) David Vrabel
2013-06-11 18:54 ` Andrew Cooper
2013-06-13 16:27 ` Tim Deegan
2013-06-17 16:19   ` David Vrabel
2013-06-20 11:15     ` Tim Deegan
2013-06-17 18:28   ` Ross Philipson
2013-06-20 11:05     ` David Vrabel
2013-06-20 11:30     ` Tim Deegan
2013-06-20 14:11       ` Ross Philipson
2013-10-30 14:51 ` David Vrabel
