* [DOC v3] Xen transport for 9pfs
@ 2017-01-06  0:54 Stefano Stabellini
  2017-01-25 17:07 ` Konrad Rzeszutek Wilk
From: Stefano Stabellini @ 2017-01-06  0:54 UTC (permalink / raw)
  To: xen-devel; +Cc: stefano, andrew.cooper3, wei.liu2, andr2000

Changes in v3:
- clarify a few statements
- rename port-<num> to event-channel-<num>
- use grant_ref_t ref[1] instead of ref[]

Changes in v2:
- fix copy/paste error
- rename ring-ref-<num> to ring-ref<num>
- fix memory barriers
- add "verify prod/cons against local copy"
- add a paragraph on high level design
- add a note on the maximum possible max-ring-page-order value
- add mechanisms to avoid unnecessary evtchn notifications

---

# Xen transport for 9pfs version 1 

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent from the communication channel; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports TCP, Unix
sockets, fds, virtio and RDMA.


### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However, it is useful to know that each 9pfs request and response has the
following header:

    struct header {
    	uint32_t size;
    	uint8_t id;
    	uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
The total size of the request or response in bytes, including the header.

- *id*
The 9pfs request or response operation.

- *tag*
Unique id that identifies a specific request/response pair. It is used
to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.
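
For illustration only (this is not part of the transport specification), a
header like the one above could be serialized into a wire buffer as follows;
the little-endian byte order is an assumption taken from the 9P2000 documents
referenced above:

    #include <stdint.h>

    /* Illustrative only: serialize a 9pfs header into a byte buffer,
     * least significant byte first. */
    static void put_9p_header(uint8_t *buf, uint32_t size, uint8_t id,
                              uint16_t tag)
    {
        buf[0] = size & 0xff;
        buf[1] = (size >> 8) & 0xff;
        buf[2] = (size >> 16) & 0xff;
        buf[3] = (size >> 24) & 0xff;
        buf[4] = id;
        buf[5] = tag & 0xff;
        buf[6] = (tag >> 8) & 0xff;
    }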


## Rationale

This document describes a Xen based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and send back responses.

The transport protocol supports multiple rings up to the maximum
supported by the backend. The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design is to exploit
parallelism at the vCPU level and support multiple outstanding requests
simultaneously.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.


## Xenstore

The frontend and the backend connect via xenstore to exchange
information. The toolstack creates front and back nodes with state
[XenbusStateInitialising]. The protocol node name is **9pfs**.

Multiple rings are supported for each frontend and backend connection.

The following are mandatory xenstore nodes for this protocol.


### Frontend XenBus Nodes

    num-rings
         Values:         <uint32_t>
    
         Number of rings. It must be less than or equal to max-rings.
    
    event-channel-<num> (event-channel-0, event-channel-1, etc)
         Values:         <uint32_t>
    
         The identifier of the Xen event channel used to signal activity
         in the ring buffer. One for each ring.
    
    ring-ref<num> (ring-ref0, ring-ref1, etc)
         Values:         <uint32_t>
    
         The Xen grant reference granting permission for the backend to
         map the page containing the information needed to set up a
         shared ring. One for each ring.
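
As an illustrative example only, a frontend advertising two rings might
populate its xenstore area roughly as follows; the path prefix follows the
usual Xen PV device convention and is an assumption of this example, not
something mandated by the nodes above:

    /local/domain/<frontend-id>/device/9pfs/<devid>/num-rings = "2"
    /local/domain/<frontend-id>/device/9pfs/<devid>/event-channel-0 = "15"
    /local/domain/<frontend-id>/device/9pfs/<devid>/event-channel-1 = "16"
    /local/domain/<frontend-id>/device/9pfs/<devid>/ring-ref0 = "8"
    /local/domain/<frontend-id>/device/9pfs/<devid>/ring-ref1 = "9"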

### Backend XenBus Nodes

Backend specific properties, written by the backend, read by the
frontend:

    version
         Values:         <uint32_t>
    
         Protocol version supported by the backend. Currently the value is
         1.
    
    max-rings
         Values:         <uint32_t>
    
         The maximum supported number of rings per frontend.
    
    max-ring-page-order
         Values:         <uint32_t>
    
         The maximum supported size of a memory allocation, expressed as a
         page order (log2 of the number of machine pages), e.g. 0 == 1 page,
         1 == 2 pages, 2 == 4 pages, etc.

Backend configuration nodes, written by the toolstack, read by the
backend:

    path
         Values:         <string>
    
         Host filesystem path to share.
    
    tag
         Values:         <string>
    
         Alphanumeric tag that identifies the 9pfs share. The client needs
         to know the tag to be able to mount it.
    
    security_model
         Values:         "none"
    
         *none*: files are stored using the same credentials with which
                 they are created on the guest.
         Only "none" is supported in this version of the protocol.
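
Putting the backend nodes together, an illustrative (and equally hypothetical)
backend xenstore area could look like this; the path prefix again follows the
usual backend convention and is an assumption of this example:

    /local/domain/<backend-id>/backend/9pfs/<frontend-id>/<devid>/version = "1"
    /local/domain/<backend-id>/backend/9pfs/<frontend-id>/<devid>/max-rings = "4"
    /local/domain/<backend-id>/backend/9pfs/<frontend-id>/<devid>/max-ring-page-order = "9"
    /local/domain/<backend-id>/backend/9pfs/<frontend-id>/<devid>/path = "/srv/guest-share"
    /local/domain/<backend-id>/backend/9pfs/<frontend-id>/<devid>/tag = "share0"
    /local/domain/<backend-id>/backend/9pfs/<frontend-id>/<devid>/security_model = "none"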


### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                      |
    - Publish transport parameters                       |
      that will be in effect during                      V
      this connection.                            XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they share one page per ring,
which is used to set up the ring, and one event channel per ring, which
is used to send notifications.

Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
       XenbusStateClosing

                                       - Unmap grants
                                       - Unbind evtchns
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosing

    - Unbind evtchns
    - Free rings
    - Free data structures
               |
               |
               V
       XenbusStateClosed

                                       - Free remaining data structures
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosed


## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
    	XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
    	XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
    
    	uint32_t ring_order;
        /* this is an array of (1 << ring_order) elements */
    	grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  The order of the data ring. The following list of grant references
  contains `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The array must
  have a power of two number of elements.
- **out** is an array used as circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons**
  is increased by the backend, after reading data from **out**.
- **in_cons** and **in_prod**
  Consumer and producer indices for responses. They keep track of how
  much data has been written by the backend to the **in** array and how
  much has already been consumed by the frontend. **in_prod** is
  increased by the backend, after writing data to **in**. **in_cons** is
  increased by the frontend, after reading data from **in**.
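
The following is a rough sketch of how a frontend might initialize this
structure; `alloc_ring_page()` and `grant_page_to_backend()` are hypothetical
placeholders for the OS-specific page allocation and grant operations, not
real APIs:

    /* Hypothetical frontend-side initialization of one ring's shared page. */
    static void xen_9pfs_init_intf(struct xen_9pfs_intf *intf,
                                   uint32_t ring_order)
    {
        uint32_t i;

        intf->in_cons = intf->in_prod = intf->in_event = 0;
        intf->out_cons = intf->out_prod = intf->out_event = 0;
        intf->ring_order = ring_order;   /* must not exceed max-ring-page-order */

        /* One grant reference per data page; the backend maps the pages
         * contiguously: first half is the "in" array, second half "out". */
        for (i = 0; i < (1u << ring_order); i++) {
            void *page = alloc_ring_page();             /* hypothetical */
            intf->ref[i] = grant_page_to_backend(page); /* hypothetical */
        }
    }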

The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8         12        16        20        24         28
    +---------+---------+---------+---------+---------+---------+----------+
    | in_cons | in_prod |in_event |out_cons |out_prod |out_event|ring_order|
    +---------+---------+---------+---------+---------+---------+----------+

    28        32        36      4092      4096
    +---------+---------+----//---+---------+
    |  ref[0] |  ref[1] |         |  ref[N] |
    +---------+---------+----//---+---------+

**N.B.** For one page, at most 1017 ((4096-28)/4) grant references fit,
but given that their number needs to be a power of two, the actual
maximum is 512. As 512 == (1 << 9), the maximum possible
max-ring-page-order value is 9.
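
The bound can also be checked mechanically; a minimal standalone sketch of
the arithmetic, assuming 4096-byte pages and 4-byte grant references:

    #include <stdio.h>

    int main(void)
    {
        unsigned int space = 4096 - 28;   /* bytes left after the indices */
        unsigned int refs = space / 4;    /* 1017 grant references fit */
        unsigned int order = 0;

        /* round down to a power of two: 512 == 1 << 9 */
        while ((2u << order) <= refs)
            order++;

        printf("max refs %u, max ring order %u\n", 1u << order, order);
        return 0;
    }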

The binary layout of the ring buffers follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+


## Ring Usage

The **in** and **out** arrays are used as circular buffers:
    
    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following functions are provided to read from and write to an array,
handling wrap-around at the end of the circular buffer (the *prod* and
*cons* indices passed in are already masked):

    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    static inline void xen_9pfs_read(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *h, size_t len) {
    	if (*masked_cons < *masked_prod) {
    		memcpy(h, buf + *masked_cons, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
    			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
    			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
    		} else {
    			memcpy(h, buf + *masked_cons, len);
    		}
    	}
        *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
    }
    
    static inline void xen_9pfs_write(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *opaque, size_t len) {
    	if (*masked_prod < *masked_cons) {
    		memcpy(buf + *masked_prod, opaque, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
    			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
    			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod)); 
    		} else {
    			memcpy(buf + *masked_prod, opaque, len); 
    		}
    	}
        *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way (a minimal C sketch follows the list):

- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via evtchn
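
A minimal C sketch of the producer steps above, written for the frontend
filling the **out** array; `virt_mb()`/`virt_wmb()` stand in for the general
and write memory barriers, `notify_via_evtchn()` is a hypothetical placeholder
for the event channel notification, and error handling is reduced to simple
return codes:

    /* Minimal sketch of the producer path for the "out" ring (frontend side).
     * Returns 0 on success, -1 if the request cannot be written yet. */
    static int xen_9pfs_send(struct xen_9pfs_intf *intf, char *out_array,
                             XEN_9PFS_RING_IDX *local_prod,
                             uint8_t *data, size_t len)
    {
        XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;

        cons = intf->out_cons;
        prod = intf->out_prod;
        virt_mb();                      /* general barrier after reading the indices */

        if (prod != *local_prod)        /* the consumer must not change prod */
            return -1;
        if (XEN_9PFS_RING_SIZE - (prod - cons) < len)
            return -1;                  /* not enough free space, try again later */

        masked_prod = MASK_XEN_9PFS_IDX(prod);
        masked_cons = MASK_XEN_9PFS_IDX(cons);
        xen_9pfs_write(out_array, &masked_prod, &masked_cons, data, len);

        virt_wmb();                     /* make the data visible before updating prod */
        intf->out_prod = prod + len;
        *local_prod = prod + len;

        notify_via_evtchn();            /* hypothetical event channel kick */
        return 0;
    }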

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way (see the sketch after the list):

- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via evtchn, if *event* == 1
- general memory barrier 
- read *prod* again from shared memory to check for new requests
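
A matching sketch of the consumer steps above, for the frontend draining the
**in** array; `virt_rmb()` stands in for the read barrier, the other
hypothetical helpers are as in the producer sketch, and `buf` is assumed to be
at least `XEN_9PFS_RING_SIZE` bytes:

    /* Minimal sketch of the consumer path for the "in" ring (frontend side).
     * Returns the number of bytes copied into buf. */
    static size_t xen_9pfs_recv(struct xen_9pfs_intf *intf, char *in_array,
                                XEN_9PFS_RING_IDX *local_cons, uint8_t *buf)
    {
        XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;
        size_t len;

        cons = intf->in_cons;
        prod = intf->in_prod;
        virt_rmb();                     /* read barrier after reading the indices */

        if (cons != *local_cons)        /* the producer must not change cons */
            return 0;
        len = prod - cons;              /* bytes available to consume */
        if (len == 0)
            return 0;

        masked_prod = MASK_XEN_9PFS_IDX(prod);
        masked_cons = MASK_XEN_9PFS_IDX(cons);
        xen_9pfs_read(in_array, &masked_prod, &masked_cons, buf, len);

        virt_mb();                      /* finish reading before releasing the space */
        intf->in_cons = cons + len;
        *local_cons = cons + len;

        if (intf->in_event == 1)
            notify_via_evtchn();        /* only if the producer asked for it */

        virt_mb();
        /* a real consumer would re-read intf->in_prod here for new data */
        return len;
    }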

The producer takes care of writing only as many bytes as available in the buffer
up to *cons*. The consumer takes care of reading only as many bytes as available
in the buffer up to *prod*.

To avoid unnecessary notifications, the consumer only issues an evtchn
notification if the corresponding **event** field (**in_event** or
**out_event**) has been set to **1**. The producer doesn't usually
require any notifications; but when it does, for example because the
ring is full and it is forced to wait, it can request to be notified by
the consumer by setting **in_event** or **out_event**, depending on the
ring. After receiving the notification, the producer can reset the
*event* field.
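
A small sketch of the producer side of this mechanism, for the **out** ring;
`wait_for_evtchn()` is a hypothetical blocking wait on the event channel and
the other helpers are as in the earlier sketches:

    /* Fragment: producer waiting for room on the "out" ring using out_event. */
    while (XEN_9PFS_RING_SIZE - (intf->out_prod - intf->out_cons) < len) {
        intf->out_event = 1;        /* ask the consumer for a notification */
        virt_mb();                  /* make the request visible, then re-check */
        if (XEN_9PFS_RING_SIZE - (intf->out_prod - intf->out_cons) >= len)
            break;                  /* room became available after all */
        wait_for_evtchn();          /* hypothetical blocking wait */
    }
    intf->out_event = 0;            /* reset the request once there is room */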

The producer always notifies the consumer after incrementing **prod**.
However, in some circumstances the producer is allowed to skip the
notification, purely as a performance improvement, and still maintain
correctness. These are the steps to do it: after incrementing *prod*,
the producer reads *cons* a second time; if the value has changed since
the previous read and it is different from the value *prod* had before
the update, then the notification can be avoided. These are the
producer steps, with the optimization (the check itself is sketched
after the list):

- read *prod* (old_prod), *cons* (old_cons) from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- write memory barrier
- increase *prod* (new_prod)
- general memory barrier
- read *cons* (new_cons)
- if new_cons == old_cons or new_cons == old_prod, then notify the
  consumer
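
Continuing the earlier producer sketch, the check can be written as follows,
where `old_prod` and `old_cons` are the values read at the start of the
producer path and `new_cons` is read after the update:

    /* Optional notification-avoidance check, run by the producer after it
     * has advanced prod from old_prod; old_cons was read at the start. */
    virt_mb();
    new_cons = intf->out_cons;          /* read cons a second time */
    if (new_cons == old_cons || new_cons == old_prod)
        notify_via_evtchn();            /* the consumer may have gone idle */
    /* otherwise the consumer is still making progress and will see the data */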


## Request/Response Workflow

The client chooses one of the available rings, then sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.


[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html


* Re: [DOC v3] Xen transport for 9pfs
  2017-01-06  0:54 [DOC v3] Xen transport for 9pfs Stefano Stabellini
@ 2017-01-25 17:07 ` Konrad Rzeszutek Wilk
  2017-01-27 19:59   ` Konrad Rzeszutek Wilk
  2017-01-27 23:19   ` Stefano Stabellini
From: Konrad Rzeszutek Wilk @ 2017-01-25 17:07 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: andr2000, xen-devel, wei.liu2, andrew.cooper3

On Thu, Jan 05, 2017 at 04:54:37PM -0800, Stefano Stabellini wrote:
> Changes in v3:
> - clarify a few statements
> - rename port-<num> to event-channel-<num>
> - use grant_ref_t ref[1] instead of ref[]
> 
> Changes in v2:
> - fix copy/paste error
> - rename ring-ref-<num> to ring-ref<num>
> - fix memory barriers
> - add "verify prod/cons against local copy"
> - add a paragraph on high level design
> - add a note on the maximum possible max-ring-page-order value
> - add mechanisms to avoid unnecessary evtchn notifications
> 
> ---
> 
> # Xen transport for 9pfs version 1 
> 
> ## Background
> 
> 9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
> simple and describes a series of commands and responses. It is
> completely independent from the communication channels, in fact many
> clients and servers support multiple channels, usually called
> "transports". For example the Linux client supports tcp and unix
> sockets, fds, virtio and rdma.
> 
> 
> ### 9pfs protocol
> 
> This document won't cover the full 9pfs specification. Please refer to
> this [paper] and this [website] for a detailed description of it.
> However it is useful to know that each 9pfs request and response has the
> following header:
> 
>     struct header {
>     	uint32_t size;
>     	uint8_t id;
>     	uint16_t tag;
>     } __attribute__((packed));
> 
>     0         4  5    7
>     +---------+--+----+
>     |  size   |id|tag |
>     +---------+--+----+
> 
> - *size*
> The size of the request or response.
> 
> - *id*
> The 9pfs request or response operation.
> 
> - *tag*
> Unique id that identifies a specific request/response pair. It is used
> to multiplex operations on a single channel.
> 
> It is possible to have multiple requests in-flight at any given time.
> 
> 
> ## Rationale
> 
> This document describes a Xen based transport for 9pfs, in the
> traditional PV frontend and backend format. The PV frontend is used by
> the client to send commands to the server. The PV backend is used by the
> 9pfs server to receive commands from clients and send back responses.
> 
> The transport protocol supports multiple rings up to the maximum
> supported by the backend. The size of every ring is also configurable
> and can span multiple pages, up to the maximum supported by the backend
> (although it cannot be more than 2MB). The design is to exploit
> parallelism at the vCPU level and support multiple outstanding requests
> simultaneously.
> 
> This document does not cover the 9pfs client/server design or
> implementation, only the transport for it.

What about Paul's binary protocol..

https://patchwork.kernel.org/patch/7968201/ ?

He was using it as a side-channel for hash functions for the network
stack (Windows -> Linux)? Could that be used?

Or, I suppose, vice versa - could Paul use your work here
instead of this network-specific netif_ctrl?

> 
> 
> ## Xenstore
> 
> The frontend and the backend connect via xenstore to exchange
> information. The toolstack creates front and back nodes with state
> [XenbusStateInitialising]. The protocol node name is **9pfs**.

Since this is an opaque protocol that just pushes data back and forth,
could it be more generic?

[edit: I see now the 9pfs-specific entries, so never mind]
> 
> Multiple rings are supported for each frontend and backend connection.
> 
> The following are mandatory xenstore nodes for this protocol.
> 
> 
> ### Frontend XenBus Nodes

Could I convince you to have the Backend XenBus Nodes first?

> 
>     num-rings
>          Values:         <uint32_t>
>     
>          Number of rings. It needs to be lower or equal to max-rings.
>     
>     event-channel-<num> (event-channel-0, event-channel-1, etc)
>          Values:         <uint32_t>
>     
>          The identifier of the Xen event channel used to signal activity
>          in the ring buffer. One for each ring.
>     
>     ring-ref<num> (ring-ref0, ring-ref1, etc)
>          Values:         <uint32_t>
>     
>          The Xen grant reference granting permission for the backend to
>          map a page with information to setup a share ring. One for each
>          ring.
> 
> ### Backend XenBus Nodes
> 
> Backend specific properties, written by the backend, read by the
> frontend:
> 
>     version
>          Values:         <uint32_t>
>     
>          Protocol version supported by the backend. Currently the value is
>          1.

That is good. But there needs to be a negotiation with the frontend where
the frontend says "well, I want version 2" and we exchange back and forth
on this.

>     
>     max-rings
>          Values:         <uint32_t>
>     
>          The maximum supported number of rings per frontend.
>     
>     max-ring-page-order
>          Values:         <uint32_t>
>     
>          The maximum supported size of a memory allocation in units of
>          lb(machine pages), e.g. 0 == 1 page,  1 = 2 pages, 2 == 4 pages,

s/lb/log/ ?

>          etc.
> 
> Backend configuration nodes, written by the toolstack, read by the
> backend:
> 
>     path
>          Values:         <string>
>     
>          Host filesystem path to share.
>     
>     tag
>          Values:         <string>
>     
>          Alphanumeric tag that identifies the 9pfs share. The client needs
>          to know the tag to be able to mount it.
>     
>     security_model

s/_/-/

>          Values:         "none"
>     
>          *none*: files are stored using the same credentials as they are
>                  created on the guest
>          Only "none" is supported in this version of the protocol.

Is this the 'auth' part of the 9pfs? I somehow thought that would be part
of the 9pfs transport. Or is this something different?

> 
> 
> ### State Machine
> 
> Initialization:
> 
>     *Front*                               *Back*
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.           - Publish backend features
>     - Allocate and initialize the           and transport parameters
>       request ring.                                      |
>     - Publish transport parameters                       |
>       that will be in effect during                      V
>       this connection.                            XenbusStateInitWait
>                  |
>                  |
>                  V
>        XenbusStateInitialised
> 
>                                           - Query frontend transport parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
> 
>      - Query backend device properties.
>      - Finalize OS virtual device
>        instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
> 
> Once frontend and backend are connected, they have a shared page per
> ring, which are used to setup the rings, and an event channel per ring,
> which are used to send notifications.
> 
> Shutdown:
> 
>     *Front*                            *Back*
>     XenbusStateConnected               XenbusStateConnected
>                 |
>                 |
>                 V
>        XenbusStateClosing
> 
>                                        - Unmap grants
>                                        - Unbind evtchns
>                                                  |
>                                                  |
>                                                  V
>                                          XenbusStateClosing
> 
>     - Unbind evtchns
>     - Free rings
>     - Free data structures
>                |
>                |
>                V
>        XenbusStateClosed
> 
>                                        - Free remaining data structures
>                                                  |
>                                                  |
>                                                  V
>                                          XenbusStateClosed
> 
> 
> ## Ring Setup
> 
> The shared page has the following layout:
> 
>     typedef uint32_t XEN_9PFS_RING_IDX;
> 
>     struct xen_9pfs_intf {
>     	XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
>     	XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
>     
>     	uint32_t ring_order;
>         /* this is an array of (1 << ring_order) elements */
>     	grant_ref_t ref[1];
>     };

Could you explain why DEFINE_RING_TYPES would not work?

Could you explain more what '[in|out]_event' is?

Also, would it be possible to re-organize this structure
so there are no cache bounces?


> 
>     /* not actually C compliant (ring_order changes from ring to ring) */
>     struct ring_data {
>         char in[((1 << ring_order) << PAGE_SHIFT) / 2];
>         char out[((1 << ring_order) << PAGE_SHIFT) / 2];
>     };
> 
> - **ring_order**
>   It represents the order of the data ring. The following list of grant
>   references is of `(1 << ring_order)` elements. It cannot be greater than
>   **max-ring-page-order**, as specified by the backend on XenBus.
> - **ref[]**
>   The list of grant references which will contain the actual data. They are
>   mapped contiguosly in virtual memory. The first half of the pages is the
>   **in** array, the second half is the **out** array. The array must
>   have a power of two number of elements.
> - **out** is an array used as circular buffer
>   It contains client requests. The producer is the frontend, the
>   consumer is the backend.
> - **in** is an array used as circular buffer
>   It contains server responses. The producer is the backend, the
>   consumer is the frontend.
> - **out_cons**, **out_prod**
>   Consumer and producer indices for client requests. They keep track of
>   how much data has been written by the frontend to **out** and how much
>   data has already been consumed by the backend. **out_prod** is
>   increased by the frontend, after writing data to **out**. **out_cons**
>   is increased by the backend, after reading data from **out**.
> - **in_cons** and **in_prod**
>   Consumer and producer indices for responses. They keep track of how
>   much data has already been consumed by the frontend from the **in**
>   array. **in_prod** is increased by the backend, after writing data to
>   **in**.  **in_cons** is increased by the frontend, after reading data
>   from **in**.
> 
> The binary layout of `struct xen_9pfs_intf` follows:
> 
>     0         4         8         12        16        20         24       28
>     +---------+---------+---------+---------+---------+----------+---------+
>     | in_cons | in_prod |in_event |out_cons |out_prod |out_event |ring_orde|
>     +---------+---------+---------+---------+---------+----------+---------+
> 
>     28        32        36      4092      4096
>     +---------+---------+----//---+---------+
>     |  ref[0] |  ref[1] |         |  ref[N] |
>     +---------+---------+----//---+---------+
> 
> **N.B** For one page, N is maximum 1017 ((4096-28)/4), but given that N
> needs to be a power of two, actually max N is 512. As 512 == (1 << 9),
> the maximum possible max-ring-page-order value is 9.
> 
> The binary layout of the ring buffers follow:
> 
>     0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
>     +------------//-------------+------------//-------------+
>     |            in             |           out             |
>     +------------//-------------+------------//-------------+
> 
> 
> ## Ring Usage
> 
> The **in** and **out** arrays are used as circular buffers:
>     
>     0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
>     +-----------------------------------+
>     |to consume|    free    |to consume |
>     +-----------------------------------+
>                ^            ^
>                prod         cons
> 
>     0                               sizeof(array)
>     +-----------------------------------+
>     |  free    | to consume |   free    |
>     +-----------------------------------+
>                ^            ^
>                cons         prod
> 
> The following functions are provided to read and write to an array:
> 
>     #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))
> 
>     static inline void xen_9pfs_read(char *buf,
>     		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
>     		uint8_t *h, size_t len) {
>     	if (*masked_cons < *masked_prod) {
>     		memcpy(h, buf + *masked_cons, len);
>     	} else {
>     		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
>     			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
>     			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
>     		} else {
>     			memcpy(h, buf + *masked_cons, len);
>     		}
>     	}
>     	*masked_cons = _MASK_XEN_9PFS_IDX(*masked_cons + len);
>     }
>     
>     static inline void xen_9pfs_write(char *buf,
>     		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
>     		uint8_t *opaque, size_t len) {
>     	if (*masked_prod < *masked_cons) {
>     		memcpy(buf + *masked_prod, opaque, len);
>     	} else {
>     		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
>     			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
>     			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod)); 
>     		} else {
>     			memcpy(buf + *masked_prod, opaque, len); 
>     		}
>     	}
>     	*masked_prod = _MASK_XEN_9PFS_IDX(*masked_prod + len);
>     }
> 
> The producer (the backend for **in**, the frontend for **out**) writes to the
> array in the following way:
> 
> - read *cons*, *prod* from shared memory
> - general memory barrier
> - verify *prod* against local copy (consumer shouldn't change it)
> - write to array at position *prod* up to *cons*, wrapping around the circular
>   buffer when necessary
> - write memory barrier
> - increase *prod*

Which can cause the cacheline to be marked dirty - and if *cons*
is on the same cache line, that means it MUST be evicted from other
CPUs and causes a stall (when reading from *cons*).

I wonder if you can move *cons* and *prod* around so that they
are on separate cache lines?

> - notify the other end via evtchn

event channel.
> 
> The consumer (the backend for **out**, the frontend for **in**) reads from the
> array in the following way:
> 
> - read *prod*, *cons* from shared memory
> - read memory barrier
> - verify *cons* against local copy (producer shouldn't change it)
> - read from array at position *cons* up to *prod*, wrapping around the circular
>   buffer when necessary
> - general memory barrier
> - increase *cons*
> - notify the other end via evtchn, if *event* == 1

/me scratches his head. This looks like something that could be on
the XenBus? Could you explain a bit more why you want
this on the ring instead of some other way?

> - general memory barrier 
> - read *prod* again from shared memory to check for new requests
> 
> The producer takes care of writing only as many bytes as available in the buffer
> up to *cons*. The consumer takes care of reading only as many bytes as available
> in the buffer up to *prod*.
> 
> To avoid unnecessary notifications, the consumer only issues an evtchn
> notification if the **event** field (**in_event** or **out_event**), has
> been set to **1**. In fact the producer doesn't usually require any
> notifications, but if it is necessary, for example because the producer
> is forced to wait because the ring is full, then it can request to be
> notified by the consumer by setting **in_event** or **out_event**,
> depending on the ring. After receiving the notification, the producer
> can reset *event*.

Ok, but why? Why not make it part of XenBus?
> 
> The producer always notifies the consumer after incrementing **prod**.
> However in some circumstances the producer is allowed not to notify the
> consumer, just as a performance improvement, and still maintain
> correctness. These are the steps to do it: after incrementing *prod*,

This seems a bit strange. You are saying 'stop doing interrupts'
at this point (regardless of what is outstanding in the queue).

Perhaps a better approach is to have a CONFIG command on the
ring to change this? It kind of feels weird to have a
configuration knob in the cache-intensive area.

Or you could use the XenBusReconfigure state, which does not mean
you need a full disconnect/connect?


> the producer reads *cons* a second time; if the value is changed from
> the previous read and it is different from *prod* before the update,
> then the notification can be avoided. These are the producer steps, with
> the optimization:
> 
> - read *prod* (old_prod), *cons* (old_cons) from shared memory
> - general memory barrier
> - verify *prod* against local copy (consumer shouldn't change it)
> - write to array at position *prod* up to *cons*, wrapping around the circular
>   buffer when necessary
> - write memory barrier
> - increase *prod* (new_prod)
> - general memory barrier
> - read *cons* (new_cons)
> - if new_cons == old_cons or new_cons == old_prod, then notify the
>   consumer
> 
> 
> ## Request/Response Workflow
> 
> The client chooses one of the available rings, then it sends a request
> to the other end on the *out* array, following the producer workflow
> described in [Ring Usage].
> 
> The server receives the notification and reads the request, following
> the consumer workflow described in [Ring Usage]. The server knows how
> much to read because it is specified in the *size* field of the 9pfs
> header. The server processes the request and sends back a response on
> the *in* array of the same ring, following the producer workflow as
> usual. Thus, every request/response pair is on one ring.
> 
> The client receives a notification and reads the response from the *in*
> array. The client knows how much data to read because it is specified in
> the *size* field of the 9pfs header.

/me nods.
> 
> 
> [paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf

That was educational! I have to love their performance
tests:
"Our test environment is a cluster of identically
configured dual-processor 866 MHz Pentium III
servers, each configured with 256 MB of m"

866MHZ! Woohoo!

> [website]: https://github.com/chaos/diod/blob/master/protocol.md

https://9p.io/magic/man2html/2/fcall
> [XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html


* Re: [DOC v3] Xen transport for 9pfs
  2017-01-25 17:07 ` Konrad Rzeszutek Wilk
@ 2017-01-27 19:59   ` Konrad Rzeszutek Wilk
  2017-01-27 20:26     ` Stefano Stabellini
  2017-01-27 23:19   ` Stefano Stabellini
  1 sibling, 1 reply; 5+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-01-27 19:59 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: andr2000, xen-devel, wei.liu2, andrew.cooper3

> > 
> > ## Ring Setup
> > 
> > The shared page has the following layout:
> > 
> >     typedef uint32_t XEN_9PFS_RING_IDX;
> > 
> >     struct xen_9pfs_intf {
> >     	XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
> >     	XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
> >     
> >     	uint32_t ring_order;
> >         /* this is an array of (1 << ring_order) elements */
> >     	grant_ref_t ref[1];
> >     };
> 
> Could you explain why DEFINE_RING_TYPES would not work?
> 
> Could you explain more what '[in|out]_event' ?
> 
> Also would it be possible to re-organize this structure
> so there are no cache bounces?

This, along with a similar layout in PV Calls, got me thinking.

Why aren't you using the standard old-fashioned DEFINE_RING_TYPES?

And if there is something wrong with it - why not use something
that has been tried and true, like the VirtIO available and used
rings? Those have the idea of 'avail_event' as an index.

And they also provide two nice pages where one is written by
the frontend while the other by the backend, which nicely
takes cache bounces into account.



* Re: [DOC v3] Xen transport for 9pfs
  2017-01-27 19:59   ` Konrad Rzeszutek Wilk
@ 2017-01-27 20:26     ` Stefano Stabellini
From: Stefano Stabellini @ 2017-01-27 20:26 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, xen-devel, wei.liu2, andr2000, andrew.cooper3

On Fri, 27 Jan 2017, Konrad Rzeszutek Wilk wrote:
> > > 
> > > ## Ring Setup
> > > 
> > > The shared page has the following layout:
> > > 
> > >     typedef uint32_t XEN_9PFS_RING_IDX;
> > > 
> > >     struct xen_9pfs_intf {
> > >     	XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
> > >     	XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
> > >     
> > >     	uint32_t ring_order;
> > >         /* this is an array of (1 << ring_order) elements */
> > >     	grant_ref_t ref[1];
> > >     };
> > 
> > Could you explain why DEFINE_RING_TYPES would not work?

They cannot work because they require fixed-length messages.


> > Could you explain more what '[in|out]_event' ?

I'll write more about them in reply to your other email, but they are
just an optimization to remove unnecessary event notifications. In fact
the first version I sent didn't have them
(http://marc.info/?l=xen-devel&m=148046256107032).


> > Also would it be possible to re-organize this structure
> > so there are no cache bounces?

I'll separate in and out indexes.


> This, along with a similar layout in the PV Calls got me thinking.
> 
> Why aren't you using the standard old fashioned DEFINE_RING_TYPES.

They only work with fixed length messages.


> And if there is something wrong with it - why not use something
> that has been tried and true. Like the VirtIO available and used
> ring? That has the idea of 'avail_even't as an index.
> 
> And also it provides two nice pages where one is written by
> the frontend while the other by the backend. Which nicely 
> takes into account cache bounces.

This is something that has been tried and true: it is based on the Xen
console ring with only a few modifications. See
xen/include/public/io/console.h. The [in|out]_event things are optional
optimizations.


* Re: [DOC v3] Xen transport for 9pfs
  2017-01-25 17:07 ` Konrad Rzeszutek Wilk
  2017-01-27 19:59   ` Konrad Rzeszutek Wilk
@ 2017-01-27 23:19   ` Stefano Stabellini
From: Stefano Stabellini @ 2017-01-27 23:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, andr2000, andrew.cooper3, wei.liu2, xen-devel

On Wed, 25 Jan 2017, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 05, 2017 at 04:54:37PM -0800, Stefano Stabellini wrote:
> > Changes in v3:
> > - clarify a few statements
> > - rename port-<num> to event-channel-<num>
> > - use grant_ref_t ref[1] instead of ref[]
> > 
> > Changes in v2:
> > - fix copy/paste error
> > - rename ring-ref-<num> to ring-ref<num>
> > - fix memory barriers
> > - add "verify prod/cons against local copy"
> > - add a paragraph on high level design
> > - add a note on the maximum possible max-ring-page-order value
> > - add mechanisms to avoid unnecessary evtchn notifications
> > 
> > ---
> > 
> > # Xen transport for 9pfs version 1 
> > 
> > ## Background
> > 
> > 9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
> > simple and describes a series of commands and responses. It is
> > completely independent from the communication channels, in fact many
> > clients and servers support multiple channels, usually called
> > "transports". For example the Linux client supports tcp and unix
> > sockets, fds, virtio and rdma.
> > 
> > 
> > ### 9pfs protocol
> > 
> > This document won't cover the full 9pfs specification. Please refer to
> > this [paper] and this [website] for a detailed description of it.
> > However it is useful to know that each 9pfs request and response has the
> > following header:
> > 
> >     struct header {
> >     	uint32_t size;
> >     	uint8_t id;
> >     	uint16_t tag;
> >     } __attribute__((packed));
> > 
> >     0         4  5    7
> >     +---------+--+----+
> >     |  size   |id|tag |
> >     +---------+--+----+
> > 
> > - *size*
> > The size of the request or response.
> > 
> > - *id*
> > The 9pfs request or response operation.
> > 
> > - *tag*
> > Unique id that identifies a specific request/response pair. It is used
> > to multiplex operations on a single channel.
> > 
> > It is possible to have multiple requests in-flight at any given time.
> > 
> > 
> > ## Rationale
> > 
> > This document describes a Xen based transport for 9pfs, in the
> > traditional PV frontend and backend format. The PV frontend is used by
> > the client to send commands to the server. The PV backend is used by the
> > 9pfs server to receive commands from clients and send back responses.
> > 
> > The transport protocol supports multiple rings up to the maximum
> > supported by the backend. The size of every ring is also configurable
> > and can span multiple pages, up to the maximum supported by the backend
> > (although it cannot be more than 2MB). The design is to exploit
> > parallelism at the vCPU level and support multiple outstanding requests
> > simultaneously.
> > 
> > This document does not cover the 9pfs client/server design or
> > implementation, only the transport for it.
> 
> What about the Paul's binary protocol..
> 
> https://patchwork.kernel.org/patch/7968201/ ?
> 
> He was using it as a side-channel for hash functions for the network
> stack (Windows -> Linux)? Could that be used?

I haven't read it in detail, but:

   * The control ring uses a fixed request/response message size and is
   * balanced (i.e. one request to one response)

This protocol needs to be able to handle messages of different sizes.


> Or I suppose vice-versa - could Paul use your work here
> instead of this network specific netif_ctrl?

Yes, this protocol can be used to carry any data. But specific messages
need to be defined for each implementation. In this case, the messages
are defined by 9p.


> > ## Xenstore
> > 
> > The frontend and the backend connect via xenstore to exchange
> > information. The toolstack creates front and back nodes with state
> > [XenbusStateInitialising]. The protocol node name is **9pfs**.
> 
> Since this an opaque protocol to just push data back and forth
> could it be more generic?
> 
> [edit: I see now the 9pfs specific entries so nevermind]

Actually, I think it could be made generic and it is on my todo list.
In all honesty, it is not your fault, but given the speed of review of
new Xen PV protocols, I am wary of adding another one to the queue.


> > Multiple rings are supported for each frontend and backend connection.
> > 
> > The following are mandatory xenstore nodes for this protocol.
> > 
> > 
> > ### Frontend XenBus Nodes
> 
> Could I convience you to have Backend XenBus Nodes first?

Sure


> > 
> >     num-rings
> >          Values:         <uint32_t>
> >     
> >          Number of rings. It needs to be lower or equal to max-rings.
> >     
> >     event-channel-<num> (event-channel-0, event-channel-1, etc)
> >          Values:         <uint32_t>
> >     
> >          The identifier of the Xen event channel used to signal activity
> >          in the ring buffer. One for each ring.
> >     
> >     ring-ref<num> (ring-ref0, ring-ref1, etc)
> >          Values:         <uint32_t>
> >     
> >          The Xen grant reference granting permission for the backend to
> >          map a page with information to setup a share ring. One for each
> >          ring.
> > 
> > ### Backend XenBus Nodes
> > 
> > Backend specific properties, written by the backend, read by the
> > frontend:
> > 
> >     version
> >          Values:         <uint32_t>
> >     
> >          Protocol version supported by the backend. Currently the value is
> >          1.
> 
> That is good. But there needs to be an negotiation with frontend where
> the frontend says "well, I want version 2" and we exchange back and forth
> on this.

Yeah, I'll add something similar to what I did for the latest PVCalls
version.


> >     
> >     max-rings
> >          Values:         <uint32_t>
> >     
> >          The maximum supported number of rings per frontend.
> >     
> >     max-ring-page-order
> >          Values:         <uint32_t>
> >     
> >          The maximum supported size of a memory allocation in units of
> >          lb(machine pages), e.g. 0 == 1 page,  1 = 2 pages, 2 == 4 pages,
> 
> s/lb/log/ ?

I'll make the change


> >          etc.
> > 
> > Backend configuration nodes, written by the toolstack, read by the
> > backend:
> > 
> >     path
> >          Values:         <string>
> >     
> >          Host filesystem path to share.
> >     
> >     tag
> >          Values:         <string>
> >     
> >          Alphanumeric tag that identifies the 9pfs share. The client needs
> >          to know the tag to be able to mount it.
> >     
> >     security_model
> 
> s/_/-/

all right


> >          Values:         "none"
> >     
> >          *none*: files are stored using the same credentials as they are
> >                  created on the guest
> >          Only "none" is supported in this version of the protocol.
> 
> Is this the 'auth' part of the 9pfs? I somehow thought that would be part
> of the 9pfs transport. Or is this something different?

This is a backend configuration. Similarly to other network filesystem
protocols, you can squash/remap user ownership of the files. "none"
means "do not translate or filter uids and gids", which is fine if the
shared filesystem is used by only one virtual machine.


> > ### State Machine
> > 
> > Initialization:
> > 
> >     *Front*                               *Back*
> >     XenbusStateInitialising               XenbusStateInitialising
> >     - Query virtual device                - Query backend device
> >       properties.                           identification data.
> >     - Setup OS device instance.           - Publish backend features
> >     - Allocate and initialize the           and transport parameters
> >       request ring.                                      |
> >     - Publish transport parameters                       |
> >       that will be in effect during                      V
> >       this connection.                            XenbusStateInitWait
> >                  |
> >                  |
> >                  V
> >        XenbusStateInitialised
> > 
> >                                           - Query frontend transport parameters.
> >                                           - Connect to the request ring and
> >                                             event channel.
> >                                                          |
> >                                                          |
> >                                                          V
> >                                                  XenbusStateConnected
> > 
> >      - Query backend device properties.
> >      - Finalize OS virtual device
> >        instance.
> >                  |
> >                  |
> >                  V
> >         XenbusStateConnected
> > 
> > Once frontend and backend are connected, they have a shared page per
> > ring, which are used to setup the rings, and an event channel per ring,
> > which are used to send notifications.
> > 
> > Shutdown:
> > 
> >     *Front*                            *Back*
> >     XenbusStateConnected               XenbusStateConnected
> >                 |
> >                 |
> >                 V
> >        XenbusStateClosing
> > 
> >                                        - Unmap grants
> >                                        - Unbind evtchns
> >                                                  |
> >                                                  |
> >                                                  V
> >                                          XenbusStateClosing
> > 
> >     - Unbind evtchns
> >     - Free rings
> >     - Free data structures
> >                |
> >                |
> >                V
> >        XenbusStateClosed
> > 
> >                                        - Free remaining data structures
> >                                                  |
> >                                                  |
> >                                                  V
> >                                          XenbusStateClosed
> > 
> > 
> > ## Ring Setup
> > 
> > The shared page has the following layout:
> > 
> >     typedef uint32_t XEN_9PFS_RING_IDX;
> > 
> >     struct xen_9pfs_intf {
> >     	XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
> >     	XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
> >     
> >     	uint32_t ring_order;
> >         /* this is an array of (1 << ring_order) elements */
> >     	grant_ref_t ref[1];
> >     };
> 
> Could you explain why DEFINE_RING_TYPES would not work?

The protocol needs to handle messages of multiple sizes.


> Could you explain more what '[in|out]_event' ?

See below.


> Also would it be possible to re-organize this structure
> so there are no cache bounces?

I'll do.

> 
> > 
> >     /* not actually C compliant (ring_order changes from ring to ring) */
> >     struct ring_data {
> >         char in[((1 << ring_order) << PAGE_SHIFT) / 2];
> >         char out[((1 << ring_order) << PAGE_SHIFT) / 2];
> >     };
> > 
> > - **ring_order**
> >   It represents the order of the data ring. The following list of grant
> >   references is of `(1 << ring_order)` elements. It cannot be greater than
> >   **max-ring-page-order**, as specified by the backend on XenBus.
> > - **ref[]**
> >   The list of grant references which will contain the actual data. They are
> >   mapped contiguosly in virtual memory. The first half of the pages is the
> >   **in** array, the second half is the **out** array. The array must
> >   have a power of two number of elements.
> > - **out** is an array used as circular buffer
> >   It contains client requests. The producer is the frontend, the
> >   consumer is the backend.
> > - **in** is an array used as circular buffer
> >   It contains server responses. The producer is the backend, the
> >   consumer is the frontend.
> > - **out_cons**, **out_prod**
> >   Consumer and producer indices for client requests. They keep track of
> >   how much data has been written by the frontend to **out** and how much
> >   data has already been consumed by the backend. **out_prod** is
> >   increased by the frontend, after writing data to **out**. **out_cons**
> >   is increased by the backend, after reading data from **out**.
> > - **in_cons** and **in_prod**
> >   Consumer and producer indices for responses. They keep track of how
> >   much data has already been consumed by the frontend from the **in**
> >   array. **in_prod** is increased by the backend, after writing data to
> >   **in**.  **in_cons** is increased by the frontend, after reading data
> >   from **in**.
> > 
> > The binary layout of `struct xen_9pfs_intf` follows:
> > 
> >     0         4         8         12        16        20         24       28
> >     +---------+---------+---------+---------+---------+----------+---------+
> >     | in_cons | in_prod |in_event |out_cons |out_prod |out_event |ring_orde|
> >     +---------+---------+---------+---------+---------+----------+---------+
> > 
> >     28        32        36      4092      4096
> >     +---------+---------+----//---+---------+
> >     |  ref[0] |  ref[1] |         |  ref[N] |
> >     +---------+---------+----//---+---------+
> > 
> > **N.B** For one page, N is maximum 1017 ((4096-28)/4), but given that N
> > needs to be a power of two, actually max N is 512. As 512 == (1 << 9),
> > the maximum possible max-ring-page-order value is 9.
> > 
> > The binary layout of the ring buffers follow:
> > 
> >     0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
> >     +------------//-------------+------------//-------------+
> >     |            in             |           out             |
> >     +------------//-------------+------------//-------------+
> > 
> > 
> > ## Ring Usage
> > 
> > The **in** and **out** arrays are used as circular buffers:
> >     
> >     0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
> >     +-----------------------------------+
> >     |to consume|    free    |to consume |
> >     +-----------------------------------+
> >                ^            ^
> >                prod         cons
> > 
> >     0                               sizeof(array)
> >     +-----------------------------------+
> >     |  free    | to consume |   free    |
> >     +-----------------------------------+
> >                ^            ^
> >                cons         prod
> > 
> > The following functions are provided to read and write to an array:
> > 
> >     #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))
> > 
> >     static inline void xen_9pfs_read(char *buf,
> >     		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
> >     		uint8_t *h, size_t len) {
> >     	if (*masked_cons < *masked_prod) {
> >     		memcpy(h, buf + *masked_cons, len);
> >     	} else {
> >     		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
> >     			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
> >     			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
> >     		} else {
> >     			memcpy(h, buf + *masked_cons, len);
> >     		}
> >     	}
> >     	*masked_cons = _MASK_XEN_9PFS_IDX(*masked_cons + len);
> >     }
> >     
> >     static inline void xen_9pfs_write(char *buf,
> >     		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
> >     		uint8_t *opaque, size_t len) {
> >     	if (*masked_prod < *masked_cons) {
> >     		memcpy(buf + *masked_prod, opaque, len);
> >     	} else {
> >     		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
> >     			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
> >     			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod)); 
> >     		} else {
> >     			memcpy(buf + *masked_prod, opaque, len); 
> >     		}
> >     	}
> >     	*masked_prod = _MASK_XEN_9PFS_IDX(*masked_prod + len);
> >     }
> > 
> > The producer (the backend for **in**, the frontend for **out**) writes to the
> > array in the following way:
> > 
> > - read *cons*, *prod* from shared memory
> > - general memory barrier
> > - verify *prod* against local copy (consumer shouldn't change it)
> > - write to array at position *prod* up to *cons*, wrapping around the circular
> >   buffer when necessary
> > - write memory barrier
> > - increase *prod*
> 
> Which can cause the cacheline to be marked dirty - and if the '*cons'*
> is on the same cache line that means it is MUST be evicted from other
> CPUs and cause an stall (when reading from *cons*).
> 
> I wonder if you can move *cons* and *prod* around so that they
> are on seperate cache-lines?

Yes, I'll do.


> > - notify the other end via evtchn
> 
> event channel.

OK


> > 
> > The consumer (the backend for **out**, the frontend for **in**) reads from the
> > array in the following way:
> > 
> > - read *prod*, *cons* from shared memory
> > - read memory barrier
> > - verify *cons* against local copy (producer shouldn't change it)
> > - read from array at position *cons* up to *prod*, wrapping around the circular
> >   buffer when necessary
> > - general memory barrier
> > - increase *cons*
> > - notify the other end via evtchn, if *event* == 1
> 
> /me scratches his head. This looks like something that could be on
> the XenBus? Could you explain a bit more of why you want
> this on the ring instead of some other way?

I realize that this event mechanism looks very ad-hoc, but it is just an
optimization to avoid notifying the other end every time. It is
entirely optional. It was suggested by David (not specifically in the
current form). It is not present in the original PV console protocol,
which is what this transport is based on. Personally, I think it would
be fine to omit it in the first version of the protocol; we could add it
back as an extension later with a feature flag.


The idea behind [in|out]_event is very simple: when [in|out]_event is 0,
there is no need to send an event. This is the explanation (also below):

  To avoid unnecessary notifications, the consumer only issues an evtchn
  notification if the **event** field (**in_event** or **out_event**), has
  been set to **1**. In fact the producer doesn't usually require any
  notifications, but if it is necessary, for example because the producer
  is forced to wait because the ring is full, then it can request to be
  notified by the consumer by setting **in_event** or **out_event**,
  depending on the ring. After receiving the notification, the producer
  can reset *event*.

The producer only writes to the ring, so it doesn't need notifications
unless the ring is full. In those cases, it is typically forced to wait.
Rather than busy-waiting, it can set [in|out]_event to 1 and go to
sleep.


> > - general memory barrier 
> > - read *prod* again from shared memory to check for new requests
> > 
> > The producer takes care of writing only as many bytes as available in the buffer
> > up to *cons*. The consumer takes care of reading only as many bytes as available
> > in the buffer up to *prod*.
> > 
> > To avoid unnecessary notifications, the consumer only issues an evtchn
> > notification if the **event** field (**in_event** or **out_event**), has
> > been set to **1**. In fact the producer doesn't usually require any
> > notifications, but if it is necessary, for example because the producer
> > is forced to wait because the ring is full, then it can request to be
> > notified by the consumer by setting **in_event** or **out_event**,
> > depending on the ring. After receiving the notification, the producer
> > can reset *event*.
> 
> Ok, but why? Why not make it part of XenBus?

It changes dynamically depending on the status of the ring. As I wrote,
when there is room available on the ring, the producer will leave
[in|out]_event at 0. If there is no room available and the producer
wants to sleep-wait (instead of busy-wait), it will set [in|out]_event
to 1. This is a performance optimization, and it is meant to be fast: having
the flag on xenstore would defeat its purpose.

Let me repeat that this is entirely optional. The protocol works just
fine without it.


> > The producer always notifies the consumer after incrementing **prod**.
> > However in some circumstances the producer is allowed not to notify the
> > consumer, just as a performance improvement, and still maintain
> > correctness. These are the steps to do it: after incrementing *prod*,
> 
> This seems a bit strange. You are saying 'stop doing interrupts'
> at this point (regardless of what is outstannding in the queue).

Yes, this is another performance improvement for the other direction:
producer to consumer notifications. This also is completely optional and
we could remove it from the doc.


> Perhaps a better approach is to have an CONFIG command on the
> ring to change this? It kind of feels weird to have an
> configuration knob in the cache intensive area.
> 
> Or you could use XenBusReconfigure state which does not mean
> you need a full disconnect/connect?

Like in the other case, it would defeat the purpose. The idea was just
to document cases where notifications can be avoided as a performance
improvement. I realize that it is too early to make such considerations
for a protocol that hasn't even been accepted yet.

The best course of action is probably just to go back to the original
version that didn't have [in|out]_event's and always assumed
notifications from the other end. Simpler and proven to be working fine.


> > the producer reads *cons* a second time; if the value is changed from
> > the previous read and it is different from *prod* before the update,
> > then the notification can be avoided. These are the producer steps, with
> > the optimization:
> > 
> > - read *prod* (old_prod), *cons* (old_cons) from shared memory
> > - general memory barrier
> > - verify *prod* against local copy (consumer shouldn't change it)
> > - write to array at position *prod* up to *cons*, wrapping around the circular
> >   buffer when necessary
> > - write memory barrier
> > - increase *prod* (new_prod)
> > - general memory barrier
> > - read *cons* (new_cons)
> > - if new_cons == old_cons or new_cons == old_prod, then notify the
> >   consumer

