* [RFC] netif: staging grants for requests
@ 2016-12-14 18:11 Joao Martins
  2017-01-04 13:54 ` Wei Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Joao Martins @ 2016-12-14 18:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Wei Liu, David Vrabel, Stefano Stabellini

Hey,

Back in the Xen hackathon '16 networking session there were a couple of ideas
brought up. One of them was about exploring permanently mapped grants between
xen-netback/xen-netfront.

I started experimenting and came up with a rough design document (in pandoc)
describing what I would like to propose. This is meant as a seed for discussion
and a request for input on whether this is a good direction. Of course, I
am willing to try alternatives we come up with beyond the contents of the
spec, or any other suggested changes ;)

Any comments or feedback is welcome!

Cheers,
Joao

---
% Staging grants for network I/O requests
% Joao Martins <<joao.m.martins@oracle.com>>
% Revision 1

\clearpage

--------------------------------------------------------------------
Status: **Experimental**

Architecture(s): x86 and ARM

Component(s): Guest

Hardware: Intel and AMD
--------------------------------------------------------------------

# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a permanently
mapped region to describe the header/linear region of packet buffers. This document
outlines that proposal, covering its motivation and its applicability to other
use-cases, alongside the necessary changes. This proposal is an RFC and also
includes alternative solutions.

The motivation of this work is to eliminate grant operations for packet-I/O
intensive workloads, such as those observed with smaller request sizes (i.e.
<= 256 bytes or <= MTU). Currently on Xen, only bulk transfers (e.g. 32K..64K
packets) perform really well (up to 80 Gbit/s on a few CPUs), which mostly
suits end-hosts and server appliances. Anything that involves higher packet
rates (<= 1500 MTU) or runs without sg performs poorly, close to 1 Gbit/s of
throughput.

# Proposal

The proposal is to leverage the copy already done implicitly from and to the
packet linear data in netfront and netback, and perform it instead from a
permanently mapped region. In some (physical) NICs this is known as header/data split.

For some workloads (e.g. NFV) this would provide a big increase in
throughput when we switch to (zero)copying in the backend/frontend, instead of
the grant hypercalls. Thus this extension aims at futureproofing the netif
protocol by adding the possibility of guests setting up a list of grants that
are set up at device creation and revoked at device teardown - without consuming
too many grant entries in the general case (i.e. covering only the
header region, <= 256 bytes, 16 grants per ring) while remaining configurable by
the kernel when one wants to resort to a copy-based interface as opposed to
grant copy/map.

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the proposed
solution will fit. The security mechanism currently involves grant references,
which in essence are round-robin recycled 'tickets' stamped with the GPFN,
permission attributes, and the authorized domain:

(This is an in-memory view of struct grant_entry_v1):

     0     1     2     3     4     5     6     7 octet
    +------------+-----------+------------------------+
    | flags      | domain id | frame                  |
    +------------+-----------+------------------------+

Where there are N grant entries in a grant table, for example:

    @0:
    +------------+-----------+------------------------+
    | rw         | 0         | 0xABCDEF               |
    +------------+-----------+------------------------+
    | rw         | 0         | 0xFA124                |
    +------------+-----------+------------------------+
    | ro         | 1         | 0xBEEF                 |
    +------------+-----------+------------------------+

      .....
    @N:
    +------------+-----------+------------------------+
    | rw         | 0         | 0x9923A                |
    +------------+-----------+------------------------+

Each entry consumes 8 bytes, therefore 512 entries can fit in one page.
The grant table size is controlled by `gnttab_max_frames`, which defaults to
32 pages, hence 16,384 grants. The ParaVirtualized (PV) drivers will use the
grant reference (the index in the grant table, 0 .. N) in their command ring.
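
For reference, the corresponding C declaration in Xen's public `grant_table.h`
looks roughly like this (`domid_t` is a `uint16_t`):

```c
typedef struct grant_entry_v1 {
    uint16_t flags;  /* GTF_* access/status flags            */
    domid_t  domid;  /* domain authorized to use the frame   */
    uint32_t frame;  /* GPFN of the frame being granted      */
} grant_entry_v1_t;  /* 8 bytes, matching the layout above   */
```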

\clearpage

## Guest Transmit

The view of the shared transmit ring is the following:

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | req_prod               | req_event              |
    +------------------------+------------------------+
    | rsp_prod               | rsp_event              |
    +------------------------+------------------------+
    | pvt                    | pad[44]                |
    +------------------------+                        |
    | ....                                            | [64bytes]
    +------------------------+------------------------+-\
    | gref                   | offset    | flags      | |
    +------------+-----------+------------------------+ +-'struct
    | id         | size      | id        | status     | | netif_tx_sring_entry'
    +-------------------------------------------------+-/
    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
    +-------------------------------------------------+

Each entry consumes 16 octets, therefore 256 entries can fit on one page. `struct
netif_tx_sring_entry` includes both `struct netif_tx_request` (first 12 octets)
and `struct netif_tx_response` (last 4 octets).  Additionally a `struct
netif_extra_info` may overlay the request, in which case the format is:

    +------------------------+------------------------+-\
    | type |flags| type specific data (gso, hash, etc)| |
    +------------+-----------+------------------------+ +-'struct
    | padding for tx         | unused                 | | netif_extra_info'
    +-------------------------------------------------+-/
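
For reference, the request/response layout overlaid above corresponds roughly
to the following declarations in `netif.h` (`grant_ref_t` is a `uint32_t`):

```c
struct netif_tx_request {
    grant_ref_t gref;    /* grant for the (part of the) packet payload     */
    uint16_t    offset;  /* offset within the granted frame                */
    uint16_t    flags;   /* NETTXF_*                                       */
    uint16_t    id;      /* echoed in the corresponding response           */
    uint16_t    size;    /* slot size; total packet size in the first slot */
};                       /* 12 octets */

struct netif_tx_response {
    uint16_t id;
    int16_t  status;     /* NETIF_RSP_*                                    */
};                       /* 4 octets */
```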

In essence the transmission of a packet from the frontend to the backend
network stack goes as follows:

**Frontend**

1) Calculate how many slots are needed for transmitting the packet.
   Fail if there aren't enough slots.

[ Calculation needs to estimate slots taking into account 4k page boundary ]

2) Make first request for the packet.
   The first request contains the whole packet size, checksum info,
   flag whether it contains extra metadata, and if following slots contain
   more data.

3) Put grant in the `gref` field of the tx slot.

4) Set extra info if packet requires special metadata (e.g. GSO size)

5) If there's still data to be granted set flag `NETTXF_more_data` in
request `flags`.

6) Grant remaining packet pages one per slot. (grant boundary is 4k)

7) Fill the resultant grefs in the slots, setting `NETTXF_more_data` on the first N-1 slots.

8) Fill the total packet size in the first request.

9) Set checksum info of the packet (if checksum offload is supported)

10) Update the request producer index (`req_prod`)

11) Check whether backend needs a notification

11.1) Perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__
      depending on the guest type.

**Backend**

12) Backend gets an interrupt and runs its interrupt service routine.

13) Backend checks if there are unconsumed requests

14) Backend consumes a request from the ring

15) Process extra info (e.g. if GSO info was set)

16) Counts all requests for this packet to be processed (while
`NETTXF_more_data` is set) and performs a few validation tests:

16.1) Fail transmission if the total packet size is smaller than the Ethernet
minimum allowed;

  Failing the transmission means filling the `id` of the request and setting a
  `status` of `NETIF_RSP_ERROR` in `struct netif_tx_response`;
  update rsp_prod and finally notify the frontend (through `EVTCHNOP_send`).

16.2) Fail transmission if one of the slots (size + offset) crosses the page
boundary

16.3) Fail transmission if the number of slots is bigger than the spec-defined
maximum (18 slots max in netif.h)

17) Allocate packet metadata

[ *Linux specific*: This structure encompasses a linear data region which
generally accommodates the protocol header and such. Netback allocates up to 128
bytes for that. ]

18) *Linux specific*: Set up a `GNTTABOP_copy` to copy up to 128 bytes into this small
region (the linear part of the skb), *only* from the first slot.

19) Setup GNTTABOP operations to copy/map the packet

20) Perform the `GNTTABOP_copy` (grant copy) and/or `GNTTABOP_map_grant_ref`
    hypercalls.

[ *Linux-specific*: does a copy for the linear region (<=128 bytes) and maps the
         remaining slots as frags for the rest of the data ]

21) Check if the grant operations were successful and fail the transmission if
any of the resultant operation `status` fields were different from `GNTST_okay`.

21.1) If it's a grant-copying backend, produce responses for all the
copied grants as in 16.1). The only difference is that the status is
`NETIF_RSP_OKAY`.

21.2) Update the response producer index (`rsp_prod`)

22) Set up gso info requested by frontend [optional]

23) Set frontend provided checksum info

24) *Linux-specific*: Register destructor callback when packet pages are freed.

25) Call into the network stack.

26) Update `req_event` to `request consumer index + 1` to receive a notification
    on the first produced request from frontend.
    [optional, if backend is polling the ring and never sleeps]

27) *Linux-specific*: Packet destructor callback is called.

27.1) Set up `GNTTABOP_unmap_grant_ref` ops for the designated packet pages.

27.2) Once done, perform the `GNTTABOP_unmap_grant_ref` hypercall. Underlying
this hypercall, a TLB flush of all backend vCPUs is done.

27.3) Produce Tx response like step 21.1) and 21.2)

[*Linux-specific*: A thread is woken for this purpose, and it batches
these unmap operations. The callback just queues another unmap.]

27.4) Check whether frontend requested a notification

27.4.1) If so, Perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__
      depending on the guest type.

**Frontend**

28) Transmit interrupt is raised which signals the packet transmission completion.

29) Transmit completion routine checks for unconsumed responses

30) Processes the responses and revokes the grants provided.

31) Updates `rsp_cons` (response consumer index)

This proposal aims at replacing steps 19), 20) and 21) with a memcpy from a
permanently mapped region. Additionally the backend may choose to use these
permanently mapped pages for the rest of the packet, therefore replacing step
27) (the unmap) and preventing the TLB flush.

Note that a grant copy does the following (in pseudo code):

	rcu_lock(src_domain);
	rcu_lock(dst_domain);

	for (op = gntcopy[0]; op < nr_ops; op++) {
		src_frame = __acquire_grant_for_copy(src_domain, <op.src.gref>);
		/* ^ implies holding a potentially contended per-CPU lock on the
		     remote grant table */
		src_vaddr = map_domain_page(src_frame);

		dst_frame = __get_paged_frame(dst_domain, <op.dst.mfn>);
		dst_vaddr = map_domain_page(dst_frame);

		memcpy(dst_vaddr + <op.dst.offset>,
		       src_vaddr + <op.src.offset>,
		       <op.size>);

		unmap_domain_page(src_frame);
		unmap_domain_page(dst_frame);
	}

	rcu_unlock(src_domain);
	rcu_unlock(dst_domain);

Instead we propose doing a memcpy from a premapped region that the frontend would
have to set up. This region would also let us avoid the tricky case where the
frontend linear region spans a 4K page boundary (which leads to one slot carrying
a tiny amount of data). The disadvantage would be requiring an additional copy in
steps 4) and 6). But even with that, the benefits are still quite big, as the
results hint in section [Performance](#performance).
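
A minimal sketch of the staged copy proposed to replace it (where
`staging_vaddr` and `tx_data_len` are assumptions standing in for the premapped
region and the negotiated per-slot size described later in this document) shows
what goes away: no per-op grant acquisition, no map/unmap of domain pages, no
hypercall:

```c
#include <stdint.h>
#include <string.h>

/* Copy the staged portion of each TX slot straight out of the region that the
 * frontend granted once at device setup. Bytes beyond tx_data_len (if any)
 * would still be fetched from the slot's gref as before. */
static void pull_staged_tx(uint8_t *pkt, const uint8_t *staging_vaddr,
                           size_t tx_data_len, const uint16_t *slot_id,
                           const uint16_t *slot_size, unsigned int nr_slots)
{
    size_t done = 0;

    for (unsigned int i = 0; i < nr_slots; i++) {
        size_t staged = slot_size[i] < tx_data_len ? slot_size[i] : tx_data_len;

        memcpy(pkt + done, staging_vaddr + (size_t)slot_id[i] * tx_data_len,
               staged);
        done += slot_size[i];
    }
}
```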

\clearpage

## Guest Receive

The view of the shared receive ring is the following:

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | req_prod               | req_event              |
    +------------------------+------------------------+
    | rsp_prod               | rsp_event              |
    +------------------------+------------------------+
    | pvt                    | pad[44]                |
    +------------------------+                        |
    | ....                                            | [64bytes]
    +------------------------+------------------------+
    | id         | pad       | gref                   | ->'struct netif_rx_request'
    +------------+-----------+------------------------+
    | id         | offset    | flags     | status     | ->'struct netif_rx_response'
    +-------------------------------------------------+
    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
    +-------------------------------------------------+


Each entry in the ring occupies 16 octets which means a page fits 256 entries.
Additionally a `struct netif_extra_info` may overlay the rx request in which
case the format is:

    +------------------------+------------------------+
    | type |flags| type specific data (gso, hash, etc)| ->'struct netif_extra_info'
    +------------+-----------+------------------------+

Notice the lack of padding; it is not needed on Rx, as the Rx
request boundary is 8 octets.
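
For reference, these correspond roughly to the following declarations in
`netif.h`:

```c
struct netif_rx_request {
    uint16_t    id;      /* echoed in the corresponding response        */
    uint16_t    pad;
    grant_ref_t gref;    /* grant for the incoming frame                */
};                       /* 8 octets */

struct netif_rx_response {
    uint16_t id;
    uint16_t offset;     /* offset of the data within the granted frame */
    uint16_t flags;      /* NETRXF_*                                    */
    int16_t  status;     /* negative: NETIF_RSP_*; otherwise: length    */
};                       /* 8 octets */
```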

In essence the reception of a packet, from the backend to the frontend network
stack, goes as follows:

**Backend**

1) Backend transmit function starts

[*Linux-specific*: This means we take a packet and add it to an internal queue
 (protected by a lock), whereas a separate thread takes it from that queue and
 processes it as in the steps below. This thread has the purpose of
 aggregating as many copies as possible.]

2) Checks if there are enough rx ring slots that can accommodate the packet.

3) Gets a request from the ring for the first data slot and fetches the `gref`
   from it.

4) Create grant copy op from packet page to `gref`.

[ It's up to the backend to choose how it fills this data. E.g. the backend may
  choose to merge as much data as possible from different pages into this single
  gref, similar to mergeable rx buffers in vhost. ]

5) Sets up flags/checksum info on first request.

6) Gets a response from the ring for this data slot.

7) Prefill expected response ring with the request `id` and slot size.

8) Update the request consumer index (`req_cons`)

9) Gets a request from the ring for the first extra info [optional]

10) Sets up extra info (e.g. GSO descriptor) [optional] repeat step 8).

11) Repeat steps 3 through 8 for all packet pages and set `NETRXF_more_data`
   on the first N-1 slots.

12) Perform the `GNTTABOP_copy` hypercall.

13) Check if the grant operation status was incorrect and if so set the `status`
    field of `struct netif_rx_response` to NETIF_RSP_ERROR.

14) Update the response producer index (`rsp_prod`)

**Frontend**

15) Frontend gets an interrupt and runs its interrupt service routine

16) Checks if there are unconsumed responses

17) Consumes a response from the ring (first response for a packet)

18) Revoke the `gref` in the response

19) Consumes extra info response [optional]

20) While the responses have `NETRXF_more_data` set (the first N-1 slots), fetch
    each of them and revoke the designated `gref`.

21) Update the response consumer index (`rsp_cons`)

22) *Linux-specific*: Copy (from first slot gref) up to 256 bytes to the linear
    region of the packet metadata structure (skb). The rest of the pages
    processed in the responses are then added as frags.

23) Set checksum info based on first response flags.

24) Call packet into the network stack.

25) Allocate new pages and any necessary packet metadata structures for new
    requests. These requests will then be used in step 1) and so forth.

26) Update the request producer index (`req_prod`)

27) Check whether backend needs notification:

27.1) If so, Perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__
      depending on the guest type.

28) Update `rsp_event` to `response consumer index + 1` such that frontend
    receive a notification on the first newly produced response.
    [optional, if frontend is polling the ring and never sleeps]

This proposal aims at replacing steps 4), 12) and 22) with a memcpy, allowing the
frontend to instead pull from a permanently mapped region. Furthermore it
allows fast recycling of unused grefs for replenishing the ring.
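
A minimal sketch of the frontend side of that change (all names besides the
netif fields are assumptions):

```c
#include <stdint.h>
#include <string.h>

/* Pull the staged portion of a received slot from the guest's own premapped
 * staging buffer; no per-packet grant has to be revoked before touching it.
 * Any remainder beyond rx_data_len would still arrive through the slot gref. */
static size_t pull_staged_rx(uint8_t *pkt, const uint8_t *slot_staging,
                             int16_t status, size_t rx_data_len)
{
    size_t len = status < 0 ? 0 : (size_t)status;

    if (len > rx_data_len)
        len = rx_data_len;

    memcpy(pkt, slot_staging, len);
    return len;
}
```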

Depending on the implementation, on the receive path we would no longer
need to aggregate as many grant ops as possible (step 1) and could
transmit the packet in the transmit function (e.g. Linux ```ndo_start_xmit```)
as previously proposed
here\[[0](http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html)\].
This would heavily improve efficiency, specifically for smaller packets, which in
turn would decrease RTT, with data being acknowledged much quicker.

\clearpage

# Use Cases and Recent Trends in Networking

Recent trends in networking are aiming heavily at specialized network stacks
and packet I/O engines. Network backends and both partial and full kernel-bypass
packet I/O engines have been appearing from vendors (OpenOnload, RDMA, DPDK, etc.)
and operating systems (FreeBSD netmap, PFRING_DNA). These bring novelty to the
space of NFV and fast packet processing. Some of the techniques used in these
approaches include:

 * Pre-allocated TX/RX buffers/pages
 * Efficient device drivers
 * Memory mapped buffers
 * I/O batching
 * Poll-based methods

In this section we further discuss some of these use-cases and how this
extension can be useful for these scenarios.

While this document initially aimed (i.e. in the hackathon discussion) at points
[3.1](#guest-transmit) and [3.2](#guest-receive), it should be discussed with
extensibility in mind, as reasoned in the [4.1](#netmap-on-freebsd),
[4.2](#dataplane-development-kit-dpdk) and [4.3](#express-data-path-xdp) use
cases. For consideration, I also have working prototypes for
[4.2](#dataplane-development-kit-dpdk) and [4.3](#express-data-path-xdp) showing
really promising results. The whole argument here is to improve the netif
protocol to further advertise and allow recycling grants on the backend _if_
hinted/requested by the frontend.

## Netmap on FreeBSD

Buffers used by netmap-mode drivers in FreeBSD are permanently allocated by
default 2K bytes
\[[1](https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362)\]
and are statically allocated per netmap-mode device and freed on device
teardown. netmap is targetted at network intensive workloads (i.e. to able to
fill up 10/40 Gbit cards for at least packet sizes <= 1500 bytes) so being able
to avoid the grant ops enables FreeBSD network backend to take full take
advantage of netmap which is in tree from the last couple years.  For a netmap
xen-netback, one would map the frontend pages and use indirect buffers feature
on the backend for a vif on top of the netmap-based switch (VALE). VALE is a
software switch (or an engine) based on netmap
\[[2](https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1)\]. It is
really fast, but somewhat limited in features when compared to openvswitch or
Linux bridge. It has a nice concept of modularity decoupling switching logic
(lookup and switch state) and fowarding engine (port locking and engine
algorithm of transversing packets between different ports).

## DataPlane Development Kit (DPDK)

DPDK uses statically allocated buffers like netmap but, more importantly,
it is a complete kernel-bypass framework and its general operation consists
of busy-polling the NIC ring descriptors *without* any interrupts or system
calls. Everything is done in userspace, thus having to make a hypercall to
copy packet buffers heavily deteriorates throughput in the rest of that
pipeline (that is, the rest of the NICs in a given poll-mode thread). Adding
support for this permanently mapped region would avoid the need for making any
system calls and just use data buffers granted by the guest, with the added
benefit of a really fast memcpy without the costs associated with
FPU save/restore (in the backend kernel or hypervisor). Only the notification
would be left in the way, which can be mitigated by having each side
(i.e. each DPDK driver) not set the respective event indexes (rsp_event and
req_event, as mentioned previously).

## eXpress Data Path (XDP)

Starting with 4.8, Linux brings an experimental feature
\[[3](https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf)\]
to receive and/or forward packet buffers at early stages of the
driver RX path. Packets are filtered/forwarded by loading an eBPF (extended
Berkeley Packet Filter) program which the driver runs before any skbs are even
allocated. Some use-cases include an ILA router or a load-balancer, and
future applications could include a programmable switch.

Drivers supporting XDP (currently) require
\[[4](http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data)\]
setting up a bidirectional DMA
region (or, in our case, a granted region with R/W permissions) and reusing packet
pages after the eBPF program is done with them. No unbounded memory allocations are
made (such as queueing with new pages); instead, packet pages are recycled after
BPF is done with them.

This creates an opportunity for a backend to avoid mapping and unmapping grants. The
BPF program could also modify the packet and request it to be forwarded on the same
port, reusing the same RX grant on TX. We should be able to create a
grant region with R/W permissions for the backend, which would be the equivalent
of [4.1](#netmap-on-freebsd), although packets between guests would still involve
a copy.

## Existing page recycle approaches in NICs

Linux (and probably other OSes) follows similar approaches in
network drivers (e.g. ixgbe and mellanox) to address the bottlenecks of the
page allocator and DMA APIs. For example:

1) ixgbe (and probably other Intel cards) will memcpy up to 256 bytes into a
newly allocated packet buffer and recycle half of the page for refilling other
available descriptors
\[[5](http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073)\];

2) other drivers (such as mellanox) choose to allocate pages of order > 1 and
divide them among multiple NIC descriptors
\[[6](http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52)\];

\clearpage

# Wire Performance

This section is a glossary of wire-level numbers to keep in mind.

The minimum Ethernet frame (excluding the PHY-level overhead shown below) is
calculated as:

  Packet = Ethernet Header (14) + Protocol Data Unit (46..1500) = 60 bytes minimum

On the wire it's a bit more:

  Preamble (7) + Start Frame Delimiter (1) + Packet + CRC (4) + Interframe gap (12) = 84 bytes (minimum)

For a given link speed in bits/sec and packet size, the real packet rate is
calculated as:

  Rate = Link-speed / ((Preamble + SFD + Packet + CRC + Interframe gap) * 8)
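
For example, the 14.88 Mpps figure for minimum-sized frames at 10 Gbit/s in the
table below falls out directly (a standalone C check, not part of the spec):

```c
#include <stdio.h>

/* On-wire cost per frame: preamble (7) + SFD (1) + packet incl. CRC + IFG (12) */
static double packet_rate(double link_bps, unsigned int packet_plus_crc)
{
    return link_bps / ((7 + 1 + packet_plus_crc + 12) * 8.0);
}

int main(void)
{
    printf("%.2f Mpps\n", packet_rate(10e9, 64) / 1e6);   /* ~14.88 Mpps */
    return 0;
}
```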

Numbers to keep in mind (packet size excludes PHY layer, though packet rates
disclosed by vendors take those into account, since it's what goes on the
wire):

| Packet + CRC (bytes)   | 10 Gbit/s  |  40 Gbit/s |  100 Gbit/s  |
|------------------------|:----------:|:----------:|:------------:|
| 64                     | 14.88  Mpps|  59.52 Mpps|  148.80 Mpps |
| 128                    |  8.44  Mpps|  33.78 Mpps|   84.46 Mpps |
| 256                    |  4.52  Mpps|  18.11 Mpps|   45.29 Mpps |
| 1500                   |   822  Kpps|   3.28 Mpps|    8.22 Mpps |
| 65535                  |   ~19  Kpps|  76.27 Kpps|  190.68 Kpps |

Caption:  Mpps (Million packets per second) ; Kpps (Kilo packets per second)

\clearpage

# Proposed Extension

## Terminology

`data gref` refers to the reusable/staging grants, whereas `gref` refers to the
ones that are granted/revoked for each packet being sent. `command ring`
refers to the current ring, whereas `data list` refers to the one proposed
below.

## Xenstore

"feature-staging-grants" is a capability to allow the use of a set of
recycled buffers permanently mapped at the device setup. If
advertised, the frontend will grant a region equivalent to ```maximum number
of slots per ring * size of buffer``` where:

 * `maximum number of slots per ring` is the number of request structure
   that can be fit within the command ring. By default a request
   structure consumes 16 bytes. The first 64 bytes of a ring are used by
   producer and consumer indicies.
 * `size of buffer` is the maximum portion of the packet the backend (and
   frontend) have negotiated which will be put for each slot of the
   `data ring`.

This means that for 256 possible packets in the ring, with a chosen size of
256 bytes, the amount of memory to be permanently granted is a region of
64KB, i.e. 16 grefs.
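
As a sketch, the number of staging grants a frontend needs per ring follows
directly (the helper name is just for illustration):

```c
/* Number of 4096-byte staging grants needed for one ring:
 * e.g. 256 slots * 256 bytes = 64KB -> 16 grefs, as stated above. */
static unsigned int staging_grefs(unsigned int ring_slots, unsigned int data_len)
{
    return (ring_slots * data_len + 4095) / 4096;
}
```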

The lack of "feature-staging-grants" or a value of 0 means that it's not
supported. If feature is present then a second entry "staging-grants-sizes"
must exist and it contains the sizes that can be used per slot. To avoid
frontend clobbering with various different values, we limit to a set of fixed
ones semi-colon delimited.

The allowed values are implementation specific. Examples of good values
include: 256 (protocol/header region), 2048 (fits 1 MSS, which is commonly found
in the linear part of Linux packets), 4096 (a grant per packet if feature-sg=0, for
DPDK/XDP/netmap buffers) and 65535 (maximum packet size, i.e.
```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE```, for feature-sg=1).

The individual size covered per entry by the frontend is set through ```tx-data-len```
or ```rx-data-len```, which contain the maximum amount of data for a single packet.
The chosen values of "tx-data-len" and "rx-data-len" may be asymmetrical (hence can
be different between TX and RX) and are validated against this list of valid sizes.
For their use see the [datapath](#datapath-changes) section further below.

	/local/domain/1/device/vif/0/feature-staging-grants = "1"
	/local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
	/local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
	/local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
	/local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
	/local/domain/1/device/vif/0/queue-0/rx-data-len = "256"

The ```tx-data-ref%u``` and ```rx-data-ref%u``` entries describe the list of grants
to be used for each ring. The example above shows a list composed of a single
page, whereas multiple pages would look like:

	tx-data-ref0 = <data-ref-tx0>
	tx-data-ref1 = <data-ref-tx1>
	rx-data-ref0 = <data-ref-rx0>
	rx-data-ref1 = <data-ref-rx1>

Each slot in the `data-ref` list is formatted as follows:

            0     1     2     3     4     5     6     7  octet
         +-----+-----+-----+-----+-----+-----+-----+-----+
         | id        | gref                  | offset    |
         +-----+-----+-----+-----+-----+-----+-----+-----+

	id: the frame identifier.
	gref: grant reference to the frame.
	offset: offset within the frame.
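
A C representation of one slot could look like the following (the structure
name is illustrative, and it has to be packed to match the 8-octet layout above):

```c
struct netif_staging_slot {
    uint16_t id;      /* frame identifier, global across the tx and rx lists */
    uint32_t gref;    /* grant reference to the frame; 0 terminates the list */
    uint16_t offset;  /* offset of this slot's buffer within the frame       */
} __attribute__((packed));
```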

This list has twice as many slots as `tx-ring-ref` or `rx-ring-ref`
respectively would have, and it is set up at device creation and removed at device
teardown, the same as the command ring entries. This way we keep up with ring size
changes, as is expected of the command ring. A hypothetical multi-page
command ring would increase the number of slots and thus this data list would grow
accordingly. The list is terminated by an entry whose ```gref``` field is 0; the
other fields of that entry are ignored.

## Datapath Changes

The packet is copied to/from the mapped data list grefs up to `tx-data-len`
or `rx-data-len` bytes. This means that the buffer (referenced by `gref` from within
the `struct netif_[tx|rx]_request`) has the same data up to `size`. In other
words, the contents of gref[0..size] are replicated in the `data list` at the
corresponding index. Hence netback should ignore up to `size` bytes of the `gref`
when processing, as the `data list` already has its contents.

Values bigger than the 4096 page/grant boundary only have a special meaning for the
backend, namely how much is required to be copied/pulled across the whole
packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
vs 4096 will have the same data list size, but the latter value would lead to
only copying/pulling one gref in the whole packet, whereas the former would be a
copy-only interface for all slots.
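
A sketch of how a backend might apply that per-packet budget (the helper name
and the budget bookkeeping are assumptions, not part of the spec):

```c
/* budget starts as min([tx|rx]-data-len, total packet size); each call returns
 * how many bytes of the current slot come from the data list, the rest of the
 * slot still being described by its gref. */
static size_t staged_bytes_for_slot(size_t slot_size, size_t *budget)
{
    size_t n = slot_size < *budget ? slot_size : *budget;

    *budget -= n;
    return n;
}
```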

## Buffer Identification and Flags

The data list ids must start from 0 and are global and contiguous (across both
lists). A data slot is identified by the ring slot ```id``` field. The resultant
data gref id to be used in the RX data list is computed by subtracting the total
number of tx data grefs from the `struct netif_[tx|rx]_request` ```id```. An
example of the lists layout is below:

```
 [tx-data-ref-0, tx-data-len=256]
 { .id = 0, gref = 512, .offset = 0x0   }
 { .id = 1, gref = 512, .offset = 0x100 }
 { .id = 2, gref = 512, .offset = 0x200 }
 ...
 { .id = 256, gref = 0, .offset = 0x0   }

 [rx-data-ref-0, rx-data-len=4096]
 { .id = 256, gref = 529, .offset = 0x0 }
 { .id = 257, gref = 530, .offset = 0x0 }
 { .id = 258, gref = 531, .offset = 0x0 }
 ...
 { .id = 512, gref = 0,   .offset = 0x0 }
```
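
Following that layout, mapping an RX ring slot `id` to an index into the RX data
list is a simple subtraction (the helper name is illustrative):

```c
/* e.g. slot id 258 with 256 tx data grefs -> RX data list entry 2 */
static unsigned int rx_data_index(uint16_t slot_id, unsigned int nr_tx_data_grefs)
{
    return slot_id - nr_tx_data_grefs;
}
```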

Permissions of RX data grefs are read-write whereas TX data grefs are read-only.

## Zerocopy

The frontend may wish to provide a bigger RX list than TX, and use RX buffers for
transmission in a zerocopy fashion for guests mainly doing forwarding. In such
cases the NETTXF_staging_buffer flag is set in the ```netif_tx_request``` `flags`
field, such that the `gref` field instead designates the `id` of a data gref.

This is only valid when packets are solely described by the staging grants for
the slot packet size being written, or when [tx|rx]-data-len is 4096 (for
feature-sg=0) or 65535 (for feature-sg=1) and thus no new `gref` is needed to
describe the packet payload.
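
A hypothetical backend-side helper (the context structure and
`map_fresh_grant()` are assumptions; only `NETTXF_staging_buffer` and the netif
fields come from this proposal) might look like:

```c
struct staging_ctx {          /* illustrative only */
    void **staging_vaddr;     /* one premapped vaddr per data gref id */
};

void *map_fresh_grant(struct staging_ctx *ctx, uint32_t gref); /* classic path */

/* With NETTXF_staging_buffer set, `gref` carries the id of a data gref already
 * mapped at device setup, so no per-packet map/unmap is needed for that slot. */
static void *tx_slot_buffer(struct staging_ctx *ctx,
                            const struct netif_tx_request *req)
{
    if (req->flags & NETTXF_staging_buffer)
        return ctx->staging_vaddr[req->gref];

    return map_fresh_grant(ctx, req->gref);
}
```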

\clearpage

## Performance

Numbers that give a rough idea of the performance benefits of this extension.
These are Guest <-> Dom0 tests which exercise the communication between backend and
frontend, excluding other bottlenecks in the datapath (the software switch).

```
# grant copy
Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps

# grant copy + grant map (see next subsection)
Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps

# drop at the guest network stack
Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
```

With this extension:
```
# memcpy
data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps

# memcpy + grant map (see next subsection)
data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps

# (drop at the guest network stack)
data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps

# (drop with guest XDP_DROP prog)
data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
```

Latency measurements (netperf TCP_RR request size 1 and response size 1):
```
24 KTps vs 28 KTps
39 KTps vs 50 KTps (with kernel busy poll)
```

TCP bulk transfer measurements aren't showing a representative increase in
maximum throughput (sometimes ~10%), but rather fewer retransmissions and
more stability. This is probably because of a slight decrease in RTT
(i.e. the receiver acknowledging data quicker). I am currently exploring
other data list sizes and will probably have a better idea of the effects of
this.

## Linux grant copy vs map remark

Based on the numbers above, there is a sudden 2x performance drop when we switch from
grant copying to also grant mapping the `gref`: 1 295 001 vs 577 782 pps for 256- and
260-byte packets respectively. This is all the more visible when replacing the grant
copy with a memcpy in this extension (3 248 392 vs 588 428). While there have been
discussions about avoiding the TLB flush on unmap, one could wonder what the
threshold of that improvement would be. Chances are that this is the least of
our concerns on a fully populated host (or an oversubscribed one). Would
it be worth experimenting with increasing the threshold of the copy beyond the
header?

\clearpage

# References

[0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html

[1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362

[2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1

[3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

[4]
http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data

[5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073

[6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52

# History

A table of changes to the document, in chronological order.

------------------------------------------------------------------------
Date       Revision Version  Notes
---------- -------- -------- -------------------------------------------
2016-12-14 1        Xen 4.9  Initial version.
---------- -------- -------- -------------------------------------------

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [RFC] netif: staging grants for requests
  2016-12-14 18:11 [RFC] netif: staging grants for requests Joao Martins
@ 2017-01-04 13:54 ` Wei Liu
  2017-01-05 20:27   ` Joao Martins
  2017-01-04 19:40 ` Stefano Stabellini
  2017-01-06  9:33 ` Paul Durrant
  2 siblings, 1 reply; 17+ messages in thread
From: Wei Liu @ 2017-01-04 13:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Paul Durrant,
	David Vrabel, xen-devel

Hey!

Thanks for writing this detailed document!

On Wed, Dec 14, 2016 at 06:11:12PM +0000, Joao Martins wrote:
> Hey,
> 
> Back in the Xen hackaton '16 networking session there were a couple of ideas
> brought up. One of them was about exploring permanently mapped grants between
> xen-netback/xen-netfront.
> 
> I started experimenting and came up with sort of a design document (in pandoc)
> on what it would like to be proposed. This is meant as a seed for discussion
> and also requesting input to know if this is a good direction. Of course, I
> am willing to try alternatives that we come up beyond the contents of the
> spec, or any other suggested changes ;)
> 
> Any comments or feedback is welcome!
> 
> Cheers,
> Joao
> 
> ---
> % Staging grants for network I/O requests
> % Joao Martins <<joao.m.martins@oracle.com>>
> % Revision 1
> 
> \clearpage
> 
> --------------------------------------------------------------------
> Status: **Experimental**
> 
> Architecture(s): x86 and ARM
> 

Any.

> Component(s): Guest
> 
> Hardware: Intel and AMD

No need to specify this.

> --------------------------------------------------------------------
> 
> # Background and Motivation
> 

I skimmed through the middle -- I think you description of transmissions
in both directions is accurate.

The proposal to replace some steps with explicit memcpy is also
sensible.

> \clearpage
> 
> ## Performance
> 
> Numbers that give a rough idea on the performance benefits of this extension.
> These are Guest <-> Dom0 which test the communication between backend and
> frontend, excluding other bottlenecks in the datapath (the software switch).
> 
> ```
> # grant copy
> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
> 
> # grant copy + grant map (see next subsection)
> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
> 
> # drop at the guest network stack
> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> ```
> 
> With this extension:
> ```
> # memcpy
> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps

This basically means we can almost get line rate for 10Gb link.

It is already a  good result. I'm interested in knowing if there is
possibility to approach 40 or 100 Gb/s?  It would be good if we design
this extension with higher goals in mind.


> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> 
> # memcpy + grant map (see next subsection)
> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
> 
> # (drop at the guest network stack)
> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
> 
> # (drop with guest XDP_DROP prog)
> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
> ```
> 
> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> ```
> 24 KTps vs 28 KTps
> 39 KTps vs 50 KTps (with kernel busy poll)
> ```
> 
> TCP Bulk transfer measurements aren't showing a representative increase on
> maximum throughput (sometimes ~10%), but rather less retransmissions and
> more stable. This is probably because of being having a slight decrease in rtt
> time (i.e. receiver acknowledging data quicker). Currently trying exploring
> other data list sizes and probably will have a better idea on the effects of
> this.
> 
> ## Linux grant copy vs map remark
> 
> Based on numbers above there's a sudden 2x performance drop when we switch from
> grant copy to also grant map the ` gref`: 1 295 001 vs  577 782 for 256 and 260
> packets bytes respectivally. Which is all the more visible when removing the grant
> copy with memcpy in this extension (3 248 392 vs 588 428). While there's been
> discussions of avoid the TLB unflush on unmap, one could wonder what the
> threshold of that improvement would be. Chances are that this is the least of
> our concerns in a fully poppulated host (or with an oversubscribed one). Would
> it be worth experimenting increasing the threshold of the copy beyond the
> header?
> 

Yes, it would be interesting to see more data points and provide
sensible default. But I think this is secondary goal because "sensible
default" can change overtime and on different environments.

> \clearpage
> 
> # References
> 
> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
> 
> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
> 
> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
> 
> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
> 
> [4]
> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
> 
> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
> 
> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
> 
> # History
> 
> A table of changes to the document, in chronological order.
> 
> ------------------------------------------------------------------------
> Date       Revision Version  Notes
> ---------- -------- -------- -------------------------------------------
> 2016-12-14 1        Xen 4.9  Initial version.
> ---------- -------- -------- -------------------------------------------

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [RFC] netif: staging grants for requests
  2016-12-14 18:11 [RFC] netif: staging grants for requests Joao Martins
  2017-01-04 13:54 ` Wei Liu
@ 2017-01-04 19:40 ` Stefano Stabellini
  2017-01-05 11:54   ` Wei Liu
  2017-01-05 20:27   ` Joao Martins
  2017-01-06  9:33 ` Paul Durrant
  2 siblings, 2 replies; 17+ messages in thread
From: Stefano Stabellini @ 2017-01-04 19:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Paul Durrant,
	David Vrabel, xen-devel

On Wed, 14 Dec 2016, Joao Martins wrote:
> # Proposed Extension
> 
> ## Terminology
> 
> `data gref` refers to the reusable/staging grants, whereas `gref` are the
> ones that are granted/revoked for each packet being sent. `command ring`
> refers to the current ring, whereas `data list` refers to the one proposed
> below.
> 
> ## Xenstore
> 
> "feature-staging-grants" is a capability to allow the use of a set of
> recycled buffers permanently mapped at the device setup. If
> advertised, the frontend will grant a region equivalent to ```maximum number
> of slots per ring * size of buffer``` where:
> 
>  * `maximum number of slots per ring` is the number of request structure
>    that can be fit within the command ring. By default a request
>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
>    producer and consumer indicies.

This means that by default the number of slots is (4096-64)/16 = 252,
correct?


>  * `size of buffer` is the maximum portion of the packet the backend (and
>    frontend) have negotiated which will be put for each slot of the
>    `data ring`.
> 
> Which means that for 256 possible packets in ring with 256 bytes of
> chosen size the amount of memory to be permanently granted is a region of
> 64KB i.e. 16 grefs.
> 
> The lack of "feature-staging-grants" or a value of 0 means that it's not
> supported. If feature is present then a second entry "staging-grants-sizes"
> must exist and it contains the sizes that can be used per slot. To avoid
> frontend clobbering with various different values, we limit to a set of fixed
> ones semi-colon delimited.
> 
> The allowed values are implementation specific. Examples of good values
> include: 256 (protocol/header region), 2048 (fits 1 MSS which is common to be
> in the linear part of linux packets), 4096 (grant per packet if feature-sg=0, for
> DPDK/XDP/netmap buffers) and 65535 (maximum packet size i.e.
> ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).

I am not following. So far we have been talking about two values:
- maximum number of slots per ring
- size of buffer
We have also discussed the number of bytes used for indexes at the
beginning of the command ring, which is 64 bytes.

But now you are introducing a few other values:
- protocol/header region
- fits 1 MSS which is common to be in the linear part of linux packets
- grant per packet
- maximum packet size (I take this is the same as size of buffer)

Could you please described what these values represent in more details
and how they relate to the two previous values?


> Individual size covered per entry by frontend is through ```tx-data-len``` or
> ```rx-data-len``` which contains maximum amount of data on a single packet.
> Chosen values of "tx-data-len" and "rx-data-len" are asymmetrical (hence can
> be different between TX and RX) are validated against this list of valid sizes.
> For it's use see [datapath](#datapath-changes) section further below.
> 
> 	/local/domain/1/device/vif/0/feature-staging-grants = "1"
> 	/local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"

I don't get what the 4 numbers mean. Are they possible supported options
for "tx-data-len" and "rx-data-len", which in turn are the "size of
buffer" as described above?

If so, shouldn't they be written by the backend to backend paths? Why
only 4 options? Could there be more? What is the max number of allowed
option? And what is the minimum?


> 	/local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
> 	/local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
> 	/local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
> 	/local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
> 
> The ```tx-data-ref%u``` and ```rx-data-ref%u``` describe the list of grants to be
> used for each ring. The example above exemplifies a list composed of a single

For what ring, the data ring?


> page whereas multiple pages would be as:
> 
> 	tx-data-ref0 = <data-ref-tx0>
> 	tx-data-ref1 = <data-ref-tx1>
> 	rx-data-ref0 = <data-ref-rx0>
> 	rx-data-ref1 = <data-ref-rx1>
> 
> Each slot in the `data-ref` list is formatted as following:
> 
>             0     1     2     3     4     5     6     7  octet
>          +-----+-----+-----+-----+-----+-----+-----+-----+
>          | id        | gref                  | offset    |
>          +-----+-----+-----+-----+-----+-----+-----+-----+
> 
> 	id: the frame identifier.

What is a "frame identifier"?


> 	gref: grant reference to the frame.
> 	offset: offset within the frame.

I assumed that each grant used for the data ring would be used in its
entirety. Does it really make sense to create a ring using portions of
pages? Given that we can only share memory at a page granularity, I
don't think it makes sense: the rest of the page would still be
accessible from the backend, which could compromise its content. I would
share a list of pages that can be mapped contiguously to create a data
ring.


> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
> respectively, and it is set up at device creation and removed at device
> teardown, same as the command ring entries. This way we keep up with ring size
> changes as it it expected to be in the command ring. A hypothetical multi-page
                ^ repetition

> command ring would increase number of slots and thus this data list would grow
> accordingly. List is terminated by an entry which ```gref``` field is 0, having
> ignored the other fields of this specific entry.

Could you please explain a bit more how the command ring increases the
number of slot and why we need to allocate twice the number of required
slots at the beginning?



> ## Datapath Changes
> 
> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In other
> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. Hence
> netback should ignore up to `size` of the `gref` when processing as the
> `data-ring` has the contents of it.

My lack of netfront/netback knowledge might show, but I am not
following. What do gref, size and idx represent in this context? Please
explain the new workflow in more details.


> Values bigger then the 4096 page/grant boundary only have special meaning for
> backend being how much it is required to be copied/pulled across the whole
> packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
> vs 4096 will have the same data list size and the latter value would lead to
> only copy/pull one gref in the whole packet, whereas the former will be a
> copy-only interface for all slots.
> 
> ## Buffer Identification and Flags
> 
> The data list ids must start from 0 and are global and continguous (across both
> lists).  Data slot is identified by ring slot ```id``` field. Resultant data
> gref id to be used in RX data list is computed by subtract of `struct
> netif_[tx|rx]_request` ```id``` from total amount of tx data grefs.Example of
> the lists layout  below:

What is the "resultant data gref id"? What is the "RX data list"?
Please explain in more details.


> ```
>  [tx-data-ref-0, tx-data-len=256]
>  { .id = 0, gref = 512, .offset = 0x0   }
>  { .id = 1, gref = 512, .offset = 0x100 }
>  { .id = 2, gref = 512, .offset = 0x200 }
>  ...
>  { .id = 256, gref = 0, .offset = 0x0   }
> 
>  [rx-data-ref-0, rx-data-len=4096]
>  { .id = 256, gref = 529, .offset = 0x0 }
>  { .id = 257, gref = 530, .offset = 0x0 }
>  { .id = 258, gref = 531, .offset = 0x0 }
>  ...
>  { .id = 512, gref = 0,   .offset = 0x0 }
> ```
> 
> Permissions of RX data grefs are read-write whereas TX data grefs is read-only.
> 
> ## Zerocopy
> 
> Frontend may wish to provide a bigger RX list than TX, and use RX buffers for
> transmission in a zerocopy fashion for guests mainly doing forwarding. In such
> cases backend set NETTXF_staging_buffer flag in ```netif_tx_request``` flags
> field such that `gref` field instead designates the `id` of a data grefs.
> 
> This is only valid when packets are solely described by the staging grants for
> the slot packet size being written. Or when [tx|rx]-data-len is 4096 (for
> feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is needed for
> describing the packet payload.
> 
> \clearpage
> 
> ## Performance
> 
> Numbers that give a rough idea on the performance benefits of this extension.
> These are Guest <-> Dom0 which test the communication between backend and
> frontend, excluding other bottlenecks in the datapath (the software switch).
> 
> ```
> # grant copy
> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
> 
> # grant copy + grant map (see next subsection)
> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
> 
> # drop at the guest network stack
> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> ```
> 
> With this extension:
> ```
> # memcpy
> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> 
> # memcpy + grant map (see next subsection)
> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
> 
> # (drop at the guest network stack)
> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
> 
> # (drop with guest XDP_DROP prog)
> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
> ```

Very nice!


> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> ```
> 24 KTps vs 28 KTps
> 39 KTps vs 50 KTps (with kernel busy poll)
> ```
> 
> TCP Bulk transfer measurements aren't showing a representative increase on
> maximum throughput (sometimes ~10%), but rather less retransmissions and
> more stable. This is probably because of being having a slight decrease in rtt
> time (i.e. receiver acknowledging data quicker). Currently trying exploring
> other data list sizes and probably will have a better idea on the effects of
> this.

This is strange. By TCP Bulk transfers, do you mean iperf? From my
pvcalls experience, I would expect a great improvement there too.


> ## Linux grant copy vs map remark
> 
> Based on numbers above there's a sudden 2x performance drop when we switch from
> grant copy to also grant map the ` gref`: 1 295 001 vs  577 782 for 256 and 260
> packets bytes respectivally. Which is all the more visible when removing the grant
> copy with memcpy in this extension (3 248 392 vs 588 428). While there's been
> discussions of avoid the TLB unflush on unmap, one could wonder what the
> threshold of that improvement would be. Chances are that this is the least of
> our concerns in a fully poppulated host (or with an oversubscribed one). Would
> it be worth experimenting increasing the threshold of the copy beyond the
> header?

+1


> \clearpage
> 
> # References
> 
> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
> 
> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
> 
> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
> 
> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
> 
> [4]
> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
> 
> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
> 
> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
> 
> # History
> 
> A table of changes to the document, in chronological order.
> 
> ------------------------------------------------------------------------
> Date       Revision Version  Notes
> ---------- -------- -------- -------------------------------------------
> 2016-12-14 1        Xen 4.9  Initial version.
> ---------- -------- -------- -------------------------------------------
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [RFC] netif: staging grants for requests
  2017-01-04 19:40 ` Stefano Stabellini
@ 2017-01-05 11:54   ` Wei Liu
  2017-01-05 20:27   ` Joao Martins
  1 sibling, 0 replies; 17+ messages in thread
From: Wei Liu @ 2017-01-05 11:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Andrew Cooper, Paul Durrant, David Vrabel, xen-devel,
	Joao Martins

On Wed, Jan 04, 2017 at 11:40:21AM -0800, Stefano Stabellini wrote:
> On Wed, 14 Dec 2016, Joao Martins wrote:
> > # Proposed Extension
> > 
> > ## Terminology
> > 
> > `data gref` refers to the reusable/staging grants, whereas `gref` are the
> > ones that are granted/revoked for each packet being sent. `command ring`
> > refers to the current ring, whereas `data list` refers to the one proposed
> > below.
> > 
> > ## Xenstore
> > 
> > "feature-staging-grants" is a capability to allow the use of a set of
> > recycled buffers permanently mapped at the device setup. If
> > advertised, the frontend will grant a region equivalent to ```maximum number
> > of slots per ring * size of buffer``` where:
> > 
> >  * `maximum number of slots per ring` is the number of request structure
> >    that can be fit within the command ring. By default a request
> >    structure consumes 16 bytes. The first 64 bytes of a ring are used by
> >    producer and consumer indicies.
> 
> This means that by default the number of slots is (4096-64)/16 = 252,
> correct?
> 

The number needs to be rounded down to power of 2 IIRC, so it would be
128.

Wei.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [RFC] netif: staging grants for requests
  2017-01-04 13:54 ` Wei Liu
@ 2017-01-05 20:27   ` Joao Martins
  0 siblings, 0 replies; 17+ messages in thread
From: Joao Martins @ 2017-01-05 20:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: xen-devel, Paul Durrant, Stefano Stabellini, David Vrabel, Andrew Cooper

On 01/04/2017 01:54 PM, Wei Liu wrote:
> Hey!
Hey!

> Thanks for writing this detailed document!
Thanks a lot for the review and comments!

> 
> On Wed, Dec 14, 2016 at 06:11:12PM +0000, Joao Martins wrote:
>> Hey,
>>
>> Back in the Xen hackaton '16 networking session there were a couple of ideas
>> brought up. One of them was about exploring permanently mapped grants between
>> xen-netback/xen-netfront.
>>
>> I started experimenting and came up with sort of a design document (in pandoc)
>> on what it would like to be proposed. This is meant as a seed for discussion
>> and also requesting input to know if this is a good direction. Of course, I
>> am willing to try alternatives that we come up beyond the contents of the
>> spec, or any other suggested changes ;)
>>
>> Any comments or feedback is welcome!
>>
>> Cheers,
>> Joao
>>
>> ---
>> % Staging grants for network I/O requests
>> % Joao Martins <<joao.m.martins@oracle.com>>
>> % Revision 1
>>
>> \clearpage
>>
>> --------------------------------------------------------------------
>> Status: **Experimental**
>>
>> Architecture(s): x86 and ARM
>>
> 
> Any.
OK.

> 
>> Component(s): Guest
>>
>> Hardware: Intel and AMD
> 
> No need to specify this.
OK.

> 
>> --------------------------------------------------------------------
>>
>> # Background and Motivation
>>
> 
> I skimmed through the middle -- I think you description of transmissions
> in both directions is accurate.
> 
> The proposal to replace some steps with explicit memcpy is also
> sensible.
Glad to hear that!

> 
>> \clearpage
>>
>> ## Performance
>>
>> Numbers that give a rough idea on the performance benefits of this extension.
>> These are Guest <-> Dom0 which test the communication between backend and
>> frontend, excluding other bottlenecks in the datapath (the software switch).
>>
>> ```
>> # grant copy
>> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
>> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
>> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
>> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
>>
>> # grant copy + grant map (see next subsection)
>> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
>> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
>>
>> # drop at the guest network stack
>> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
>> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
>> ```
>>
>> With this extension:
>> ```
>> # memcpy
>> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
>> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> 
> This basically means we can almost get line rate for 10Gb link.
> 
> It is already a  good result. I'm interested in knowing if there is
> possibility to approach 40 or 100 Gb/s?
Certainly, so with bulk transfer we can already saturate a 40 Gbit/s NIC,
sending out from a guest to an external host. I got ~80 Gbit/s too but between
guests on the same host (some time ago back in xen 4.7). 100 Gbit/s is also on
my radar.

The problem comes with smaller packets <= MTU (and request/response workloads
with small payloads), and that is where we lack performance. Especially for
workloads with very small packets, Linux has a hard time saturating those NICs
(with XDP now rising to the challenge); I think only DPDK is able to at this
point [*].

[*] Section 7.1,
https://download.01.org/packet-processing/ONPS2.1/Intel_ONP_Release_2.1_Performance_Test_Report_Rev1.0.pdf

> It would be good if we design this extension with higher goals in mind.
Totally agree!

>> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
>>
>> # memcpy + grant map (see next subsection)
>> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
>> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
>>
>> # (drop at the guest network stack)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
>>
>> # (drop with guest XDP_DROP prog)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
>> ```
>>
>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
>> ```
>> 24 KTps vs 28 KTps
>> 39 KTps vs 50 KTps (with kernel busy poll)
>> ```
>>
>> TCP Bulk transfer measurements aren't showing a representative increase on
>> maximum throughput (sometimes ~10%), but rather less retransmissions and
>> more stable. This is probably because of being having a slight decrease in rtt
>> time (i.e. receiver acknowledging data quicker). Currently trying exploring
>> other data list sizes and probably will have a better idea on the effects of
>> this.
>>
>> ## Linux grant copy vs map remark
>>
>> Based on numbers above there's a sudden 2x performance drop when we switch from
>> grant copy to also grant map the ` gref`: 1 295 001 vs  577 782 for 256 and 260
>> packets bytes respectivally. Which is all the more visible when removing the grant
>> copy with memcpy in this extension (3 248 392 vs 588 428). While there's been
>> discussions of avoid the TLB unflush on unmap, one could wonder what the
>> threshold of that improvement would be. Chances are that this is the least of
>> our concerns in a fully poppulated host (or with an oversubscribed one). Would
>> it be worth experimenting increasing the threshold of the copy beyond the
>> header?
>>
> 
> Yes, it would be interesting to see more data points and provide
> sensible default. But I think this is secondary goal because "sensible
> default" can change overtime and on different environments.
Indeed; I am experimenting with more data points and other workloads to add here.

>> \clearpage
>>
>> # References
>>
>> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
>>
>> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
>>
>> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
>>
>> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>>
>> [4]
>> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
>>
>> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
>>
>> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
>>
>> # History
>>
>> A table of changes to the document, in chronological order.
>>
>> ------------------------------------------------------------------------
>> Date       Revision Version  Notes
>> ---------- -------- -------- -------------------------------------------
>> 2016-12-14 1        Xen 4.9  Initial version.
>> ---------- -------- -------- -------------------------------------------

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] netif: staging grants for requests
  2017-01-04 19:40 ` Stefano Stabellini
  2017-01-05 11:54   ` Wei Liu
@ 2017-01-05 20:27   ` Joao Martins
  2017-01-06  0:30     ` Stefano Stabellini
  1 sibling, 1 reply; 17+ messages in thread
From: Joao Martins @ 2017-01-05 20:27 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Paul Durrant, Wei Liu, David Vrabel, Andrew Cooper

Hey Stefano,

Thanks a lot for the review and the comments!!

On 01/04/2017 07:40 PM, Stefano Stabellini wrote:
> On Wed, 14 Dec 2016, Joao Martins wrote:
>> # Proposed Extension
>>
>> ## Terminology
>>
>> `data gref` refers to the reusable/staging grants, whereas `gref` are the
>> ones that are granted/revoked for each packet being sent. `command ring`
>> refers to the current ring, whereas `data list` refers to the one proposed
>> below.
>>
>> ## Xenstore
>>
>> "feature-staging-grants" is a capability to allow the use of a set of
>> recycled buffers permanently mapped at the device setup. If
>> advertised, the frontend will grant a region equivalent to ```maximum number
>> of slots per ring * size of buffer``` where:
>>
>>  * `maximum number of slots per ring` is the number of request structure
>>    that can be fit within the command ring. By default a request
>>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
>>    producer and consumer indicies.
> 
> This means that by default the number of slots is (4096-64)/16 = 252,
> correct?
That would be correct, but I made a few mistakes in the writing, despite a
following paragraph mentioning the correct ring size for both TX and RX.
The ring slot size is determined by the size of the request or response (being a
union of the request and response structures). This means the ring slot would be
either 8 octets for the RX ring or 12 octets for the TX ring, which in effect
translates to:

RX := (4096 - 64) / 8 = 504
TX := (4096 - 64) / 12 = 336

And as Wei correctly mentions, both values are rounded down to a power of two,
which results in 256 slots for each ring. I will fix this in the spec.
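
As a quick illustration (not part of the spec; the names are mine), the
calculation works out like this:

```
#include <stdio.h>

#define PAGE_SIZE     4096
#define RING_HDR_SIZE 64   /* producer/consumer indices at the head of the ring */

/* Round down to the nearest power of two, as the standard ring macros do. */
static unsigned int rounddown_pow_of_two(unsigned int x)
{
    unsigned int r = 1;
    while (r * 2 <= x)
        r *= 2;
    return r;
}

int main(void)
{
    unsigned int rx = (PAGE_SIZE - RING_HDR_SIZE) / 8;   /* 8-octet RX slots  -> 504 */
    unsigned int tx = (PAGE_SIZE - RING_HDR_SIZE) / 12;  /* 12-octet TX slots -> 336 */

    printf("RX slots: %u\n", rounddown_pow_of_two(rx));  /* 256 */
    printf("TX slots: %u\n", rounddown_pow_of_two(tx));  /* 256 */
    return 0;
}
```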

>>  * `size of buffer` is the maximum portion of the packet the backend (and
>>    frontend) have negotiated which will be put for each slot of the
>>    `data ring`.
>>
>> Which means that for 256 possible packets in ring with 256 bytes of
>> chosen size the amount of memory to be permanently granted is a region of
>> 64KB i.e. 16 grefs.
>>
>> The lack of "feature-staging-grants" or a value of 0 means that it's not
>> supported. If feature is present then a second entry "staging-grants-sizes"
>> must exist and it contains the sizes that can be used per slot. To avoid
>> frontend clobbering with various different values, we limit to a set of fixed
>> ones semi-colon delimited.
>>
>> The allowed values are implementation specific. Examples of good values
>> include: 256 (protocol/header region), 2048 (fits 1 MSS which is common to be
>> in the linear part of linux packets), 4096 (grant per packet if feature-sg=0, for
>> DPDK/XDP/netmap buffers) and 65535 (maximum packet size i.e.
>> ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).
> 
> I am not following. So far we have been talking about two values:
> - maximum number of slots per ring
> - size of buffer
> We have also discussed the number of bytes used for indexes at the
> beginning of the command ring, which is 64 bytes.
Right, but these 64 bytes were only mentioned to describe the calculation of
number of slots.

> 
> But now you are introducing a few other values:
> - protocol/header region
> - fits 1 MSS which is common to be in the linear part of linux packets
> - grant per packet
> - maximum packet size (I take this is the same as size of buffer)
> 
> Could you please described what these values represent in more details
> and how they relate to the two previous values?
These values I give above are examples/recommendations of valid "size of the
buffer" values. That, multiplied by the number of slots, gives how much is
granted by the frontend.

Though I want to be able to say that the backend will copy up to N bytes to
these staging grants (or data grefs, in the terminology used in the doc), before it
resorts to grant ops. Hence the value of 65536 is the same as 4096 in terms of the
number of data grefs provided, but the backend will just copy less from/to the
staging area, leaving the rest to be used with the command ring grefs (grant
map/unmap/etc). Does this answer your question?
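
To make the sizing concrete, here is a minimal sketch (illustrative names, not
from the spec) of how the number of data grefs follows from the negotiated
per-slot size:

```
#include <stdio.h>

#define PAGE_SIZE 4096

/* Grants needed for one ring's staging area, given the negotiated per-slot
 * buffer size. A value above one page still costs one page of staging per
 * slot; the excess only tells the backend how much of the packet to memcpy. */
static unsigned int staging_grefs(unsigned int slots, unsigned int data_len)
{
    unsigned int per_slot = data_len > PAGE_SIZE ? PAGE_SIZE : data_len;

    return (slots * per_slot + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
    printf("%u\n", staging_grefs(256, 256));    /* 16 grefs (64KB) */
    printf("%u\n", staging_grefs(256, 4096));   /* 256 grefs       */
    printf("%u\n", staging_grefs(256, 65536));  /* also 256 grefs  */
    return 0;
}
```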

I went with the simple approach to start with, but I also thought of having a
small descriptor to describe how much of this staging area is used for the packet,
instead of a fixed negotiated value. But it would leave me with only a length
field. Unless we could have a header with the most commonly used extra info slot
(GSO) as part of this descriptor too, and save 1 slot per packet for GSO/GRO
packets.

>> Individual size covered per entry by frontend is through ```tx-data-len``` or
>> ```rx-data-len``` which contains maximum amount of data on a single packet.
>> Chosen values of "tx-data-len" and "rx-data-len" are asymmetrical (hence can
>> be different between TX and RX) are validated against this list of valid sizes.
>> For it's use see [datapath](#datapath-changes) section further below.
>>
>> 	/local/domain/1/device/vif/0/feature-staging-grants = "1"
>> 	/local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
> 
> I don't get what the 4 numbers mean. Are they possible supported options
> for "tx-data-len" and "rx-data-len", which in turn are the "size of
> buffer" as described above?
Exactly!

> If so, shouldn't they be written by the backend to backend paths? Why
> only 4 options? Could there be more? What is the max number of allowed
> option? And what is the minimum?
The path is wrong; it should be the backend path. The backend advertises this
set of values that can be used, and it can advertise as many values as it
supports (separated by semicolons). The values I selected were simply picked
based on what I explain in the previous paragraph. The frontend will then pick one
of those.

/local/domain/0/backend/vif/1/0/feature-staging-grants = "1"
/local/domain/0/backend/vif/1/0/staging-grants-sizes = "128;256;2048;4096"

I could be less restrictive and instead turn these sizes into a tuple with
"min;max" or two separate xenstore entries if folks prefer.

And the frontend just advertises a feature entry like this:

/local/domain/1/device/vif/0/feature-staging-grants = "1"
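
For what it's worth, a small sketch of how a frontend could validate its chosen
data-len against that advertised list (the helper name is made up):

```
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Check a chosen {tx,rx}-data-len against the backend's advertised
 * "staging-grants-sizes" value, e.g. "128;256;2048;4096". */
static bool data_len_is_valid(const char *sizes, unsigned long chosen)
{
    char *copy = strdup(sizes), *tok, *save = NULL;
    bool ok = false;

    for (tok = strtok_r(copy, ";", &save); tok;
         tok = strtok_r(NULL, ";", &save))
        if (strtoul(tok, NULL, 10) == chosen)
            ok = true;

    free(copy);
    return ok;
}
```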

Another alternative on the table was to specify this region as part of the
request/response, and turn the TX/RX rings into multi-page ones. But I found
that a bit limiting, for example if we wanted to allow a guest to use a fixed
set of pages (DPDK, netmap, XDP), which would result in around >256
"{tx,rx}-ring-ref" entries.

>> 	/local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
>> 	/local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
>> 	/local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
>> 	/local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
>>
>> The ```tx-data-ref%u``` and ```rx-data-ref%u``` describe the list of grants to be
>> used for each ring. The example above exemplifies a list composed of a single
> 
> For what ring, the data ring?
The command ring. There is no data ring, but rather a list of grefs given to the
backend, hence here I mention this as a "data list". I also refer to the "data
list" entries as "staging/data grefs". Plus there's one list for TX and
another for RX.

> 
>> page whereas multiple pages would be as:
>>
>> 	tx-data-ref0 = <data-ref-tx0>
>> 	tx-data-ref1 = <data-ref-tx1>
>> 	rx-data-ref0 = <data-ref-rx0>
>> 	rx-data-ref1 = <data-ref-rx1>
>>
>> Each slot in the `data-ref` list is formatted as following:
>>
>>             0     1     2     3     4     5     6     7  octet
>>          +-----+-----+-----+-----+-----+-----+-----+-----+
>>          | id        | gref                  | offset    |
>>          +-----+-----+-----+-----+-----+-----+-----+-----+
>>
>> 	id: the frame identifier.
> 
> What is a "frame identifier"?
It's the buffer we are sharing, the one we memcpy into for the backend to use.
I probably should use "buffer" instead. (See further below about its use.)
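
Just to spell out the layout quoted above, the 8-octet entry could be described
roughly as below (the struct name is illustrative; field widths are taken from
the diagram):

```
#include <stdint.h>

typedef uint32_t grant_ref_t;

/* One 8-octet entry of the tx/rx data-ref list: 2-octet buffer id,
 * 4-octet grant reference, 2-octet offset within the granted frame. */
struct netif_staging_slot {
    uint16_t    id;     /* staging buffer identifier    */
    grant_ref_t gref;   /* grant reference to the frame */
    uint16_t    offset; /* offset within the frame      */
};
```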

>> 	gref: grant reference to the frame.
>> 	offset: offset within the frame.
> 
> I assumed that each grant used for the data ring would be used in its
> entirety. Does it really make sense to create a ring using portions of
> pages? Given that we can only share memory at a page granularity,
My main argument for having portions of pages is for guests that only want to
use this region to copy the linear region and hence have a much smaller
footprint per ring (with just 16 grefs). Otherwise we would be sharing 256
grants while only using 1/16 of them. In the rest of the spec I take this to be the
general case, where we memcpy only a small portion of the packet.

> I don't think it makes sense: the rest of the page would still be
> accessible from the backend, which could compromise its content. I would
> share a list of pages that can be mapped contiguously to create a data
> ring.
Note that the permissions of the TX grefs are read-only, whereas the RX ring is
read/write, so I inherit this same property here for the data grefs too. I
wouldn't mind having a grant per packet; is your concern coming from the fact
that these RX data grefs are constantly mapped, hence the backend can mangle them
at any time, as opposed to RX grants which are still revoked when given back to the
frontend? But then nothing currently stops a backend from writing to these pages,
irrespective of the approach (with or without staging grants), until they are
copied/revoked; the frontend would always copy from the staging region into a
guest-only visible region.

>> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
>> respectively, and it is set up at device creation and removed at device
>> teardown, same as the command ring entries. This way we keep up with ring size
>> changes as it it expected to be in the command ring. A hypothetical multi-page
>                 ^ repetition
OK, will remove it.

>> command ring would increase number of slots and thus this data list would grow
>> accordingly. List is terminated by an entry which ```gref``` field is 0, having
>> ignored the other fields of this specific entry.
> 
> Could you please explain a bit more how the command ring increases the
> number of slot and why we need to allocate twice the number of required
> slots at the beginning?
"This list has twice as max slots as" is meant to say that the list has twice as
capacity as the command ring. The list entry is 8 octets, so one page fits 512
entries. The data ref is only filled with up to the command ring size which
means only half of it would be filled currently. I will adjust this sentence
here to make it more clear.

multi-page rings aren't supported on netback/netfront (yet) but as an example we
would have one `tx-data-ref0` to represent the command ring slots of
`tx-ring-ref0` and tx-ring-ref1`. I noticed that to not waste grants, I need to
assume the ```data-ref``` is full with the lack of a termanting entry; otherwise
I would endup needing `tx-data-ref1` to have a terminating entry. I will add a
sentence mentioning this.

>> ## Datapath Changes
>>
>> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
>> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
>> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In other
>> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. Hence
>> netback should ignore up to `size` of the `gref` when processing as the
>> `data-ring` has the contents of it.
> 
> My lack of netfront/netback knowledge might show, but I am not
> following. What do gref, size and idx represent in this context? Please
> explain the new workflow in more details.
[ Please ignore the part saying "replicated on the `data-ring` at `idx`", since
there is no ring. It should instead be the `data list` buffer identified by the
`struct netif_[tx|rx]_request` `id` field. ]

The new workflow is not too different from the old one: (on TX) we *first*
memcpy the packet to the staging grants region (or `data gref`, as I call it in
this doc), up to the negotiated `{tx,rx}-data-len` (or `size` specified in the
command ring, if smaller). The `data gref` to be used is identified by the `id`
field in the netif_[tx|rx]_request struct. The rest proceeds as before, that is,
granting the packet (within XEN_PAGE_SIZE boundaries) and setting `gref` and
`offset` accordingly in the rest of the command ring slots. Let me improve this
paragraph to make this clearer.
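
A very rough sketch of that TX flow, with a simplified stand-in for the netif
request struct and made-up helper names:

```
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct netif_tx_request. */
struct netif_tx_request {
    uint32_t gref;
    uint16_t offset;
    uint16_t flags;
    uint16_t id;
    uint16_t size;
};

/* Frontend TX sketch: memcpy up to tx-data-len into the staging buffer
 * selected by req->id, then fill the command ring slot as before so the
 * backend can grant-copy/map only the part beyond the staging area. */
static void tx_stage_packet(struct netif_tx_request *req,
                            uint8_t *staging_base, unsigned int tx_data_len,
                            const uint8_t *pkt, uint16_t len)
{
    unsigned int copied = len < tx_data_len ? len : tx_data_len;

    memcpy(staging_base + (size_t)req->id * tx_data_len, pkt, copied);

    req->size = len;
    /* req->gref and req->offset are still set up by the existing granting
     * code for the remainder of the packet (if any). */
}
```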

The other difference (see the Zerocopy section) is that if the frontend sets the
NETTXF_staging_buffer flag, then the `gref` field in the netif_[tx|rx]_request struct
will have a value of the `data gref` id (the id field described as "frame identifier"
that you asked about earlier in the doc). This is to allow a frontend to specify an RX
`data gref` to be used in the TX ring without involving any additional copy. I
haven't PoC-ed this zerocopy part yet, as it only covers two specific scenarios
(that is, guest XDP_TX or a DPDK frontend).

>> Values bigger then the 4096 page/grant boundary only have special meaning for
>> backend being how much it is required to be copied/pulled across the whole
>> packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
>> vs 4096 will have the same data list size and the latter value would lead to
>> only copy/pull one gref in the whole packet, whereas the former will be a
>> copy-only interface for all slots.
>>
>> ## Buffer Identification and Flags
>>
>> The data list ids must start from 0 and are global and continguous (across both
>> lists).  Data slot is identified by ring slot ```id``` field. Resultant data
>> gref id to be used in RX data list is computed by subtract of `struct
>> netif_[tx|rx]_request` ```id``` from total amount of tx data grefs. Example of
>> the lists layout  below:
> 
> What is the "resultant data gref id"? What is the "RX data list"?
> Please explain in more details.
"Resultant data gref id" corresponds to the case where we set
NETTXF_staging_buffer flag, in which case we set the command ring `gref` with
the `id` in the data list entry (see below diagram). "RX data list" is the list
described with `rx-data-ref`. I should have put that as separate paragrah as
this is making things more confusing.
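
In code terms, the id-to-RX-list mapping described here would be something like
(illustrative only):

```
/* Ids are global and contiguous across both lists, TX first
 * (0 .. tx_total-1) and RX after (tx_total .. tx_total+rx_total-1), so the
 * index into the RX data list is the request id minus the TX list size. */
static unsigned int rx_data_index(unsigned int id, unsigned int tx_total)
{
    return id - tx_total;   /* e.g. id 256 with 256 TX entries -> RX index 0 */
}
```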

>> ```
>>  [tx-data-ref-0, tx-data-len=256]
>>  { .id = 0, gref = 512, .offset = 0x0   }
>>  { .id = 1, gref = 512, .offset = 0x100 }
>>  { .id = 2, gref = 512, .offset = 0x200 }
>>  ...
>>  { .id = 256, gref = 0, .offset = 0x0   }
>>
>>  [rx-data-ref-0, rx-data-len=4096]
>>  { .id = 256, gref = 529, .offset = 0x0 }
>>  { .id = 257, gref = 530, .offset = 0x0 }
>>  { .id = 258, gref = 531, .offset = 0x0 }
>>  ...
>>  { .id = 512, gref = 0,   .offset = 0x0 }
>> ```
>>
>> Permissions of RX data grefs are read-write whereas TX data grefs is read-only.
>>
>> ## Zerocopy
>>
>> Frontend may wish to provide a bigger RX list than TX, and use RX buffers for
>> transmission in a zerocopy fashion for guests mainly doing forwarding. In such
>> cases backend set NETTXF_staging_buffer flag in ```netif_tx_request``` flags
>> field such that `gref` field instead designates the `id` of a data grefs.
>>
>> This is only valid when packets are solely described by the staging grants for
>> the slot packet size being written. Or when [tx|rx]-data-len is 4096 (for
>> feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is needed for
>> describing the packet payload.
>>
>> \clearpage
>>
>> ## Performance
>>
>> Numbers that give a rough idea on the performance benefits of this extension.
>> These are Guest <-> Dom0 which test the communication between backend and
>> frontend, excluding other bottlenecks in the datapath (the software switch).
>>
>> ```
>> # grant copy
>> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
>> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
>> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
>> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
>>
>> # grant copy + grant map (see next subsection)
>> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
>> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
>>
>> # drop at the guest network stack
>> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
>> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
>> ```
>>
>> With this extension:
>> ```
>> # memcpy
>> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
>> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
>> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
>>
>> # memcpy + grant map (see next subsection)
>> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
>> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
>>
>> # (drop at the guest network stack)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
>>
>> # (drop with guest XDP_DROP prog)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
>> ```
> 
> Very nice!
:D

>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
>> ```
>> 24 KTps vs 28 KTps
>> 39 KTps vs 50 KTps (with kernel busy poll)
>> ```
>>
>> TCP Bulk transfer measurements aren't showing a representative increase on
>> maximum throughput (sometimes ~10%), but rather less retransmissions and
>> more stable. This is probably because of being having a slight decrease in rtt
>> time (i.e. receiver acknowledging data quicker). Currently trying exploring
>> other data list sizes and probably will have a better idea on the effects of
>> this.
> 
> This is strange. By TCP Bulk transfers, do you mean iperf?
Yeap.

> From my pvcalls experience, I would expect a great improvement there too.
Notice that here we are only memcpying a small portion of the packet (256 bytes
max, not all of it). Past experience (in an old proposal I made) also showed
me an improvement like yours. I am a bit hesitant to memcpy all the way when
other things are involved in the workload; but then I currently don't see many
other alternatives to lessen the grant overhead. In the meantime I'll gather more
data points and have a clearer idea of the ratio of the improvement vs the
compromise.

>> ## Linux grant copy vs map remark
>>
>> Based on numbers above there's a sudden 2x performance drop when we switch from
>> grant copy to also grant map the ` gref`: 1 295 001 vs  577 782 for 256 and 260
>> packets bytes respectivally. Which is all the more visible when removing the grant
>> copy with memcpy in this extension (3 248 392 vs 588 428). While there's been
>> discussions of avoid the TLB unflush on unmap, one could wonder what the
>> threshold of that improvement would be. Chances are that this is the least of
>> our concerns in a fully poppulated host (or with an oversubscribed one). Would
>> it be worth experimenting increasing the threshold of the copy beyond the
>> header?
> 
> +1
Cool!

>> \clearpage
>>
>> # References
>>
>> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
>>
>> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
>>
>> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
>>
>> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>>
>> [4]
>> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
>>
>> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
>>
>> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
>>
>> # History
>>
>> A table of changes to the document, in chronological order.
>>
>> ------------------------------------------------------------------------
>> Date       Revision Version  Notes
>> ---------- -------- -------- -------------------------------------------
>> 2016-12-14 1        Xen 4.9  Initial version.
>> ---------- -------- -------- -------------------------------------------
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] netif: staging grants for requests
  2017-01-05 20:27   ` Joao Martins
@ 2017-01-06  0:30     ` Stefano Stabellini
  2017-01-06 17:13       ` Joao Martins
  0 siblings, 1 reply; 17+ messages in thread
From: Stefano Stabellini @ 2017-01-06  0:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Paul Durrant,
	David Vrabel, xen-devel

On Thu, 5 Jan 2017, Joao Martins wrote:
> Hey Stefano,
> 
> Thanks a lot for the review and the comments!!
> 
> On 01/04/2017 07:40 PM, Stefano Stabellini wrote:
> > On Wed, 14 Dec 2016, Joao Martins wrote:
> >> # Proposed Extension
> >>
> >> ## Terminology
> >>
> >> `data gref` refers to the reusable/staging grants, whereas `gref` are the
> >> ones that are granted/revoked for each packet being sent. `command ring`
> >> refers to the current ring, whereas `data list` refers to the one proposed
> >> below.
> >>
> >> ## Xenstore
> >>
> >> "feature-staging-grants" is a capability to allow the use of a set of
> >> recycled buffers permanently mapped at the device setup. If
> >> advertised, the frontend will grant a region equivalent to ```maximum number
> >> of slots per ring * size of buffer``` where:
> >>
> >>  * `maximum number of slots per ring` is the number of request structure
> >>    that can be fit within the command ring. By default a request
> >>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
> >>    producer and consumer indicies.
> > 
> > This means that by default the number of slots is (4096-64)/16 = 252,
> > correct?
> That would be correct but I made a few mistakes in the writing, despite a
> following paragraph mentioning the correct ring size for both TX and RX.
> The ring slot size is determined by the size of the request or response (being a
> union of request and response structures). This means the ring slot would either
> 8 octets for RX ring or 12 octets for TX ring which in effect translates to:
> 
> RX := (4096 - 64) / 8 = 504
> TX := (4096 - 64) / 12 = 336
> 
> And as Wei correctly mentions both values are rounded down to a power of 2.
> Which results in 256 slots for each ring. I will fix this in the spec.
> 
> >>  * `size of buffer` is the maximum portion of the packet the backend (and
> >>    frontend) have negotiated which will be put for each slot of the
> >>    `data ring`.
> >>
> >> Which means that for 256 possible packets in ring with 256 bytes of
> >> chosen size the amount of memory to be permanently granted is a region of
> >> 64KB i.e. 16 grefs.
> >>
> >> The lack of "feature-staging-grants" or a value of 0 means that it's not
> >> supported. If feature is present then a second entry "staging-grants-sizes"
> >> must exist and it contains the sizes that can be used per slot. To avoid
> >> frontend clobbering with various different values, we limit to a set of fixed
> >> ones semi-colon delimited.
> >>
> >> The allowed values are implementation specific. Examples of good values
> >> include: 256 (protocol/header region), 2048 (fits 1 MSS which is common to be
> >> in the linear part of linux packets), 4096 (grant per packet if feature-sg=0, for
> >> DPDK/XDP/netmap buffers) and 65535 (maximum packet size i.e.
> >> ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).
> > 
> > I am not following. So far we have been talking about two values:
> > - maximum number of slots per ring
> > - size of buffer
> > We have also discussed the number of bytes used for indexes at the
> > beginning of the command ring, which is 64 bytes.
> Right, but these 64 bytes were only mentioned to describe the calculation of
> number of slots.
> 
> > 
> > But now you are introducing a few other values:
> > - protocol/header region
> > - fits 1 MSS which is common to be in the linear part of linux packets
> > - grant per packet
> > - maximum packet size (I take this is the same as size of buffer)
> > 
> > Could you please described what these values represent in more details
> > and how they relate to the two previous values?
> These values I give above are examples/recomendations of valid "size of the
> buffer" values. That multiplied for the number of slots gives how much it is
> granted by the frontend.
> 
> Though I want to be able to say that the backend will copy up to N much bytes to
> these staging grants (or data grefs as terminology used in the doc), before it
> resorts to grant ops.

I think I understand.


> Hence the value of 65536 is the same as 4096 in terms of
> number of data grefs provided, but the backend will just copy less from/to the
> staging area leaving the rest to be used with the command ring grefs (grant
> map/unmap/etc). Does this answer your question?

Which value of 65536 is the same as 4096?  Packet size? Because up to
4096 bytes are copied to staging grants, the rest is dealt with as
usual?


> I went with the simple approach to start with, but I also thought of having a
> small descriptor to describe how much of this staging area is used for packet,
> instead of a fixed negotiated value. But it would leave me with only a length
> field. Unless we could have a hader with the most commonly used extra info slot
> (GSO) as part of this descriptor too, and save 1 slot per packet for GSO/GRO
> packets.

I think that a fixed size is OK. I'll call "N" the max number of bytes
copied to the staging grants. Are we going to avoid granting the
original packet data if the packet size <= N?

In fact, wouldn't it be better to only do memcpy if packet size <= N,
and only do mapping or grant copies if packet size > N? That way we
could skip all grant operations when packet size <= N and behave
normally in other cases.
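
In other words, something along these lines (a sketch of the suggested policy;
names are made up):

```
#include <stdbool.h>

/* With N = the negotiated data-len: packets that fit entirely in the
 * staging area need no grant operations at all, larger packets fall back
 * to the existing grant map/copy path. */
static bool needs_grant_ops(unsigned int pkt_len, unsigned int N)
{
    return pkt_len > N;   /* <= N: memcpy into pre-granted staging only */
}
```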


> >> Individual size covered per entry by frontend is through ```tx-data-len``` or
> >> ```rx-data-len``` which contains maximum amount of data on a single packet.
> >> Chosen values of "tx-data-len" and "rx-data-len" are asymmetrical (hence can
> >> be different between TX and RX) are validated against this list of valid sizes.
> >> For it's use see [datapath](#datapath-changes) section further below.
> >>
> >> 	/local/domain/1/device/vif/0/feature-staging-grants = "1"
> >> 	/local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
> > 
> > I don't get what the 4 numbers mean. Are they possible supported options
> > for "tx-data-len" and "rx-data-len", which in turn are the "size of
> > buffer" as described above?
> Exactly!
> 
> > If so, shouldn't they be written by the backend to backend paths? Why
> > only 4 options? Could there be more? What is the max number of allowed
> > option? And what is the minimum?
> The path is wrong, it should be the backend path. The backend advertises this
> set of values that can be used, and it can advertise as many values as it
> supports (separated by semicolon). The values I selected were simply picked
> based on what I explain in the previous paragraph. frontend will then pick one
> of those.

All right, all this needs to be clearly written in the doc.


> /local/domain/0/backend/vif/1/0/feature-staging-grants = "1"
> /local/domain/0/backend/vif/1/0/staging-grants-sizes = "128;256;2048;4096"
> 
> I could be less restrictive and instead turn these sizes into a tuple with
> "min;max" or two separate xenstore entries if folks prefer.
> 
> And the frontend just advertises a feature entry like this:
> 
> /local/domain/1/device/vif/0/feature-staging-grants = "1"
> 
> Other alternatives on the table was to specify this region as part of the
> request/response, and turn the TX/RX rings into multi-page ones. But I found
> that a bit limitative, for example if we wanted to allow a guest to use a fix
> set of pages (DPDK, netmap, XDP) which would result in around >256
> "{tx,rx}-ring-ref" entries.
> 
> >> 	/local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
> >> 	/local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
> >> 	/local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
> >> 	/local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
> >>
> >> The ```tx-data-ref%u``` and ```rx-data-ref%u``` describe the list of grants to be
> >> used for each ring. The example above exemplifies a list composed of a single
> > 
> > For what ring, the data ring?
> The command ring. There is no data ring, but a list of grefs given to the
> backend, hence here I mention this as "data list". And the "data list" entries I
> also refer to them as "staging/data grefs". Plus there's one list for TX and
> another RX.
> 
> > 
> >> page whereas multiple pages would be as:
> >>
> >> 	tx-data-ref0 = <data-ref-tx0>
> >> 	tx-data-ref1 = <data-ref-tx1>
> >> 	rx-data-ref0 = <data-ref-rx0>
> >> 	rx-data-ref1 = <data-ref-rx1>
> >>
> >> Each slot in the `data-ref` list is formatted as following:
> >>
> >>             0     1     2     3     4     5     6     7  octet
> >>          +-----+-----+-----+-----+-----+-----+-----+-----+
> >>          | id        | gref                  | offset    |
> >>          +-----+-----+-----+-----+-----+-----+-----+-----+
> >>
> >> 	id: the frame identifier.
> > 
> > What is a "frame identifier"?
> It's the buffer we are sharing, the one we use to memcpy for backend to use.
> Probably should use "buffer" instead. (See further below about its use)
> 
> >> 	gref: grant reference to the frame.
> >> 	offset: offset within the frame.
> > 
> > I assumed that each grant used for the data ring would be used in its
> > entirety. Does it really make sense to create a ring using portions of
> > pages? Given that we can only share memory at a page granularity,
> My main argument with having portions of pages is for guests that only want to
> use this region to copy the linear region and hence have a much smaller
> footprint per ring (with just 16 grefs). Otherwise we would be sharing 256
> grants while only using 1/16 of it. In the rest of the spec I take this is the
> general case, where we memcpy only a small portion of the packet.

Even if 16 gref is your main use case, 16*256 = 4096: you can achieve
that by sharing one single page for all data slots.


> > I don't think it makes sense: the rest of the page would still be
> > accessible from the backend, which could compromise its content. I would
> > share a list of pages that can be mapped contiguously to create a data
> > ring.
> Note that the permissions of the TX grefs are readonly, whereas the rx ring is
> read/write, so I inherit this same property here for the data grefs too. I
> wouldn't mind having a grant per packet;

That's not what I am suggesting.

Let's suppose that the frontend chooses a slot size of 128 and that we
have 256 slots in total. The frontend shares 8 pages to the backend:

 128*256 = 32768 = 8*4096

these 8 pages are going to be used for all data slots. So slot number 34
(if we count from 1) is the second slot of the second page (32 data
slots per page).
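
A tiny worked example of that slot-to-page mapping (purely illustrative):

```
#include <stdio.h>

#define PAGE_SIZE 4096

int main(void)
{
    unsigned int slot_size = 128;
    unsigned int slot = 33;  /* "slot 34" when counting from 1 */

    printf("page %u, offset 0x%x\n",
           (slot * slot_size) / PAGE_SIZE,   /* page 1: the second page    */
           (slot * slot_size) % PAGE_SIZE);  /* 0x80: the second 128B slot */
    return 0;
}
```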


> is your concern coming from the fact that these RX data grefs are
> constantly mapped hence backend can mangle anytime as opposed to RX
> grants are still being revoked when given back to the frontend?

Irrespective of what netfront/netback does today with the grants, it
is certainly better to avoid sharing unnecessary memory with the other
end.


> But then nothing currently stops a backend to write to these pages
> irrespective of the approach (with or without staging grants) until
> they are copied/revoked; the frontend would always copy from the
> staging region into a guest-only visible region.
>
> >> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
> >> respectively, and it is set up at device creation and removed at device
> >> teardown, same as the command ring entries. This way we keep up with ring size
> >> changes as it it expected to be in the command ring. A hypothetical multi-page
> >                 ^ repetition
> OK, will remove it.
> 
> >> command ring would increase number of slots and thus this data list would grow
> >> accordingly. List is terminated by an entry which ```gref``` field is 0, having
> >> ignored the other fields of this specific entry.
> > 
> > Could you please explain a bit more how the command ring increases the
> > number of slot and why we need to allocate twice the number of required
> > slots at the beginning?
> "This list has twice as max slots as" is meant to say that the list has twice as
> capacity as the command ring. The list entry is 8 octets, so one page fits 512
> entries. The data ref is only filled with up to the command ring size which
> means only half of it would be filled currently. I will adjust this sentence
> here to make it more clear.
> 
> multi-page rings aren't supported on netback/netfront (yet) but as an example we
> would have one `tx-data-ref0` to represent the command ring slots of
> `tx-ring-ref0` and tx-ring-ref1`. I noticed that to not waste grants, I need to
> assume the ```data-ref``` is full with the lack of a termanting entry; otherwise
> I would endup needing `tx-data-ref1` to have a terminating entry. I will add a
> sentence mentioning this.
> 
> >> ## Datapath Changes
> >>
> >> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
> >> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
> >> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In other
> >> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. Hence
> >> netback should ignore up to `size` of the `gref` when processing as the
> >> `data-ring` has the contents of it.
> > 
> > My lack of netfront/netback knowledge might show, but I am not
> > following. What do gref, size and idx represent in this context? Please
> > explain the new workflow in more details.
> [ Please ignore the part saying "replicated on the `data-ring` at `idx`", since
> there is no ring. It should instead be `data list` buffer identified by `struct
> netif_[tx|rx]_request` `id` field. ]
> 
> The new workflow is not too different from the old one: (on TX) we *first*
> memcpy the packet to staging grants region (or `data gref` like how I mention in
> this doc) of up to the negotiated `{tx,rx}-data-len` (or `size` specified in the
> command ring if smaller). The `data gref` to be used is identified by the `id`
> field in netif_[tx|rx]_request struct. The rest will proceed as before, that is
> granting the packet (within XEN_PAGE_SIZE boundaries) and setting `gref` and
> `offset` accordingly in the rest of the command ring slots. Let me improve this
> paragraph to make this more clear.
> 
> The other difference (see Zerocopy section) is that if frontend sets the flag
> NETTXF_staging_buffer, then the `gref` field in netif_[tx|rx]_request struct
> will have a value of the `data gref` id (the id field saying "frame identifier"
> that you asked earlier in the doc). This is to allow a frontend to specify an RX
> `data gref` to be used in the TX ring without involving any additional copy. I
> haven't PoC-ed this zerocopy part yet, as it only covers two specific scenarios
> (that is guest XDP_TX or on a DPDK frontend).
> 
> >> Values bigger then the 4096 page/grant boundary only have special meaning for
> >> backend being how much it is required to be copied/pulled across the whole
> >> packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
> >> vs 4096 will have the same data list size and the latter value would lead to
> >> only copy/pull one gref in the whole packet, whereas the former will be a
> >> copy-only interface for all slots.
> >>
> >> ## Buffer Identification and Flags
> >>
> >> The data list ids must start from 0 and are global and continguous (across both
> >> lists).  Data slot is identified by ring slot ```id``` field. Resultant data
> >> gref id to be used in RX data list is computed by subtract of `struct
> >> netif_[tx|rx]_request` ```id``` from total amount of tx data grefs. Example of
> >> the lists layout  below:
> > 
> > What is the "resultant data gref id"? What is the "RX data list"?
> > Please explain in more details.
> "Resultant data gref id" corresponds to the case where we set
> NETTXF_staging_buffer flag, in which case we set the command ring `gref` with
> the `id` in the data list entry (see below diagram). "RX data list" is the list
> described with `rx-data-ref`. I should have put that as separate paragrah as
> this is making things more confusing.
> 
> >> ```
> >>  [tx-data-ref-0, tx-data-len=256]
> >>  { .id = 0, gref = 512, .offset = 0x0   }
> >>  { .id = 1, gref = 512, .offset = 0x100 }
> >>  { .id = 2, gref = 512, .offset = 0x200 }
> >>  ...
> >>  { .id = 256, gref = 0, .offset = 0x0   }
> >>
> >>  [rx-data-ref-0, rx-data-len=4096]
> >>  { .id = 256, gref = 529, .offset = 0x0 }
> >>  { .id = 257, gref = 530, .offset = 0x0 }
> >>  { .id = 258, gref = 531, .offset = 0x0 }
> >>  ...
> >>  { .id = 512, gref = 0,   .offset = 0x0 }
> >> ```
> >>
> >> Permissions of RX data grefs are read-write whereas TX data grefs is read-only.
> >>
> >> ## Zerocopy
> >>
> >> Frontend may wish to provide a bigger RX list than TX, and use RX buffers for
> >> transmission in a zerocopy fashion for guests mainly doing forwarding. In such
> >> cases backend set NETTXF_staging_buffer flag in ```netif_tx_request``` flags
> >> field such that `gref` field instead designates the `id` of a data grefs.
> >>
> >> This is only valid when packets are solely described by the staging grants for
> >> the slot packet size being written. Or when [tx|rx]-data-len is 4096 (for
> >> feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is needed for
> >> describing the packet payload.
> >>
> >> \clearpage
> >>
> >> ## Performance
> >>
> >> Numbers that give a rough idea on the performance benefits of this extension.
> >> These are Guest <-> Dom0 which test the communication between backend and
> >> frontend, excluding other bottlenecks in the datapath (the software switch).
> >>
> >> ```
> >> # grant copy
> >> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> >> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> >> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> >> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
> >>
> >> # grant copy + grant map (see next subsection)
> >> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> >> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
> >>
> >> # drop at the guest network stack
> >> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> >> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> >> ```
> >>
> >> With this extension:
> >> ```
> >> # memcpy
> >> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> >> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> >> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> >> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> >>
> >> # memcpy + grant map (see next subsection)
> >> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> >> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
> >>
> >> # (drop at the guest network stack)
> >> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
> >> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
> >>
> >> # (drop with guest XDP_DROP prog)
> >> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
> >> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
> >> ```
> > 
> > Very nice!
> :D
> 
> >> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> >> ```
> >> 24 KTps vs 28 KTps
> >> 39 KTps vs 50 KTps (with kernel busy poll)
> >> ```
> >>
> >> TCP Bulk transfer measurements aren't showing a representative increase on
> >> maximum throughput (sometimes ~10%), but rather less retransmissions and
> >> more stable. This is probably because of being having a slight decrease in rtt
> >> time (i.e. receiver acknowledging data quicker). Currently trying exploring
> >> other data list sizes and probably will have a better idea on the effects of
> >> this.
> > 
> > This is strange. By TCP Bulk transfers, do you mean iperf?
> Yeap.
> 
> > From my pvcalls experience, I would expect a great improvement there too.
> Notice that here we are only memcpying a small portion of the packet (256 bytes
> max, not all of it). 

Are we? Unless I am mistaken, the protocol doesn't have any limitations
on the amount of bytes we are memcpying. It is up to the backend, which
could theoretically support large amounts such as 64k and larger.


> Past experience (in a old proposal I made) it also showed
> me an improvement like yours. I am a bit hesitant with memcpy all of way when
> other things are involved in the workload; but then I currently don't see many
> other alternatives to lessen the grants overhead. In the meantime I'll have more
> data points and have a clearer idea on the ratio of the improvement vs the
> compromise.

I don't want to confuse you, but another alternative to consider would
be to use a circular data ring like in PVCalls instead of slots. But I
don't know how difficult it would be to implement something like that in
netfront/netback. The command ring could still be used to send packets,
but the data could be transferred over the data ring instead. In this
model we would have to handle the case where a slot is available on the
command ring, but we don't have enough room on the data ring to copy the
data. For example, we could fall back to grant ops.
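
A sketch of the fallback check that model would need (names are illustrative; a
real data ring would track its producer/consumer indices consistently with its
size):

```
#include <stdbool.h>

/* Free-running producer/consumer indices over a power-of-two sized data
 * ring: if the payload fits in the free space, memcpy it there; otherwise
 * fall back to the existing grant-based path. */
static bool data_ring_has_room(unsigned int prod, unsigned int cons,
                               unsigned int ring_size, unsigned int len)
{
    unsigned int used = prod - cons;      /* wraps correctly for unsigned */

    return len <= ring_size - used;
}
```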


> >> ## Linux grant copy vs map remark
> >>
> >> Based on numbers above there's a sudden 2x performance drop when we switch from
> >> grant copy to also grant map the ` gref`: 1 295 001 vs  577 782 for 256 and 260
> >> packets bytes respectivally. Which is all the more visible when removing the grant
> >> copy with memcpy in this extension (3 248 392 vs 588 428). While there's been
> >> discussions of avoid the TLB unflush on unmap, one could wonder what the
> >> threshold of that improvement would be. Chances are that this is the least of
> >> our concerns in a fully poppulated host (or with an oversubscribed one). Would
> >> it be worth experimenting increasing the threshold of the copy beyond the
> >> header?
> > 
> > +1
> Cool!
> 
> >> \clearpage
> >>
> >> # References
> >>
> >> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
> >>
> >> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
> >>
> >> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
> >>
> >> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
> >>
> >> [4]
> >> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
> >>
> >> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
> >>
> >> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
> >>
> >> # History
> >>
> >> A table of changes to the document, in chronological order.
> >>
> >> ------------------------------------------------------------------------
> >> Date       Revision Version  Notes
> >> ---------- -------- -------- -------------------------------------------
> >> 2016-12-14 1        Xen 4.9  Initial version.
> >> ---------- -------- -------- -------------------------------------------
> >>
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] netif: staging grants for requests
  2016-12-14 18:11 [RFC] netif: staging grants for requests Joao Martins
  2017-01-04 13:54 ` Wei Liu
  2017-01-04 19:40 ` Stefano Stabellini
@ 2017-01-06  9:33 ` Paul Durrant
  2017-01-06 19:18   ` Stefano Stabellini
  2017-01-06 20:08   ` Joao Martins
  2 siblings, 2 replies; 17+ messages in thread
From: Paul Durrant @ 2017-01-06  9:33 UTC (permalink / raw)
  To: Joao Martins, xen-devel; +Cc: Andrew Cooper, Stefano Stabellini, Wei Liu

> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 14 December 2016 18:11
> To: xen-devel@lists.xenproject.org
> Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
> Subject: [RFC] netif: staging grants for requests
> 
> Hey,
> 
> Back in the Xen hackaton '16 networking session there were a couple of ideas
> brought up. One of them was about exploring permanently mapped grants
> between
> xen-netback/xen-netfront.
> 
> I started experimenting and came up with sort of a design document (in
> pandoc)
> on what it would like to be proposed. This is meant as a seed for discussion
> and also requesting input to know if this is a good direction. Of course, I
> am willing to try alternatives that we come up beyond the contents of the
> spec, or any other suggested changes ;)
> 
> Any comments or feedback is welcome!
> 

Hi,

  Sorry for the delay... I've been OOTO for three weeks.

  I like the general approach of pre-granting buffers for RX so that the backend can simply memcpy and tell the frontend which buffer a packet appears in, but IIUC you are proposing use of a single pre-granted area for TX also, which would presumably require the frontend to always copy on the TX side? I wonder if we might go for a slightly different scheme...

  The assumption is that the working set of TX buffers in the guest OS is fairly small (which is probably true for a small number of heavily used sockets and an OS that uses a slab allocator)...

  The guest TX code maintains a hash table of buffer addresses to grant refs. When a packet is sent the code looks to see if it has already granted the buffer and re-uses the existing ref if so, otherwise it grants the buffer and adds the new ref into the table.

  The backend also maintains a hash of grant refs to addresses and, whenever it sees a new ref, it grant maps it and adds the address into the table. Otherwise it does a hash lookup and thus has a buffer address it can immediately memcpy from.

  If the frontend wants the backend to release a grant ref (e.g. because it's starting to run out of grant table) then a control message can be used to ask for it back, at which point the backend removes the ref from its cache and unmaps it.
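
  A minimal, self-contained sketch of the frontend-side cache described here (the grant call is a stand-in for whatever the real frontend would use, e.g. something like gnttab_grant_foreign_access in Linux, and a real implementation would hash rather than scan):

```
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;

/* Stand-in for the real grant operation. */
static grant_ref_t grant_buffer(void *page) { (void)page; return 123; }

/* Tiny page-address -> gref cache. */
#define CACHE_SLOTS 64
static struct { void *page; grant_ref_t gref; int used; } cache[CACHE_SLOTS];

static grant_ref_t tx_gref_for(void *page)
{
    size_t i, free_slot = 0;

    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].used && cache[i].page == page)
            return cache[i].gref;      /* hit: reuse the existing grant */
        if (!cache[i].used)
            free_slot = i;
    }

    /* miss: grant the buffer once and remember the ref for next time */
    cache[free_slot].page = page;
    cache[free_slot].gref = grant_buffer(page);
    cache[free_slot].used = 1;
    return cache[free_slot].gref;
}
```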

  Using this scheme we allow a guest OS to still use either a zero-copy approach if it wishes to do so, or a static pre-grant... or something between (e.g. pre-grant for headers, zero copy for bulk data).

  Does that sound reasonable?

    Paul
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] netif: staging grants for requests
  2017-01-06  0:30     ` Stefano Stabellini
@ 2017-01-06 17:13       ` Joao Martins
  2017-01-06 19:02         ` Stefano Stabellini
  0 siblings, 1 reply; 17+ messages in thread
From: Joao Martins @ 2017-01-06 17:13 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Paul Durrant, Wei Liu, Andrew Cooper

On 01/06/2017 12:30 AM, Stefano Stabellini wrote:
> On Thu, 5 Jan 2017, Joao Martins wrote:
>> Hey Stefano,
>>
>> Thanks a lot for the review and the comments!!
>>
>> On 01/04/2017 07:40 PM, Stefano Stabellini wrote:
>>> On Wed, 14 Dec 2016, Joao Martins wrote:
>>>> # Proposed Extension
>>>>
>>>> ## Terminology
>>>>
>>>> `data gref` refers to the reusable/staging grants, whereas `gref` are the
>>>> ones that are granted/revoked for each packet being sent. `command ring`
>>>> refers to the current ring, whereas `data list` refers to the one proposed
>>>> below.
>>>>
>>>> ## Xenstore
>>>>
>>>> "feature-staging-grants" is a capability to allow the use of a set of
>>>> recycled buffers permanently mapped at the device setup. If
>>>> advertised, the frontend will grant a region equivalent to ```maximum number
>>>> of slots per ring * size of buffer``` where:
>>>>
>>>>  * `maximum number of slots per ring` is the number of request structure
>>>>    that can be fit within the command ring. By default a request
>>>>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
>>>>    producer and consumer indicies.
>>>
>>> This means that by default the number of slots is (4096-64)/16 = 252,
>>> correct?
>> That would be correct but I made a few mistakes in the writing, despite a
>> following paragraph mentioning the correct ring size for both TX and RX.
>> The ring slot size is determined by the size of the request or response (being a
>> union of request and response structures). This means the ring slot would either
>> 8 octets for RX ring or 12 octets for TX ring which in effect translates to:
>>
>> RX := (4096 - 64) / 8 = 504
>> TX := (4096 - 64) / 12 = 336
>>
>> And as Wei correctly mentions both values are rounded down to a power of 2.
>> Which results in 256 slots for each ring. I will fix this in the spec.
>>
>>>>  * `size of buffer` is the maximum portion of the packet the backend (and
>>>>    frontend) have negotiated which will be put for each slot of the
>>>>    `data ring`.
>>>>
>>>> Which means that for 256 possible packets in ring with 256 bytes of
>>>> chosen size the amount of memory to be permanently granted is a region of
>>>> 64KB i.e. 16 grefs.
>>>>
>>>> The lack of "feature-staging-grants" or a value of 0 means that it's not
>>>> supported. If feature is present then a second entry "staging-grants-sizes"
>>>> must exist and it contains the sizes that can be used per slot. To avoid
>>>> frontend clobbering with various different values, we limit to a set of fixed
>>>> ones semi-colon delimited.
>>>>
>>>> The allowed values are implementation specific. Examples of good values
>>>> include: 256 (protocol/header region), 2048 (fits 1 MSS which is common to be
>>>> in the linear part of linux packets), 4096 (grant per packet if feature-sg=0, for
>>>> DPDK/XDP/netmap buffers) and 65535 (maximum packet size i.e.
>>>> ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).
>>>
>>> I am not following. So far we have been talking about two values:
>>> - maximum number of slots per ring
>>> - size of buffer
>>> We have also discussed the number of bytes used for indexes at the
>>> beginning of the command ring, which is 64 bytes.
>> Right, but these 64 bytes were only mentioned to describe the calculation of
>> number of slots.
>>
>>>
>>> But now you are introducing a few other values:
>>> - protocol/header region
>>> - fits 1 MSS which is common to be in the linear part of linux packets
>>> - grant per packet
>>> - maximum packet size (I take this is the same as size of buffer)
>>>
>>> Could you please described what these values represent in more details
>>> and how they relate to the two previous values?
>> These values I give above are examples/recommendations of valid "size of the
>> buffer" values. That, multiplied by the number of slots, gives how much is
>> granted by the frontend.
>>
>> Though I want to be able to say that the backend will copy up to N bytes to
>> these staging grants (or data grefs, as they are called in the doc), before it
>> resorts to grant ops.
> 
> I think I understand.
> 
> 
>> Hence the value of 65536 is the same as 4096 in terms of
>> number of data grefs provided, but the backend will just copy less from/to the
>> staging area leaving the rest to be used with the command ring grefs (grant
>> map/unmap/etc). Does this answer your question?
> 
> Which value of 65536 is the same as 4096?  Packet size? Because up to
> 4096 bytes are copied to staging grants, the rest is dealt with as
> usual?
I mean the same in terms of the amount of staging grants: 4096 is one gref,
which multiplied by the number of slots gives 256 grefs provided by the frontend
for copying data. 65536 represents the maximum packet size supported by
netback/netfront, with a packet spanning multiple slots. 65536 just has a
different semantic - namely that we would be copying more than 1 gref; it could
also be a smaller value such as 32768. But in reality with tx-data-len=4096 we
are already granting the maximum (e.g. one gref per ring slot), hence 65536 is
just a semantic difference telling the backend to copy more and grant less (or
not grant at all in the latter case).

>> I went with the simple approach to start with, but I also thought of having a
>> small descriptor to describe how much of this staging area is used per packet,
>> instead of a fixed negotiated value. But it would leave me with only a length
>> field. Unless we could have a header with the most commonly used extra info slot
>> (GSO) as part of this descriptor too, and save 1 slot per packet for GSO/GRO
>> packets.
> 
> I think that a fixed size is OK. I'll call "N" the max number of bytes
> copied to the staging grants. Are we going to avoid granting the
> original packet data if the packet size <= N?
Yeap, we would only resort to grant ops for size > N.

> 
> In fact, wouldn't it be better to only do memcpy if packet size <= N,
> and only do mapping or grant copies if packet size > N? That way we
> could skip all grant operations when packet size <= N and behave
> normally in other cases.
That is the purpose - my explanation was a bit messy.
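
To make the intended datapath concrete, here is a rough sketch of the TX side
(pseudo-C, not actual netfront code; tx_data_len is the negotiated N, and
staging_buf_of()/grant_tx_page() are made-up helpers standing in for the real
thing):

```
/* Sketch only: always memcpy the first min(len, N) bytes into the
 * pre-granted staging buffer that corresponds to this slot (selected
 * by req->id); grant ops are needed only for the part beyond N. */
static void queue_packet(struct netif_tx_request *req,
                         const void *pkt, uint16_t len)
{
        uint16_t copied = len < tx_data_len ? len : tx_data_len;

        memcpy(staging_buf_of(req->id), pkt, copied);

        if (len > tx_data_len)
                /* Larger packet: grant as today; the backend skips the
                 * first 'copied' bytes of the gref. */
                req->gref = grant_tx_page(pkt);

        req->size = len;
}
```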

>>>> The individual size covered per entry by the frontend is set through
>>>> ```tx-data-len``` or ```rx-data-len```, which contain the maximum amount of
>>>> data for a single packet. The chosen values of "tx-data-len" and
>>>> "rx-data-len" are asymmetrical (hence can be different between TX and RX)
>>>> and are validated against this list of valid sizes. For its use see the
>>>> [datapath](#datapath-changes) section further below.
>>>>
>>>> 	/local/domain/1/device/vif/0/feature-staging-grants = "1"
>>>> 	/local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
>>>
>>> I don't get what the 4 numbers mean. Are they possible supported options
>>> for "tx-data-len" and "rx-data-len", which in turn are the "size of
>>> buffer" as described above?
>> Exactly!
>>
>>> If so, shouldn't they be written by the backend to backend paths? Why
>>> only 4 options? Could there be more? What is the max number of allowed
>>> option? And what is the minimum?
>> The path is wrong, it should be the backend path. The backend advertises this
>> set of values that can be used, and it can advertise as many values as it
>> supports (separated by semicolon). The values I selected were simply picked
>> based on what I explain in the previous paragraph. The frontend will then pick
>> one of those.
> 
> All right, all this needs to be clearly written in the doc.
OK.

>> /local/domain/0/backend/vif/1/0/feature-staging-grants = "1"
>> /local/domain/0/backend/vif/1/0/staging-grants-sizes = "128;256;2048;4096"
>>
>> I could be less restrictive and instead turn these sizes into a tuple with
>> "min;max" or two separate xenstore entries if folks prefer.
>>
>> And the frontend just advertises a feature entry like this:
>>
>> /local/domain/1/device/vif/0/feature-staging-grants = "1"
>>
>> Other alternatives on the table was to specify this region as part of the
>> request/response, and turn the TX/RX rings into multi-page ones. But I found
>> that a bit limiting, for example if we wanted to allow a guest to use a fixed
>> set of pages (DPDK, netmap, XDP) which would result in around >256
>> "{tx,rx}-ring-ref" entries.
>>
>>>> 	/local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
>>>> 	/local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
>>>> 	/local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
>>>> 	/local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
>>>>
>>>> The ```tx-data-ref%u``` and ```rx-data-ref%u``` describe the list of grants to be
>>>> used for each ring. The example above exemplifies a list composed of a single
>>>
>>> For what ring, the data ring?
>> The command ring. There is no data ring, but a list of grefs given to the
>> backend, hence here I mention this as "data list". And the "data list" entries I
>> also refer to them as "staging/data grefs". Plus there's one list for TX and
>> another RX.
>>
>>>
>>>> page whereas multiple pages would be as:
>>>>
>>>> 	tx-data-ref0 = <data-ref-tx0>
>>>> 	tx-data-ref1 = <data-ref-tx1>
>>>> 	rx-data-ref0 = <data-ref-rx0>
>>>> 	rx-data-ref1 = <data-ref-rx1>
>>>>
>>>> Each slot in the `data-ref` list is formatted as following:
>>>>
>>>>             0     1     2     3     4     5     6     7  octet
>>>>          +-----+-----+-----+-----+-----+-----+-----+-----+
>>>>          | id        | gref                  | offset    |
>>>>          +-----+-----+-----+-----+-----+-----+-----+-----+
>>>>
>>>> 	id: the frame identifier.
>>>
>>> What is a "frame identifier"?
>> It's the buffer we are sharing, the one we use to memcpy for backend to use.
>> Probably should use "buffer" instead. (See further below about its use)
>>
>>>> 	gref: grant reference to the frame.
>>>> 	offset: offset within the frame.
>>>
>>> I assumed that each grant used for the data ring would be used in its
>>> entirety. Does it really make sense to create a ring using portions of
>>> pages? Given that we can only share memory at a page granularity,
>> My main argument with having portions of pages is for guests that only want to
>> use this region to copy the linear region and hence have a much smaller
>> footprint per ring (with just 16 grefs). Otherwise we would be sharing 256
>> grants while only using 1/16 of it. In the rest of the spec I take this is the
>> general case, where we memcpy only a small portion of the packet.
> 
> Even if 16 gref is your main use case, 16*256 = 4096: you can achieve
> that by sharing one single page for all data slots.
What I am proposing for the general case would be (e.g. rx-data-len=256): 256
slots * 256 bytes = 65536; so 64K is 16 grefs, and that's where this number
comes from.
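
Spelling the arithmetic out (just back-of-the-envelope numbers, not driver
code):

```
#define PAGE_SIZE   4096
#define RING_HDR      64    /* producer/consumer indices                */
#define TX_SLOT_SZ    12    /* TX ring slot (union of request/response) */
#define RX_SLOT_SZ     8    /* RX ring slot                             */

/* (4096 - 64) / 12 = 336 and (4096 - 64) / 8 = 504, both rounded down
 * to a power of two => 256 slots on either ring.                       */
#define SLOTS        256

/* Grants needed for the staging area, for data-len <= 4096 (larger
 * values don't add grants, they only change how much gets copied):     */
static unsigned int data_grefs(unsigned int data_len)
{
        return SLOTS * data_len / PAGE_SIZE;  /* 256 *  256 / 4096 = 16  */
}                                             /* 256 * 4096 / 4096 = 256 */
```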

>>> I don't think it makes sense: the rest of the page would still be
>>> accessible from the backend, which could compromise its content. I would
>>> share a list of pages that can be mapped contiguously to create a data
>>> ring.
>> Note that the permissions of the TX grefs are readonly, whereas the rx ring is
>> read/write, so I inherit this same property here for the data grefs too. I
>> wouldn't mind having a grant per packet;
> 
> That's not what I am suggesting.
> 
> Let's suppose that the frontend chooses a slot size of 128 and that we
> have 256 slots in total. The frontend shares 8 pages to the backend:
> 
>  128*256 = 32768 = 8*4096
> 
> these 8 pages are going to be used for all data slots. So slot number 34
> (if we count from 1) is the second slot of the second page (32 data
> slots per page).
Exactly, and that's what I am proposing :) The text is a bit ambiguous and is
causing a little confusion, my apologies.
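
In other words, locating a slot's staging buffer would be something along these
lines (illustrative only, not actual code):

```
/* data_pages[] are the pre-granted pages, data_len the negotiated
 * per-slot size; e.g. with data_len = 128 there are 32 slots per page,
 * so slot id 33 (the 34th, counting from 1) lands in the second slot
 * of the second page. */
static void *staging_buf(void **data_pages, unsigned int data_len,
                         unsigned int id)
{
        unsigned int per_page = 4096 / data_len;
        unsigned int page     = id / per_page;
        unsigned int offset   = (id % per_page) * data_len;

        return (char *)data_pages[page] + offset;
}
```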

>> is your concern coming from the fact that these RX data grefs are
>> constantly mapped, hence the backend can mangle them at any time, as opposed
>> to RX grants, which are still revoked when given back to the frontend?
> 
> Irrespective of what netfront/netback does today with the grants, it
> is certainly better to avoid sharing unnecessary memory to the other
> end.
Yeap.

>> But then nothing currently stops a backend from writing to these pages
>> irrespective of the approach (with or without staging grants) until
>> they are copied/revoked; the frontend would always copy from the
>> staging region into a guest-only visible region.
>>
>>>> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
>>>> respectively, and it is set up at device creation and removed at device
>>>> teardown, same as the command ring entries. This way we keep up with ring size
>>>> changes as it it expected to be in the command ring. A hypothetical multi-page
>>>                 ^ repetition
>> OK, will remove it.
>>
>>>> command ring would increase number of slots and thus this data list would grow
>>>> accordingly. List is terminated by an entry which ```gref``` field is 0, having
>>>> ignored the other fields of this specific entry.
>>>
>>> Could you please explain a bit more how the command ring increases the
>>> number of slot and why we need to allocate twice the number of required
>>> slots at the beginning?
>> "This list has twice as max slots as" is meant to say that the list has twice as
>> capacity as the command ring. The list entry is 8 octets, so one page fits 512
>> entries. The data ref is only filled with up to the command ring size which
>> means only half of it would be filled currently. I will adjust this sentence
>> here to make it more clear.
>>
>> multi-page rings aren't supported on netback/netfront (yet) but as an example we
>> would have one `tx-data-ref0` to represent the command ring slots of
>> `tx-ring-ref0` and tx-ring-ref1`. I noticed that to not waste grants, I need to
>> assume the ```data-ref``` is full in the absence of a terminating entry; otherwise
>> I would end up needing `tx-data-ref1` to have a terminating entry. I will add a
>> sentence mentioning this.
>>
>>>> ## Datapath Changes
>>>>
>>>> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
>>>> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
>>>> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In other
>>>> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. Hence
>>>> netback should ignore up to `size` of the `gref` when processing as the
>>>> `data-ring` has the contents of it.
>>>
>>> My lack of netfront/netback knowledge might show, but I am not
>>> following. What do gref, size and idx represent in this context? Please
>>> explain the new workflow in more details.
>> [ Please ignore the part saying "replicated on the `data-ring` at `idx`", since
>> there is no ring. It should instead be `data list` buffer identified by `struct
>> netif_[tx|rx]_request` `id` field. ]
>>
>> The new workflow is not too different from the old one: (on TX) we *first*
>> memcpy the packet to staging grants region (or `data gref` like how I mention in
>> this doc) of up to the negotiated `{tx,rx}-data-len` (or `size` specified in the
>> command ring if smaller). The `data gref` to be used is identified by the `id`
>> field in netif_[tx|rx]_request struct. The rest will proceed as before, that is
>> granting the packet (within XEN_PAGE_SIZE boundaries) and setting `gref` and
>> `offset` accordingly in the rest of the command ring slots. Let me improve this
>> paragraph to make this more clear.
>>
>> The other difference (see Zerocopy section) is that if frontend sets the flag
>> NETTXF_staging_buffer, then the `gref` field in netif_[tx|rx]_request struct
>> will have a value of the `data gref` id (the id field saying "frame identifier"
>> that you asked earlier in the doc). This is to allow a frontend to specify an RX
>> `data gref` to be used in the TX ring without involving any additional copy. I
>> haven't PoC-ed this zerocopy part yet, as it only covers two specific scenarios
>> (that is guest XDP_TX or on a DPDK frontend).
>>
>>>> Values bigger than the 4096 page/grant boundary only have special meaning for
>>>> backend being how much it is required to be copied/pulled across the whole
>>>> packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
>>>> vs 4096 will have the same data list size and the latter value would lead to
>>>> only copy/pull one gref in the whole packet, whereas the former will be a
>>>> copy-only interface for all slots.
>>>>
>>>> ## Buffer Identification and Flags
>>>>
>>>> The data list ids must start from 0 and are global and contiguous (across both
>>>> lists). A data slot is identified by the ring slot ```id``` field. The resultant
>>>> data gref id to be used in the RX data list is computed by subtracting the
>>>> `struct netif_[tx|rx]_request` ```id``` from the total amount of tx data grefs.
>>>> Example of the lists layout below:
>>>
>>> What is the "resultant data gref id"? What is the "RX data list"?
>>> Please explain in more details.
>> "Resultant data gref id" corresponds to the case where we set
>> NETTXF_staging_buffer flag, in which case we set the command ring `gref` with
>> the `id` in the data list entry (see below diagram). "RX data list" is the list
>> described with `rx-data-ref`. I should have put that as a separate paragraph, as
>> this is making things more confusing.
>>
>>>> ```
>>>>  [tx-data-ref-0, tx-data-len=256]
>>>>  { .id = 0, gref = 512, .offset = 0x0   }
>>>>  { .id = 1, gref = 512, .offset = 0x100 }
>>>>  { .id = 2, gref = 512, .offset = 0x200 }
>>>>  ...
>>>>  { .id = 256, gref = 0, .offset = 0x0   }
>>>>
>>>>  [rx-data-ref-0, rx-data-len=4096]
>>>>  { .id = 256, gref = 529, .offset = 0x0 }
>>>>  { .id = 257, gref = 530, .offset = 0x0 }
>>>>  { .id = 258, gref = 531, .offset = 0x0 }
>>>>  ...
>>>>  { .id = 512, gref = 0,   .offset = 0x0 }
>>>> ```
>>>>
>>>> Permissions of RX data grefs are read-write whereas TX data grefs are read-only.
>>>>
>>>> ## Zerocopy
>>>>
>>>> Frontend may wish to provide a bigger RX list than TX, and use RX buffers for
>>>> transmission in a zerocopy fashion for guests mainly doing forwarding. In such
>>>> cases backend set NETTXF_staging_buffer flag in ```netif_tx_request``` flags
>>>> field such that `gref` field instead designates the `id` of a data grefs.
>>>>
>>>> This is only valid when packets are solely described by the staging grants for
>>>> the slot packet size being written. Or when [tx|rx]-data-len is 4096 (for
>>>> feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is needed for
>>>> describing the packet payload.
>>>>
>>>> \clearpage
>>>>
>>>> ## Performance
>>>>
>>>> Numbers that give a rough idea on the performance benefits of this extension.
>>>> These are Guest <-> Dom0 which test the communication between backend and
>>>> frontend, excluding other bottlenecks in the datapath (the software switch).
>>>>
>>>> ```
>>>> # grant copy
>>>> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
>>>> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
>>>> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
>>>> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
>>>>
>>>> # grant copy + grant map (see next subsection)
>>>> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
>>>> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
>>>>
>>>> # drop at the guest network stack
>>>> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
>>>> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
>>>> ```
>>>>
>>>> With this extension:
>>>> ```
>>>> # memcpy
>>>> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
>>>> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
>>>> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
>>>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
>>>>
>>>> # memcpy + grant map (see next subsection)
>>>> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
>>>> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
>>>>
>>>> # (drop at the guest network stack)
>>>> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
>>>> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
>>>>
>>>> # (drop with guest XDP_DROP prog)
>>>> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
>>>> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
>>>> ```
>>>
>>> Very nice!
>> :D
>>
>>>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
>>>> ```
>>>> 24 KTps vs 28 KTps
>>>> 39 KTps vs 50 KTps (with kernel busy poll)
>>>> ```
>>>>
>>>> TCP Bulk transfer measurements aren't showing a representative increase in
>>>> maximum throughput (sometimes ~10%), but rather fewer retransmissions and
>>>> more stability. This is probably because of a slight decrease in rtt
>>>> (i.e. the receiver acknowledging data quicker). I am currently exploring
>>>> other data list sizes and will probably have a better idea of the effects of
>>>> this.
>>>
>>> This is strange. By TCP Bulk transfers, do you mean iperf?
>> Yeap.
>>
>>> From my pvcalls experience, I would expect a great improvement there too.
>> Notice that here we are only memcpying a small portion of the packet (256 bytes
>> max, not all of it). 
> 
> Are we? Unless I am mistaken, the protocol doesn't have any limitations
> on the amount of bytes we are memcpying. It is up to the backend, which
> could theoretically support large amounts such as 64k and larger.
The case I gave above was when the backend was only memcpying a maximum of 256
bytes (that is, the linear part of the skbs). But correct, the protocol doesn't
have limitations if the backend allows a frontend to copy the whole packet,
though only if the frontend also selects tx-data-len|rx-data-len as 65536.

>> Past experience (in an old proposal I made) also showed
>> me an improvement like yours. I am a bit hesitant about memcpying all the way when
>> other things are involved in the workload; but then I currently don't see many
>> other alternatives to lessen the grants overhead. In the meantime I'll have more
>> data points and have a clearer idea on the ratio of the improvement vs the
>> compromise.
> 
> I don't want to confuse you, but another alternative to consider would
> be to use a circular data ring like in PVCalls instead of slots. But I
> don't know how difficult it would be to implement something like that in
> netfront/netback. The command ring could still be used to send packets,
> but the data could be transferred over the data ring instead. In this
> model we would have to handle the case where a slot is available on the
> command ring, but we don't have enough room on the data ring to copy the
> data. For example, we could fall back to grant ops.
I thought of that option, but for starters I was a bit wary of introducing two
data rings, hence I kept things simple (despite the text being a bit messy). The
only reservation I have about data rings is that the backend wouldn't be able to
hold on to the pages in a zerocopy fashion like it does now.

Thanks!

>>>> ## Linux grant copy vs map remark
>>>>
>>>> Based on the numbers above there's a sudden 2x performance drop when we switch
>>>> from grant copy to also grant mapping the `gref`: 1 295 001 vs 577 782 pps for
>>>> 256 and 260 byte packets respectively. This is all the more visible when the
>>>> grant copy is replaced with memcpy in this extension (3 248 392 vs 588 428).
>>>> While there have been discussions about avoiding the TLB flush on unmap, one
>>>> could wonder what the threshold of that improvement would be. Chances are that
>>>> this is the least of our concerns in a fully populated host (or an
>>>> oversubscribed one). Would it be worth experimenting with increasing the
>>>> threshold of the copy beyond the header?
>>>
>>> +1
>> Cool!
>>
>>>> \clearpage
>>>>
>>>> # References
>>>>
>>>> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
>>>>
>>>> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
>>>>
>>>> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
>>>>
>>>> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>>>>
>>>> [4]
>>>> http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
>>>>
>>>> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
>>>>
>>>> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
>>>>
>>>> # History
>>>>
>>>> A table of changes to the document, in chronological order.
>>>>
>>>> ------------------------------------------------------------------------
>>>> Date       Revision Version  Notes
>>>> ---------- -------- -------- -------------------------------------------
>>>> 2016-12-14 1        Xen 4.9  Initial version.
>>>> ---------- -------- -------- -------------------------------------------


* Re: [RFC] netif: staging grants for requests
  2017-01-06 17:13       ` Joao Martins
@ 2017-01-06 19:02         ` Stefano Stabellini
  0 siblings, 0 replies; 17+ messages in thread
From: Stefano Stabellini @ 2017-01-06 19:02 UTC (permalink / raw)
  To: Joao Martins
  Cc: xen-devel, Stefano Stabellini, Wei Liu, Paul Durrant, Andrew Cooper

On Fri, 6 Jan 2017, Joao Martins wrote:
> >>> I don't think it makes sense: the rest of the page would still be
> >>> accessible from the backend, which could compromise its content. I would
> >>> share a list of pages that can be mapped contiguously to create a data
> >>> ring.
> >> Note that the permissions of the TX grefs are readonly, whereas the rx ring is
> >> read/write, so I inherit this same property here for the data grefs too. I
> >> wouldn't mind having a grant per packet;
> > 
> > That's not what I am suggesting.
> > 
> > Let's suppose that the frontend chooses a slot size of 128 and that we
> > have 256 slots in total. The frontend shares 8 pages to the backend:
> > 
> >  128*256 = 32768 = 8*4096
> > 
> > these 8 pages are going to be used for all data slots. So slot number 34
> > (if we count from 1) is the second slot of the second page (32 data
> > slots per page).
> Exactly, and that's what I am proposing :) The text is a bit ambiguous and is
> causing a little confusion, my apologies.

All right, I really misunderstood the document. Please clarify it in the
next version :-)


> >> But then nothing currently stops a backend from writing to these pages
> >> irrespective of the approach (with or without staging grants) until
> >> they are copied/revoked; the frontend would always copy from the
> >> staging region into a guest-only visible region.
> >>
> >>>> This list has twice as max slots as would have `tx-ring-ref` or `rx-ring-ref`
> >>>> respectively, and it is set up at device creation and removed at device
> >>>> teardown, same as the command ring entries. This way we keep up with ring size
> >>>> changes as it it expected to be in the command ring. A hypothetical multi-page
> >>>                 ^ repetition
> >> OK, will remove it.
> >>
> >>>> command ring would increase number of slots and thus this data list would grow
> >>>> accordingly. List is terminated by an entry which ```gref``` field is 0, having
> >>>> ignored the other fields of this specific entry.
> >>>
> >>> Could you please explain a bit more how the command ring increases the
> >>> number of slot and why we need to allocate twice the number of required
> >>> slots at the beginning?
> >> "This list has twice as max slots as" is meant to say that the list has twice as
> >> capacity as the command ring. The list entry is 8 octets, so one page fits 512
> >> entries. The data ref is only filled with up to the command ring size which
> >> means only half of it would be filled currently. I will adjust this sentence
> >> here to make it more clear.
> >>
> >> multi-page rings aren't supported on netback/netfront (yet) but as an example we
> >> would have one `tx-data-ref0` to represent the command ring slots of
> >> `tx-ring-ref0` and tx-ring-ref1`. I noticed that to not waste grants, I need to
> >> assume the ```data-ref``` is full in the absence of a terminating entry; otherwise
> >> I would end up needing `tx-data-ref1` to have a terminating entry. I will add a
> >> sentence mentioning this.
> >>
> >>>> ## Datapath Changes
> >>>>
> >>>> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
> >>>> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
> >>>> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In other
> >>>> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. Hence
> >>>> netback should ignore up to `size` of the `gref` when processing as the
> >>>> `data-ring` has the contents of it.
> >>>
> >>> My lack of netfront/netback knowledge might show, but I am not
> >>> following. What do gref, size and idx represent in this context? Please
> >>> explain the new workflow in more details.
> >> [ Please ignore the part saying "replicated on the `data-ring` at `idx`", since
> >> there is no ring. It should instead be `data list` buffer identified by `struct
> >> netif_[tx|rx]_request` `id` field. ]
> >>
> >> The new workflow is not too different from the old one: (on TX) we *first*
> >> memcpy the packet to staging grants region (or `data gref` like how I mention in
> >> this doc) of up to the negotiated `{tx,rx}-data-len` (or `size` specified in the
> >> command ring if smaller). The `data gref` to be used is identified by the `id`
> >> field in netif_[tx|rx]_request struct. The rest will proceed as before, that is
> >> granting the packet (within XEN_PAGE_SIZE boundaries) and setting `gref` and
> >> `offset` accordingly in the rest of the command ring slots. Let me improve this
> >> paragraph to make this more clear.
> >>
> >> The other difference (see Zerocopy section) is that if frontend sets the flag
> >> NETTXF_staging_buffer, then the `gref` field in netif_[tx|rx]_request struct
> >> will have a value of the `data gref` id (the id field saying "frame identifier"
> >> that you asked earlier in the doc). This is to allow a frontend to specify an RX
> >> `data gref` to be used in the TX ring without involving any additional copy. I
> >> haven't PoC-ed this zerocopy part yet, as it only covers two specific scenarios
> >> (that is guest XDP_TX or on a DPDK frontend).
> >>
> >>>> Values bigger than the 4096 page/grant boundary only have special meaning for
> >>>> backend being how much it is required to be copied/pulled across the whole
> >>>> packet (which can be composed of multiple slots). Hence (e.g.) a value of 65536
> >>>> vs 4096 will have the same data list size and the latter value would lead to
> >>>> only copy/pull one gref in the whole packet, whereas the former will be a
> >>>> copy-only interface for all slots.
> >>>>
> >>>> ## Buffer Identification and Flags
> >>>>
> >>>> The data list ids must start from 0 and are global and contiguous (across both
> >>>> lists). A data slot is identified by the ring slot ```id``` field. The resultant
> >>>> data gref id to be used in the RX data list is computed by subtracting the
> >>>> `struct netif_[tx|rx]_request` ```id``` from the total amount of tx data grefs.
> >>>> Example of the lists layout below:
> >>>
> >>> What is the "resultant data gref id"? What is the "RX data list"?
> >>> Please explain in more details.
> >> "Resultant data gref id" corresponds to the case where we set
> >> NETTXF_staging_buffer flag, in which case we set the command ring `gref` with
> >> the `id` in the data list entry (see below diagram). "RX data list" is the list
> >> described with `rx-data-ref`. I should have put that as a separate paragraph, as
> >> this is making things more confusing.
> >>
> >>>> ```
> >>>>  [tx-data-ref-0, tx-data-len=256]
> >>>>  { .id = 0, gref = 512, .offset = 0x0   }
> >>>>  { .id = 1, gref = 512, .offset = 0x100 }
> >>>>  { .id = 2, gref = 512, .offset = 0x200 }
> >>>>  ...
> >>>>  { .id = 256, gref = 0, .offset = 0x0   }
> >>>>
> >>>>  [rx-data-ref-0, rx-data-len=4096]
> >>>>  { .id = 256, gref = 529, .offset = 0x0 }
> >>>>  { .id = 257, gref = 530, .offset = 0x0 }
> >>>>  { .id = 258, gref = 531, .offset = 0x0 }
> >>>>  ...
> >>>>  { .id = 512, gref = 0,   .offset = 0x0 }
> >>>> ```
> >>>>
> >>>> Permissions of RX data grefs are read-write whereas TX data grefs are read-only.
> >>>>
> >>>> ## Zerocopy
> >>>>
> >>>> Frontend may wish to provide a bigger RX list than TX, and use RX buffers for
> >>>> transmission in a zerocopy fashion for guests mainly doing forwarding. In such
> >>>> cases backend set NETTXF_staging_buffer flag in ```netif_tx_request``` flags
> >>>> field such that `gref` field instead designates the `id` of a data grefs.
> >>>>
> >>>> This is only valid when packets are solely described by the staging grants for
> >>>> the slot packet size being written. Or when [tx|rx]-data-len is 4096 (for
> >>>> feature-sg 0) or 65535 (for feature-sg 1) and thus no new `gref` is needed for
> >>>> describing the packet payload.
> >>>>
> >>>> \clearpage
> >>>>
> >>>> ## Performance
> >>>>
> >>>> Numbers that give a rough idea on the performance benefits of this extension.
> >>>> These are Guest <-> Dom0 which test the communication between backend and
> >>>> frontend, excluding other bottlenecks in the datapath (the software switch).
> >>>>
> >>>> ```
> >>>> # grant copy
> >>>> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> >>>> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> >>>> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> >>>> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
> >>>>
> >>>> # grant copy + grant map (see next subsection)
> >>>> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> >>>> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
> >>>>
> >>>> # drop at the guest network stack
> >>>> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> >>>> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> >>>> ```
> >>>>
> >>>> With this extension:
> >>>> ```
> >>>> # memcpy
> >>>> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> >>>> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> >>>> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> >>>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> >>>>
> >>>> # memcpy + grant map (see next subsection)
> >>>> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> >>>> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
> >>>>
> >>>> # (drop at the guest network stack)
> >>>> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
> >>>> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
> >>>>
> >>>> # (drop with guest XDP_DROP prog)
> >>>> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
> >>>> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
> >>>> ```
> >>>
> >>> Very nice!
> >> :D
> >>
> >>>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> >>>> ```
> >>>> 24 KTps vs 28 KTps
> >>>> 39 KTps vs 50 KTps (with kernel busy poll)
> >>>> ```
> >>>>
> >>>> TCP Bulk transfer measurements aren't showing a representative increase in
> >>>> maximum throughput (sometimes ~10%), but rather fewer retransmissions and
> >>>> more stability. This is probably because of a slight decrease in rtt
> >>>> (i.e. the receiver acknowledging data quicker). I am currently exploring
> >>>> other data list sizes and will probably have a better idea of the effects of
> >>>> this.
> >>>
> >>> This is strange. By TCP Bulk transfers, do you mean iperf?
> >> Yeap.
> >>
> >>> From my pvcalls experience, I would expect a great improvement there too.
> >> Notice that here we are only memcpying a small portion of the packet (256 bytes
> >> max, not all of it). 
> > 
> > Are we? Unless I am mistaken, the protocol doesn't have any limitations
> > on the amount of bytes we are memcpying. It is up to the backend, which
> > could theoretically support large amounts such as 64k and larger.
> The case I gave above was when the backend was only memcpying a maximum of 256
> bytes (that is, the linear part of the skbs). But correct, the protocol doesn't
> have limitations if the backend allows a frontend to copy the whole packet,
> though only if the frontend also selects tx-data-len|rx-data-len as 65536.

Sure. My point is that by changing the backend settings to increase the
amount of data copied per packet, you should be able to get very good
TCP bulk transfer performance with your current code. In fact, you
should be able to get close to PVCalls. If that doesn't happen, it's a
problem.
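
For instance, with xenstore entries along the lines of the examples earlier in
the thread (the 65536 value is hypothetical and depends on what the backend
advertises):

```
/local/domain/0/backend/vif/1/0/staging-grants-sizes = "256;2048;4096;65536"
/local/domain/1/device/vif/0/queue-0/tx-data-len = "65536"
/local/domain/1/device/vif/0/queue-0/rx-data-len = "65536"
```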


> >> Past experience (in an old proposal I made) also showed
> >> me an improvement like yours. I am a bit hesitant about memcpying all the way when
> >> other things are involved in the workload; but then I currently don't see many
> >> other alternatives to lessen the grants overhead. In the meantime I'll have more
> >> data points and have a clearer idea on the ratio of the improvement vs the
> >> compromise.
> > 
> > I don't want to confuse you, but another alternative to consider would
> > be to use a circular data ring like in PVCalls instead of slots. But I
> > don't know how difficult it would be to implement something like that in
> > netfront/netback. The command ring could still be used to send packets,
> > but the data could be transferred over the data ring instead. In this
> > model we would have to handle the case where a slot is available on the
> > command ring, but we don't have enough room on the data ring to copy the
> > data. For example, we could fall back to grant ops.
> I thought of that option, but for starters I was a bit wary of introducing two
> data rings, hence I kept things simple (despite the text being a bit messy). The
> only reservation I have about data rings is that the backend wouldn't be able to
> hold on to the pages in a zerocopy fashion like it does now.

The advantage of that approach is that the amount of data memcpy'ed
could be dynamic per packet and potentially much larger than 256, while
the total amount of shared memory could be lower (because we don't have
to share all the slots all the time).

If you are unable to get good TCP Bulk transfer performance with the
current approach, I would consider revisiting this proposal and
introducing a data ring buffer instead.
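
Roughly what I have in mind is something like the following sketch (a PVCalls
style byte ring; struct and field names here are invented for illustration):

```
#include <stdint.h>
#include <string.h>

/* The command ring slot would then carry (offset, len) into this ring
 * instead of a per-packet gref. */
struct data_ring {
        unsigned char *buf;     /* shared area, power-of-two sized      */
        uint32_t size;
        uint32_t prod, cons;    /* free-running byte indices            */
};

/* Copy a packet into the ring; returns its offset, or -1 when there is
 * not enough room, in which case the sender falls back to grant ops.
 * A real implementation also needs write barriers before publishing
 * the new producer index to the other end. */
static int data_ring_put(struct data_ring *r, const void *pkt, uint32_t len)
{
        uint32_t free = r->size - (r->prod - r->cons);
        uint32_t off  = r->prod & (r->size - 1);

        if (len > free)
                return -1;

        if (off + len <= r->size) {
                memcpy(r->buf + off, pkt, len);
        } else {                                /* wrap around */
                uint32_t first = r->size - off;
                memcpy(r->buf + off, pkt, first);
                memcpy(r->buf, (const char *)pkt + first, len - first);
        }
        r->prod += len;
        return (int)off;
}
```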


* Re: [RFC] netif: staging grants for requests
  2017-01-06  9:33 ` Paul Durrant
@ 2017-01-06 19:18   ` Stefano Stabellini
  2017-01-06 20:19     ` Joao Martins
  2017-01-09  9:03     ` Paul Durrant
  2017-01-06 20:08   ` Joao Martins
  1 sibling, 2 replies; 17+ messages in thread
From: Stefano Stabellini @ 2017-01-06 19:18 UTC (permalink / raw)
  To: Paul Durrant
  Cc: xen-devel, Joao Martins, Stefano Stabellini, Wei Liu, Andrew Cooper

On Fri, 6 Jan 2017, Paul Durrant wrote:
> > -----Original Message-----
> > From: Joao Martins [mailto:joao.m.martins@oracle.com]
> > Sent: 14 December 2016 18:11
> > To: xen-devel@lists.xenproject.org
> > Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul Durrant
> > <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
> > Subject: [RFC] netif: staging grants for requests
> > 
> > Hey,
> > 
> > Back in the Xen hackaton '16 networking session there were a couple of ideas
> > brought up. One of them was about exploring permanently mapped grants
> > between
> > xen-netback/xen-netfront.
> > 
> > I started experimenting and came up with sort of a design document (in
> > pandoc)
> > on what it would like to be proposed. This is meant as a seed for discussion
> > and also requesting input to know if this is a good direction. Of course, I
> > am willing to try alternatives that we come up beyond the contents of the
> > spec, or any other suggested changes ;)
> > 
> > Any comments or feedback is welcome!
> > 
> 
> Hi,
> 
>   Sorry for the delay... I've been OOTO for three weeks.
> 
>   I like the general approach of pre-granting buffers for RX so that the backend can simply memcpy and tell the frontend which buffer a packet appears in but IIUC you are proposing use of a single pre-granted area for TX also, which would presumably require the frontend to always copy on the TX side? I wonder if we might go for a slightly different scheme...
> 
>   The assumption is that the working set of TX buffers in the guest OS is fairly small (which is probably true for a small number of heavily used sockets and an OS that uses a slab allocator)...
> 
>   The guest TX code maintains a hash table of buffer addresses to grant refs. When a packet is sent the code looks to see if it has already granted the buffer and re-uses the existing ref if so, otherwise it grants the buffer and adds the new ref into the table.
> 
>   The backend also maintains a hash of grant refs to addresses and, whenever it sees a new ref, it grant maps it and adds the address into the table. Otherwise it does a hash lookup and thus has a buffer address it can immediately memcpy from.
> 
>   If the frontend wants the backend to release a grant ref (e.g. because it's starting to run out of grant table) then a control message can be used to ask for it back, at which point the backend removes the ref from its cache and unmaps it.
> 
>   Using this scheme we allow a guest OS to still use either a zero-copy approach if it wishes to do so, or a static pre-grant... or something between (e.g. pre-grant for headers, zero copy for bulk data).
> 
>   Does that sound reasonable?

Wouldn't it be better to introduce the new memcpy based scheme for RX
only?

This suggestion increases the total amount of grants and the amount of
non-packet data simultaneously shared from frontend to backend by
"accident" (because it happens to be on the same pages that are granted
to the backend). Am I right? If so, it could be a security concern.
Let's keep in mind that if we trust the backend then we might as well go
all the way and assume that the backend is in Dom0 and can map any mfn
in memory without requiring grants. That would be a simple extension
guaranteed to have great performance (I am in favor of this FYI).


* Re: [RFC] netif: staging grants for requests
  2017-01-06  9:33 ` Paul Durrant
  2017-01-06 19:18   ` Stefano Stabellini
@ 2017-01-06 20:08   ` Joao Martins
  2017-01-09  8:56     ` Paul Durrant
  1 sibling, 1 reply; 17+ messages in thread
From: Joao Martins @ 2017-01-06 20:08 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, Stefano Stabellini, Wei Liu, Andrew Cooper

On 01/06/2017 09:33 AM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Joao Martins [mailto:joao.m.martins@oracle.com]
>> Sent: 14 December 2016 18:11
>> To: xen-devel@lists.xenproject.org
>> Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
>> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul Durrant
>> <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
>> Subject: [RFC] netif: staging grants for requests
>>
>> Hey,
>>
>> Back in the Xen hackaton '16 networking session there were a couple of ideas
>> brought up. One of them was about exploring permanently mapped grants
>> between
>> xen-netback/xen-netfront.
>>
>> I started experimenting and came up with sort of a design document (in
>> pandoc)
>> on what it would like to be proposed. This is meant as a seed for discussion
>> and also requesting input to know if this is a good direction. Of course, I
>> am willing to try alternatives that we come up beyond the contents of the
>> spec, or any other suggested changes ;)
>>
>> Any comments or feedback is welcome!
>>
> 
> Hi,
Hey!

> 
> Sorry for the delay... I've been OOTO for three weeks.
Thanks for the comments!

> I like the general approach of pre-granting buffers for RX so that the backend
> can simply memcpy and tell the frontend which buffer a packet appears in
Cool,

> but IIUC you are proposing use of a single pre-granted area for TX also, which would
> presumably require the frontend to always copy on the TX side? I wonder if we 
> might go for a slightly different scheme...
I see.

> 
> The assumption is that the working set of TX buffers in the guest OS is fairly
> small (which is probably true for a small number of heavily used sockets and an
> OS that uses a slab allocator)...
Hmm, [speaking about Linux] maybe for the skb allocation cache. For the
remaining packet pages maybe not, say for a scatter-gather list...? But I guess
it would need to be validated whether this working set is indeed kept small, as
this seems like a very strong assumption given the variety of possible
workloads. Plus, wouldn't we leak info from these pages if they weren't used on
the device but rather elsewhere in the guest stack?

> The guest TX code maintains a hash table of buffer addresses to grant refs. When
> a packet is sent the code looks to see if it has already granted the buffer and
> re-uses the existing ref if so, otherwise it grants the buffer and adds the new
> ref into the table.

> The backend also maintains a hash of grant refs to addresses and, whenever it
> sees a new ref, it grant maps it and adds the address into the table. Otherwise
> it does a hash lookup and thus has a buffer address it can immediately memcpy
> from.
> 
> If the frontend wants the backend to release a grant ref (e.g. because it's
> starting to run out of grant table) then a control message can be used to ask
> for it back, at which point the backend removes the ref from its cache and
> unmaps it.
Wouldn't this be somewhat similar to the persistent grants in xen block drivers?

> Using this scheme we allow a guest OS to still use either a zero-copy approach
> if it wishes to do so, or a static pre-grant... or something between 
> (e.g. pre-grant for headers, zero copy for bulk data).
> 
> Does that sound reasonable?
Not sure yet, but it looks nice if we can indeed achieve the zero copy part. I
have two concerns though: a backend could be forced to keep removing refs
because its cache is always full, with the frontend not being able to reuse
these pages (subject to its own allocator behavior, in case the assumption above
isn't satisfied), nullifying the backend's effort in maintaining its table of
mapped grefs. The other concern is whether those pages (assumed to be reused)
might leak guest data to the backend (when not used by netfront).
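
Just to check I follow the scheme, the frontend side would look roughly like
this? (sketch only; grant_ro() stands in for the real grant API, the cache is a
plain array here where the real thing would use a hash table, and 0 is used as
"no ref" for simplicity)

```
#include <stdint.h>

#define CACHE_SLOTS 256

struct gref_cache_entry {
        unsigned long frame;    /* page backing a TX buffer */
        uint32_t gref;
        int in_use;
};

extern uint32_t grant_ro(uint16_t domid, unsigned long frame); /* stand-in */

/* Return an existing gref for 'frame', or grant it lazily and remember
 * it; refs are kept until the frontend asks for them back. */
static uint32_t tx_gref_for(struct gref_cache_entry *cache,
                            uint16_t backend, unsigned long frame)
{
        unsigned int i, free = CACHE_SLOTS;

        for (i = 0; i < CACHE_SLOTS; i++) {
                if (cache[i].in_use && cache[i].frame == frame)
                        return cache[i].gref;   /* re-use existing ref */
                if (!cache[i].in_use)
                        free = i;
        }
        if (free == CACHE_SLOTS)
                return 0;       /* cache full: caller would evict or
                                 * fall back to the copy path */

        cache[free].frame  = frame;
        cache[free].gref   = grant_ro(backend, frame);
        cache[free].in_use = 1;
        return cache[free].gref;
}
```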

Joao


* Re: [RFC] netif: staging grants for requests
  2017-01-06 19:18   ` Stefano Stabellini
@ 2017-01-06 20:19     ` Joao Martins
  2017-01-09  9:03     ` Paul Durrant
  1 sibling, 0 replies; 17+ messages in thread
From: Joao Martins @ 2017-01-06 20:19 UTC (permalink / raw)
  To: Stefano Stabellini, Paul Durrant; +Cc: xen-devel, Wei Liu, Andrew Cooper



On 01/06/2017 07:18 PM, Stefano Stabellini wrote:
> On Fri, 6 Jan 2017, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Joao Martins [mailto:joao.m.martins@oracle.com]
>>> Sent: 14 December 2016 18:11
>>> To: xen-devel@lists.xenproject.org
>>> Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
>>> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul Durrant
>>> <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
>>> Subject: [RFC] netif: staging grants for requests
>>>
>>> Hey,
>>>
>>> Back in the Xen hackaton '16 networking session there were a couple of ideas
>>> brought up. One of them was about exploring permanently mapped grants
>>> between
>>> xen-netback/xen-netfront.
>>>
>>> I started experimenting and came up with sort of a design document (in
>>> pandoc)
>>> on what it would like to be proposed. This is meant as a seed for discussion
>>> and also requesting input to know if this is a good direction. Of course, I
>>> am willing to try alternatives that we come up beyond the contents of the
>>> spec, or any other suggested changes ;)
>>>
>>> Any comments or feedback is welcome!
>>>
>>
>> Hi,
>>
>>   Sorry for the delay... I've been OOTO for three weeks.
>>
>>   I like the general approach of pre-granting buffers for RX so that the backend can simply memcpy and tell the frontend which buffer a packet appears in but IIUC you are proposing use of a single pre-granted area for TX also, which would presumably require the frontend to always copy on the TX side? I wonder if we might go for a slightly different scheme...
>>
>>   The assumption is that the working set of TX buffers in the guest OS is fairly small (which is probably true for a small number of heavily used sockets and an OS that uses a slab allocator)...
>>
>>   The guest TX code maintains a hash table of buffer addresses to grant refs. When a packet is sent the code looks to see if it has already granted the buffer and re-uses the existing ref if so, otherwise it grants the buffer and adds the new ref into the table.
>>
>>   The backend also maintains a hash of grant refs to addresses and, whenever it sees a new ref, it grant maps it and adds the address into the table. Otherwise it does a hash lookup and thus has a buffer address it can immediately memcpy from.
>>
>>   If the frontend wants the backend to release a grant ref (e.g. because it's starting to run out of grant table) then a control message can be used to ask for it back, at which point the backend removes the ref from its cache and unmaps it.
>>
>>   Using this scheme we allow a guest OS to still use either a zero-copy approach if it wishes to do so, or a static pre-grant... or something between (e.g. pre-grant for headers, zero copy for bulk data).
>>
>>   Does that sound reasonable?
> 
> Wouldn't it be better to introduce the new memcpy based scheme for RX
> only?
On transmit there's also a copy (done by the backend), at least for the linear
part, today (even without this extension). And if the numbers do show that
copying the whole packet is faster, then perhaps allow such a setting, subject
to backend policy on what it permits?

> This suggestion increases the total amount of grants and the amount of
> non-packet data simultaneously shared from frontend to backend by
> "accident" (because it happens to be on the same pages that are granted
> to the backend). Am I right? If so, it could be a security concern.
>
> Let's keep in mind that if we trust the backend then we might as well go
> all the way and assume that the backend is in Dom0 and can map any mfn
> in memory without requiring grants. That would be a simple extension
> guaranteed to have great performance (I am in favor of this FYI).
> 
+1 on this last paragraph.

Joao


* Re: [RFC] netif: staging grants for requests
  2017-01-06 20:08   ` Joao Martins
@ 2017-01-09  8:56     ` Paul Durrant
  2017-01-09 13:01       ` Joao Martins
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Durrant @ 2017-01-09  8:56 UTC (permalink / raw)
  To: Joao Martins; +Cc: xen-devel, Stefano Stabellini, Wei Liu, Andrew Cooper

> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 06 January 2017 20:09
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Stefano
> Stabellini <sstabellini@kernel.org>
> Subject: Re: [RFC] netif: staging grants for requests
> 
> On 01/06/2017 09:33 AM, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> >> Sent: 14 December 2016 18:11
> >> To: xen-devel@lists.xenproject.org
> >> Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
> >> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul
> Durrant
> >> <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
> >> Subject: [RFC] netif: staging grants for requests
> >>
> >> Hey,
> >>
> >> Back in the Xen hackaton '16 networking session there were a couple of
> ideas
> >> brought up. One of them was about exploring permanently mapped
> grants
> >> between
> >> xen-netback/xen-netfront.
> >>
> >> I started experimenting and came up with sort of a design document (in
> >> pandoc)
> >> on what it would like to be proposed. This is meant as a seed for
> discussion
> >> and also requesting input to know if this is a good direction. Of course, I
> >> am willing to try alternatives that we come up beyond the contents of the
> >> spec, or any other suggested changes ;)
> >>
> >> Any comments or feedback is welcome!
> >>
> >
> > Hi,
> Hey!
> 
> >
> > Sorry for the delay... I've been OOTO for three weeks.
> Thanks for the comments!
> 
> > I like the general approach of pre-granting buffers for RX so that the
> backend
> > can simply memcpy and tell the frontend which buffer a packet appears in
> Cool,
> 
> > but IIUC you are proposing use of a single pre-granted area for TX also,
> which would
> > presumably require the frontend to always copy on the TX side? I wonder if
> we
> > might go for a slightly different scheme...
> I see.
> 
> >
> > The assumption is that the working set of TX buffers in the guest OS is fairly
> > small (which is probably true for a small number of heavily used sockets
> and an
> > OS that uses a slab allocator)...
> Hmm, [speaking about Linux] maybe for the skb allocation cache. For the
> remaining packet pages maybe not, say for a scatter-gather list...? But I guess
> it would need to be validated whether this working set is indeed kept small, as
> this seems like a very strong assumption given the variety of possible
> workloads. Plus, wouldn't we leak info from these pages if they weren't used on
> the device but rather elsewhere in the guest stack?

Yes, potentially there is an information leak but I am assuming that the backend is also trusted by the frontend, which is pretty well baked into the protocol anyway. Also, if the working set (which is going to be OS/stack dependent) turned out to be a bit too large then the frontend can always fall back to a copy into a locally allocated buffer, as in your proposal, anyway.

> 
> > The guest TX code maintains a hash table of buffer addresses to grant refs.
> When
> > a packet is sent the code looks to see if it has already granted the buffer
> and
> > re-uses the existing ref if so, otherwise it grants the buffer and adds the
> new
> > ref into the table.
> 
> > The backend also maintains a hash of grant refs to addresses and,
> whenever it
> > sees a new ref, it grant maps it and adds the address into the table.
> Otherwise
> > it does a hash lookup and thus has a buffer address it can immediately
> memcpy
> > from.
> >
> > If the frontend wants the backend to release a grant ref (e.g. because it's
> > starting to run out of grant table) then a control message can be used to
> ask
> > for it back, at which point the backend removes the ref from its cache and
> > unmaps it.
> Wouldn't this be somewhat similar to the persistent grants in xen block
> drivers?

Yes, it would, and I'd rather that protocol was also re-worked in this fashion.

> 
> > Using this scheme we allow a guest OS to still use either a zero-copy
> approach
> > if it wishes to do so, or a static pre-grant... or something between
> > (e.g. pre-grant for headers, zero copy for bulk data).
> >
> > Does that sound reasonable?
> Not sure yet, but it looks nice if we can indeed achieve the zero copy part. I
> have two concerns though: a backend could be forced to keep removing refs
> because its cache is always full, with the frontend not being able to reuse
> these pages (subject to its own allocator behavior, in case the assumption above
> isn't satisfied), nullifying the backend's effort in maintaining its table of
> mapped grefs. The other concern is whether those pages (assumed to be reused)
> might leak guest data to the backend (when not used by netfront).

As I said, the protocol already requires the backend to be trusted by the frontend (since grants cannot be revoked, if for no other reason) so information leakage is not a particular concern. What I want to avoid is a protocol that denies any possibility of zero-copy, even in the best case, which is the way things currently are with persistent grants in blkif.

  Paul

> 
> Joao

* Re: [RFC] netif: staging grants for requests
  2017-01-06 19:18   ` Stefano Stabellini
  2017-01-06 20:19     ` Joao Martins
@ 2017-01-09  9:03     ` Paul Durrant
  2017-01-09 18:25       ` Stefano Stabellini
  1 sibling, 1 reply; 17+ messages in thread
From: Paul Durrant @ 2017-01-09  9:03 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Joao Martins, Wei Liu, Andrew Cooper

> -----Original Message-----
> From: Stefano Stabellini [mailto:sstabellini@kernel.org]
> Sent: 06 January 2017 19:18
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: Joao Martins <joao.m.martins@oracle.com>; xen-
> devel@lists.xenproject.org; Andrew Cooper <Andrew.Cooper3@citrix.com>;
> Wei Liu <wei.liu2@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
> Subject: RE: [RFC] netif: staging grants for requests
> 
> On Fri, 6 Jan 2017, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Joao Martins [mailto:joao.m.martins@oracle.com]
> > > Sent: 14 December 2016 18:11
> > > To: xen-devel@lists.xenproject.org
> > > Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
> > > <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul
> Durrant
> > > <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
> > > Subject: [RFC] netif: staging grants for requests
> > >
> > > Hey,
> > >
> > > Back in the Xen hackaton '16 networking session there were a couple of
> ideas
> > > brought up. One of them was about exploring permanently mapped
> grants
> > > between
> > > xen-netback/xen-netfront.
> > >
> > > I started experimenting and came up with sort of a design document (in
> > > pandoc)
> > > on what it would like to be proposed. This is meant as a seed for
> discussion
> > > and also requesting input to know if this is a good direction. Of course, I
> > > am willing to try alternatives that we come up beyond the contents of the
> > > spec, or any other suggested changes ;)
> > >
> > > Any comments or feedback is welcome!
> > >
> >
> > Hi,
> >
> >   Sorry for the delay... I've been OOTO for three weeks.
> >
> >   I like the general approach of pre-granting buffers for RX so that the
> backend can simply memcpy and tell the frontend which buffer a packet
> appears in but IIUC you are proposing use of a single pre-granted area for TX
> also, which would presumably require the frontend to always copy on the TX
> side? I wonder if we might go for a slightly different scheme...
> >
> >   The assumption is that the working set of TX buffers in the guest OS is
> fairly small (which is probably true for a small number of heavily used sockets
> and an OS that uses a slab allocator)...
> >
> >   The guest TX code maintains a hash table of buffer addresses to grant
> refs. When a packet is sent the code looks to see if it has already granted the
> buffer and re-uses the existing ref if so, otherwise it grants the buffer and
> adds the new ref into the table.
> >
> >   The backend also maintains a hash of grant refs to addresses and,
> whenever it sees a new ref, it grant maps it and adds the address into the
> table. Otherwise it does a hash lookup and thus has a buffer address it can
> immediately memcpy from.
> >
> >   If the frontend wants the backend to release a grant ref (e.g. because it's
> starting to run out of grant table) then a control message can be used to ask
> for it back, at which point the backend removes the ref from its cache and
> unmaps it.
> >
> >   Using this scheme we allow a guest OS to still use either a zero-copy
> approach if it wishes to do so, or a static pre-grant... or something between
> (e.g. pre-grant for headers, zero copy for bulk data).
> >
> >   Does that sound reasonable?
> 
> Wouldn't it be better to introduce the new memcpy based scheme for RX
> only?

It would certainly be reasonable to tackle RX first. But I don't want to force a copy in the TX path when it may be possible to avoid it.

> 
> This suggestion increases the total amount of grants and the amount of
> non-packet data simultaneously shared from frontend to backend by
> "accident" (because it happens to be on the same pages that are granted
> to the backend). Am I right? If so, it could be a security concern.
> Let's keep in mind that if we trust the backend then we might as well go
> all the way and assume that the backend is in Dom0 and can map any mfn
> in memory without requiring grants. That would be a simple extension
> guaranteed to have great performance (I am in favor of this FYI).

I'm not sure that we want to do away with grants altogether. In my scheme the frontend can ask for its grants back so that it doesn't leak, but if it's willing to place more trust in the backend then it can grant pages from the stack which may contain other information. Certainly bypassing grants would be a quick performance boost but it's an all-or-nothing approach, whereas I'm shooting for something more flexible.

  Paul
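
As a concrete illustration of the TX-side scheme discussed above, the sketch
below shows how a frontend might cache "buffer page -> grant ref" mappings,
reusing an existing ref on a hit and falling back to a copy into a pre-granted
staging area when the working set outgrows the cache. This is a minimal sketch
for illustration only: grant_page(), copy_to_bounce(), the table size and the
probing scheme are all assumptions, not existing grant-table API or
xen-netfront code.

    #include <stddef.h>

    /*
     * Frontend TX cache: page -> grant ref.  Open-addressed table with
     * linear probing; entries are never evicted in this sketch.
     */
    #define TX_GREF_CACHE_SIZE 256                /* arbitrary, power of two */

    struct tx_gref_entry {
        unsigned long pfn;                        /* page backing the buffer */
        int gref;                                 /* grant ref, -1 if unused */
    };

    /* GCC range initializer, as commonly used in kernel code. */
    static struct tx_gref_entry tx_cache[TX_GREF_CACHE_SIZE] = {
        [0 ... TX_GREF_CACHE_SIZE - 1] = { .gref = -1 },
    };

    /* Hypothetical stand-ins for granting a page and for copying into a
     * statically pre-granted staging slot (returning that slot's gref). */
    extern int grant_page(unsigned long pfn);
    extern int copy_to_bounce(const void *data, size_t len);

    static unsigned int hash_pfn(unsigned long pfn)
    {
        return (unsigned int)(pfn * 2654435761UL) & (TX_GREF_CACHE_SIZE - 1);
    }

    /*
     * Returns the gref to place in the TX request; *copied is set when the
     * payload had to be bounced instead of referenced in place.
     */
    int tx_get_gref(unsigned long pfn, const void *data, size_t len, int *copied)
    {
        unsigned int i, idx = hash_pfn(pfn);

        *copied = 0;
        for (i = 0; i < TX_GREF_CACHE_SIZE; i++) {
            struct tx_gref_entry *e =
                &tx_cache[(idx + i) & (TX_GREF_CACHE_SIZE - 1)];

            if (e->gref >= 0 && e->pfn == pfn)
                return e->gref;                   /* hit: reuse existing ref */
            if (e->gref < 0) {                    /* miss: grant and cache   */
                e->pfn = pfn;
                e->gref = grant_page(pfn);
                return e->gref;
            }
        }

        /* Cache full: working set larger than assumed, fall back to a copy
         * into the pre-granted staging area (the original proposal). */
        *copied = 1;
        return copy_to_bounce(data, len);
    }

A real frontend would also need to evict entries in step with the release
control message discussed in the thread, and could keep granting per packet
for pages where zero-copy is preferred.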



* Re: [RFC] netif: staging grants for requests
  2017-01-09  8:56     ` Paul Durrant
@ 2017-01-09 13:01       ` Joao Martins
  0 siblings, 0 replies; 17+ messages in thread
From: Joao Martins @ 2017-01-09 13:01 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, Stefano Stabellini, Wei Liu, Andrew Cooper

On 01/09/2017 08:56 AM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Joao Martins [mailto:joao.m.martins@oracle.com]
>> Sent: 06 January 2017 20:09
>> To: Paul Durrant <Paul.Durrant@citrix.com>
>> Cc: xen-devel@lists.xenproject.org; Andrew Cooper
>> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Stefano
>> Stabellini <sstabellini@kernel.org>
>> Subject: Re: [RFC] netif: staging grants for requests
>>
>> On 01/06/2017 09:33 AM, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Joao Martins [mailto:joao.m.martins@oracle.com]
>>>> Sent: 14 December 2016 18:11
>>>> To: xen-devel@lists.xenproject.org
>>>> Cc: David Vrabel <david.vrabel@citrix.com>; Andrew Cooper
>>>> <Andrew.Cooper3@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Paul
>> Durrant
>>>> <Paul.Durrant@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>
>>>> Subject: [RFC] netif: staging grants for requests
>>>>
>>>> Hey,
>>>>
>>>> Back in the Xen hackaton '16 networking session there were a couple of
>> ideas
>>>> brought up. One of them was about exploring permanently mapped
>> grants
>>>> between
>>>> xen-netback/xen-netfront.
>>>>
>>>> I started experimenting and came up with sort of a design document (in
>>>> pandoc)
>>>> on what it would like to be proposed. This is meant as a seed for
>> discussion
>>>> and also requesting input to know if this is a good direction. Of course, I
>>>> am willing to try alternatives that we come up beyond the contents of the
>>>> spec, or any other suggested changes ;)
>>>>
>>>> Any comments or feedback is welcome!
>>>>
>>>
>>> Hi,
>> Hey!
>>
>>>
>>> Sorry for the delay... I've been OOTO for three weeks.
>> Thanks for the comments!
>>
>>> I like the general approach of pre-granting buffers for RX so that the
>> backend
>>> can simply memcpy and tell the frontend which buffer a packet appears in
>> Cool,
>>
>>> but IIUC you are proposing use of a single pre-granted area for TX also,
>> which would
>>> presumably require the frontend to always copy on the TX side? I wonder if
>> we
>>> might go for a slightly different scheme...
>> I see.
>>
>>>
>>> The assumption is that the working set of TX buffers in the guest OS is fairly
>>> small (which is probably true for a small number of heavily used sockets
>> and an
>>> OS that uses a slab allocator)...
>> Hmm, [speaking about Linux] maybe for the skb allocation cache. For the
>> remaining packet pages maybe not, say for a scatter-gather list...? But I
>> guess it would need to be validated whether this working set is indeed kept
>> small, as this seems like a very strong assumption given the variety of
>> possible workloads. Plus wouldn't we leak info from these pages if they
>> weren't used on the device but rather elsewhere in the guest stack?
> 
> Yes, potentially there is an information leak but I am assuming that the backend
> is also trusted by the frontend, which is pretty well baked into the protocol
> anyway.
I assumed the same - just thought it was worth clarifying.

> Also, if the working set (which is going to be OS/stack dependent) turned
> out to be a bit too large then the frontend can always fall back to a copy into a
> locally allocated buffer, as in your proposal, anyway.
Yeap.

>>> The guest TX code maintains a hash table of buffer addresses to grant refs.
>> When
>>> a packet is sent the code looks to see if it has already granted the buffer
>> and
>>> re-uses the existing ref if so, otherwise it grants the buffer and adds the
>> new
>>> ref into the table.
>>
>>> The backend also maintains a hash of grant refs to addresses and,
>> whenever it
>>> sees a new ref, it grant maps it and adds the address into the table.
>> Otherwise
>>> it does a hash lookup and thus has a buffer address it can immediately
>> memcpy
>>> from.
>>>
>>> If the frontend wants the backend to release a grant ref (e.g. because it's
>>> starting to run out of grant table) then a control message can be used to
>> ask
>>> for it back, at which point the backend removes the ref from its cache and
>>> unmaps it.
>> Wouldn't this be somewhat similar to the persistent grants in xen block
>> drivers?
> 
> Yes, it would, and I'd rather that protocol was also re-worked in this fashion.
I guess then I could reuse part of my old series (persistent grants) in this
reworked fashion you suggest. I didn't go that route as I had the (apparently
wrong) impression that a persistent-grants-based approach was undesirable (as I
took it from past sessions).

>>> Using this scheme we allow a guest OS to still use either a zero-copy
>> approach
>>> if it wishes to do so, or a static pre-grant... or something between
>>> (e.g. pre-grant for headers, zero copy for bulk data).
>>>
>>> Does that sound reasonable?
>> Not sure yet, but it looks nice if we can indeed achieve the zero-copy part.
>> I have two concerns though: first, a backend could be forced to keep removing
>> refs because its cache is always full, leaving the frontend unable to reuse
>> these pages (subject to its own allocator behavior, should the assumption
>> above not hold), nullifying the backend's effort in maintaining its table of
>> mapped grefs. The other concern is whether those pages (assumed to be reused)
>> might leak guest data to the backend (when not in use by netfront).
> 
> As I said, the protocol already requires the backend to be trusted by the frontend
> (since grants cannot be revoked, if for no other reason) so information leakage is
> not a particular concern. What I want to avoid is a protocol that denies any
> possibility of zero-copy, even in the best case, which is the way things currently
> are with persistent grants in blkif.
I'll be more reassured once I've verified that this frontend assumption holds
for a meaningful percentage of the sent pages; but you've still got a valid
point that the protocol shouldn't force the guest OS into always doing a copy.
Hence I like having a control message to co-manage this grefs table.

Joao
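
As a companion illustration of the backend side and of the control message the
thread converges on, the sketch below keeps a cache of long-lived grant
mappings keyed by gref, maps a ref the first time it is seen, memcpys packet
data out of the cached mapping, and drops the mapping when the frontend asks
for its grant back. Everything here is hypothetical: the
xennet_ctrl_release_gref layout, the cache structure and
map_gref()/unmap_gref() are assumptions for illustration, not part of the
existing netif protocol or of netback code.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical control message: frontend asks for a grant ref back,
     * e.g. because it is running short of grant-table entries. */
    struct xennet_ctrl_release_gref {
        uint16_t type;        /* e.g. a new CTRL_TYPE_RELEASE_GREF value */
        uint16_t pad;
        uint32_t gref;        /* ref the backend should unmap and forget */
    };

    /* Backend cache of long-lived grant mappings: gref -> virtual address. */
    #define BE_GREF_CACHE_SIZE 256

    struct be_gref_entry {
        uint32_t gref;
        void *vaddr;          /* NULL means the slot is free */
    };

    static struct be_gref_entry be_cache[BE_GREF_CACHE_SIZE];

    /* Hypothetical stand-ins for grant map/unmap of a single ref. */
    extern void *map_gref(uint32_t gref);
    extern void unmap_gref(void *vaddr);

    /* Search only; a real backend would use a hash table rather than a
     * linear scan, but this keeps the sketch free of deletion subtleties. */
    static struct be_gref_entry *be_find(uint32_t gref)
    {
        unsigned int i;

        for (i = 0; i < BE_GREF_CACHE_SIZE; i++)
            if (be_cache[i].vaddr && be_cache[i].gref == gref)
                return &be_cache[i];
        return NULL;
    }

    /* Look up a ref, mapping and caching it the first time it is seen. */
    static struct be_gref_entry *be_lookup_or_map(uint32_t gref)
    {
        struct be_gref_entry *e = be_find(gref);
        unsigned int i;

        if (e)
            return e;
        for (i = 0; i < BE_GREF_CACHE_SIZE; i++) {
            if (!be_cache[i].vaddr) {
                be_cache[i].gref = gref;
                be_cache[i].vaddr = map_gref(gref);
                return &be_cache[i];
            }
        }
        return NULL;          /* cache full: caller falls back to grant copy */
    }

    /* TX processing: memcpy straight out of the long-lived mapping. */
    void be_copy_tx_payload(uint32_t gref, size_t off, void *dst, size_t len)
    {
        struct be_gref_entry *e = be_lookup_or_map(gref);

        if (e)
            memcpy(dst, (const char *)e->vaddr + off, len);
        /* else: fall back to a GNTTABOP_copy as today */
    }

    /* The frontend asked for its grant back: unmap and forget it. */
    void be_handle_release(const struct xennet_ctrl_release_gref *msg)
    {
        struct be_gref_entry *e = be_find(msg->gref);

        if (e) {
            unmap_gref(e->vaddr);
            e->vaddr = NULL;
        }
    }

The split of responsibilities is the point: the backend's long-lived mappings
are purely an optimisation it can drop on request, while the frontend decides
how much grant-table footprint to trade for copies, which is the flexibility
argued for above.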



* Re: [RFC] netif: staging grants for requests
  2017-01-09  9:03     ` Paul Durrant
@ 2017-01-09 18:25       ` Stefano Stabellini
  0 siblings, 0 replies; 17+ messages in thread
From: Stefano Stabellini @ 2017-01-09 18:25 UTC (permalink / raw)
  To: Paul Durrant
  Cc: xen-devel, Joao Martins, Stefano Stabellini, Wei Liu, Andrew Cooper

On Mon, 9 Jan 2017, Paul Durrant wrote:
> > Wouldn't it be better to introduce the new memcpy based scheme for RX
> > only?
> 
> It would certainly be reasonable to tackle RX first. But I don't want to force a copy in the TX path when it may be possible to avoid it.
> 
> > 
> > This suggestion increases the total amount of grants and the amount of
> > non-packet data simultaneously shared from frontend to backend by
> > "accident" (because it happens to be on the same pages that are granted
> > to the backend). Am I right? If so, it could be a security concern.
> > Let's keep in mind that if we trust the backend then we might as well go
> > all the way and assume that the backend is in Dom0 and can map any mfn
> > in memory without requiring grants. That would be a simple extension
> > guaranteed to have great performance (I am in favor of this FYI).
> 
> I'm not sure that we want to do away with grants altogether. In my scheme the frontend can ask for its grants back so that it doesn't leak, but if it's willing to place more trust in the backend then it can grant pages from the stack which may contain other information. Certainly bypassing grants would be a quick performance boost but it's an all-or-nothing approach, whereas I'm shooting for something more flexible.

I understand the intent but this is a suboptimal choice. If we do trust
the backend, then we can do better by using MFNs instead of grants. It
is simpler and more efficient. I prefer to make decisions based on
numbers, rather than gut feeling, but I bet that the MFN solution would
be significantly faster. If we do not trust the backend, then we would
be better off with a pure memcpy approach. This half-way proposal tries
to please both the security crowd and the performance crowd but leaves
them both wanting.



end of thread, other threads:[~2017-01-09 18:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-14 18:11 [RFC] netif: staging grants for requests Joao Martins
2017-01-04 13:54 ` Wei Liu
2017-01-05 20:27   ` Joao Martins
2017-01-04 19:40 ` Stefano Stabellini
2017-01-05 11:54   ` Wei Liu
2017-01-05 20:27   ` Joao Martins
2017-01-06  0:30     ` Stefano Stabellini
2017-01-06 17:13       ` Joao Martins
2017-01-06 19:02         ` Stefano Stabellini
2017-01-06  9:33 ` Paul Durrant
2017-01-06 19:18   ` Stefano Stabellini
2017-01-06 20:19     ` Joao Martins
2017-01-09  9:03     ` Paul Durrant
2017-01-09 18:25       ` Stefano Stabellini
2017-01-06 20:08   ` Joao Martins
2017-01-09  8:56     ` Paul Durrant
2017-01-09 13:01       ` Joao Martins
