From: Joao Martins <joao.m.martins@oracle.com>
To: Wei Liu <wei.liu2@citrix.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Paul Durrant <paul.durrant@citrix.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	David Vrabel <david.vrabel@citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [RFC] netif: staging grants for requests
Date: Thu, 5 Jan 2017 20:27:07 +0000	[thread overview]
Message-ID: <586EAC1B.2000905@oracle.com> (raw)
In-Reply-To: <20170104135456.GM13806@citrix.com>

On 01/04/2017 01:54 PM, Wei Liu wrote:
> Hey!
Hey!

> Thanks for writing this detailed document!
Thanks a lot for the review and comments!

> 
> On Wed, Dec 14, 2016 at 06:11:12PM +0000, Joao Martins wrote:
>> Hey,
>>
>> Back in the Xen hackathon '16 networking session a couple of ideas were
>> brought up. One of them was about exploring permanently mapped grants between
>> xen-netback/xen-netfront.
>>
>> I started experimenting and came up with a sort of design document (in pandoc)
>> describing what I would like to propose. This is meant as a seed for discussion
>> and a request for input on whether this is a good direction. Of course, I am
>> willing to try alternatives that we come up with beyond the contents of the
>> spec, or any other suggested changes ;)
>>
>> Any comments or feedback is welcome!
>>
>> Cheers,
>> Joao
>>
>> ---
>> % Staging grants for network I/O requests
>> % Joao Martins <<joao.m.martins@oracle.com>>
>> % Revision 1
>>
>> \clearpage
>>
>> --------------------------------------------------------------------
>> Status: **Experimental**
>>
>> Architecture(s): x86 and ARM
>>
> 
> Any.
OK.

> 
>> Component(s): Guest
>>
>> Hardware: Intel and AMD
> 
> No need to specify this.
OK.

> 
>> --------------------------------------------------------------------
>>
>> # Background and Motivation
>>
> 
> I skimmed through the middle -- I think your description of transmissions
> in both directions is accurate.
> 
> The proposal to replace some steps with explicit memcpy is also
> sensible.
Glad to hear that!
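
To make that part a bit more concrete, below is a minimal sketch of the backend
TX fast path I have in mind, assuming a staging area whose grants the frontend
shares once at connect time and the backend keeps mapped. All names are
illustrative, not the actual xen-netback code:

```c
/*
 * Illustrative only -- not the real xen-netback code. The staging area
 * is a region the frontend grants once at connect time and the backend
 * keeps mapped, so small payloads can be pulled out with a plain memcpy
 * instead of a GNTTABOP_copy per request.
 */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

struct staging_area {
    uint8_t *base;      /* permanently granted + mapped region */
    size_t   slot_size; /* negotiated data-len, e.g. 256 bytes per ring slot */
};

/*
 * Pull a small TX payload for ring slot @idx out of the staging area.
 * Returns 0 on success, -1 if the payload does not fit and the backend
 * must fall back to grant copy/map of the original gref.
 */
static int staging_pull(const struct staging_area *st, unsigned int idx,
                        void *dst, size_t len)
{
    if (len > st->slot_size)
        return -1;

    memcpy(dst, st->base + (size_t)idx * st->slot_size, len);
    return 0;
}
```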

> 
>> \clearpage
>>
>> ## Performance
>>
>> Numbers that give a rough idea of the performance benefits of this extension.
>> These are Guest <-> Dom0 measurements that exercise the communication between
>> backend and frontend, excluding other bottlenecks in the datapath (the software
>> switch).
>>
>> ```
>> # grant copy
>> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
>> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
>> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
>> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
>>
>> # grant copy + grant map (see next subsection)
>> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
>> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
>>
>> # drop at the guest network stack
>> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
>> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
>> ```
>>
>> With this extension:
>> ```
>> # memcpy
>> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
>> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> 
> This basically means we can almost get line rate on a 10Gb link.
> 
> It is already a good result. I'm interested in knowing if there is
> any possibility to approach 40 or 100 Gb/s?
Certainly: with bulk transfer we can already saturate a 40 Gbit/s NIC, sending
out from a guest to an external host. I also got ~80 Gbit/s, but that was between
guests on the same host (some time ago, back in Xen 4.7). 100 Gbit/s is also on
my radar.
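
For reference, the standard Ethernet framing math: a 64-byte frame occupies 84
bytes on the wire (frame + preamble + inter-frame gap), so 10 Gbit/s line rate
is 10^9 * 10 / (84 * 8) ~= 14.88 Mpps; the 12.4 Mpps above is roughly 83% of
that, while 40 Gbit/s would need ~59.5 Mpps and 100 Gbit/s ~148.8 Mpps.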

The problem comes with smaller packets <= MTU (and request/response workloads
with small payloads), and that is where we lack performance. Especially for
workloads with very small packets, Linux has a hard time saturating those NICs
(with XDP now rising to the challenge); I think only DPDK is able to at this
point [*].

[*] Section 7.1,
https://download.01.org/packet-processing/ONPS2.1/Intel_ONP_Release_2.1_Performance_Test_Report_Rev1.0.pdf

> It would be good if we design this extension with higher goals in mind.
Totally agree!

>> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
>> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
>>
>> # memcpy + grant map (see next subsection)
>> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
>> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
>>
>> # (drop at the guest network stack)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
>>
>> # (drop with guest XDP_DROP prog)
>> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
>> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
>> ```
>>
>> Latency measurements (netperf TCP_RR request size 1 and response size 1):
>> ```
>> 24 KTps vs 28 KTps
>> 39 KTps vs 50 KTps (with kernel busy poll)
>> ```
>>
>> TCP bulk transfer measurements aren't showing a representative increase in
>> maximum throughput (sometimes ~10%), but rather fewer retransmissions and a
>> more stable throughput. This is probably because of a slight decrease in RTT
>> (i.e. the receiver acknowledging data quicker). I am currently exploring other
>> data list sizes and will probably have a better idea of the effects of this.
>>
>> ## Linux grant copy vs map remark
>>
>> Based on the numbers above there's a sudden 2x performance drop when we switch
>> from grant copy to also grant-mapping the `gref`: 1 295 001 vs 577 782 pps for
>> 256- and 260-byte packets respectively. This is all the more visible when the
>> grant copy is replaced with memcpy in this extension (3 248 392 vs 588 428).
>> While there have been discussions about avoiding the TLB flush on unmap, one
>> could wonder what the threshold of that improvement would be. Chances are that
>> this is the least of our concerns on a fully populated host (or an
>> oversubscribed one). Would it be worth experimenting with increasing the copy
>> threshold beyond the header?
>>
> 
> Yes, it would be interesting to see more data points and provide a
> sensible default. But I think this is a secondary goal because a "sensible
> default" can change over time and across environments.
Indeed; I am experimenting with more data points and other workloads to add here.
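
FWIW, the experiment I was alluding to is essentially a tunable split point:
memcpy everything up to a threshold and only grant-map whatever tail is left.
A rough sketch, with a hypothetical helper rather than the real xen-netback
logic:

```c
#include <stddef.h>

/*
 * Hypothetical helper, illustrative only: given a request of @len bytes
 * and a copy threshold (today the header size, possibly more), decide
 * how many bytes get copied and how many get grant-mapped.
 */
static void split_copy_map(size_t len, size_t copy_threshold,
                           size_t *copy_len, size_t *map_len)
{
    *copy_len = len < copy_threshold ? len : copy_threshold;
    *map_len  = len - *copy_len;
}
```

With copy_threshold = 256, for example, a 260-byte packet would be 256 bytes
copied plus 4 bytes mapped, which is exactly the case where the 2x drop shows
up above.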

>> \clearpage
>>
>> # History
>>
>> A table of changes to the document, in chronological order.
>>
>> ------------------------------------------------------------------------
>> Date       Revision Version  Notes
>> ---------- -------- -------- -------------------------------------------
>> 2016-12-14 1        Xen 4.9  Initial version.
>> ---------- -------- -------- -------------------------------------------
