From: Wei Liu
Subject: Re: Interesting observation with network event notification and batching
Date: Fri, 28 Jun 2013 17:15:42 +0100
Message-ID: <20130628161542.GF16643@zion.uk.xensource.com>
In-Reply-To: <20130612101451.GF2765@zion.uk.xensource.com>
To: xen-devel@lists.xen.org
Cc: wei.liu2@citrix.com, ian.campbell@citrix.com, stefano.stabellini@eu.citrix.com, annie.li@oracle.com, andrew.bennieston@citrix.com

Hi all,

After collecting more stats and comparing the copying and mapping cases, I
now have some more interesting findings, which might contradict what I said
before.

I tuned the runes I use for benchmarking to make sure iperf and netperf
generate large packets (~64K). Here are the runes I use:

  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k    (see note)
  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072

                     COPY       MAP
  iperf Tput:        6.5Gb/s    14Gb/s (was 2.5Gb/s)
  PPI                2.90       1.07
  SPI                37.75      13.69
  PPN                2.90       1.07
  SPN                37.75      13.69
  tx_count           31808      174769
  nr_napi_schedule   31805      174697
  total_packets      92354      187408
  total_reqs         1200793    2392614

                     COPY       MAP
  netperf Tput:      5.8Gb/s    10.5Gb/s
  PPI                2.13       1.00
  SPI                36.70      16.73
  PPN                2.13       1.31
  SPN                36.70      16.75
  tx_count           57635      205599
  nr_napi_schedule   57633      205311
  total_packets      122800     270254
  total_reqs         2115068    3439751

  PPI: packets processed per interrupt
  SPI: slots processed per interrupt
  PPN: packets processed per NAPI schedule
  SPN: slots processed per NAPI schedule
  tx_count: interrupt count
  nr_napi_schedule: NAPI schedule count
  total_packets: total packets processed during the test
  total_reqs: total slots used during the test

* Notification and batching

Are notification and batching really a problem? I'm not so sure now. Before
I measured PPI / PPN / SPI / SPN in the copying case, my first thought was
that "in that case netback *must* have better batching", which turned out
not to be quite true -- copying mode does make netback slower, but the extra
batching it gains is not huge.

Ideally we still want to batch as much as possible. One possible knob is the
'weight' parameter in NAPI (a trivial illustration is at the end of this
mail). But as the figures show, batching does not seem to be very important
for throughput, at least for now. If the NAPI framework and netfront /
netback are doing their jobs as designed, we might not need to worry about
this for the moment.

Andrew, do you have any thoughts on this? You found that NAPI didn't scale
well with multi-threaded iperf in DomU -- do you have any idea how that can
happen?

* Thoughts on zero-copy TX

With this hack we are able to achieve 10Gb/s with a single stream, which is
good. But the classic XenoLinux kernel, which has zero-copy TX, was not able
to achieve this. I also developed another zero-copy netback prototype a year
ago on top of Ian's out-of-tree skb frag destructor patch series. That
prototype couldn't reach 10Gb/s either (IIRC its performance was more or
less the same as copying mode, about 6~7Gb/s).

My hack maps all necessary pages permanently and never unmaps them, so we
skip lots of page table manipulation and TLB flushes. My basic conclusion is
therefore that page table manipulation and TLB flushes do incur a heavy
performance penalty. There is no way this hack can be upstreamed, though. If
we're to re-introduce zero-copy TX, we would need to implement some sort of
lazy flushing mechanism. I haven't thought this through.
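For illustration, below is roughly what the hack does on the map side. This
is only a minimal sketch with made-up names (persist_map, struct
persist_slot), not the actual patch; the point is simply that there is no
unmap, and therefore no page table teardown or TLB flush, on the data path.

  /*
   * Sketch only, NOT the actual hack.  Map a frontend grant once and
   * keep it mapped forever.
   */
  #include <linux/errno.h>
  #include <xen/grant_table.h>
  #include <xen/interface/grant_table.h>
  #include <asm/xen/hypercall.h>

  struct persist_slot {
          grant_ref_t ref;
          grant_handle_t handle;
          void *vaddr;            /* backend-side mapping, kept around */
  };

  static int persist_map(domid_t otherend, grant_ref_t ref, void *vaddr,
                         struct persist_slot *slot)
  {
          struct gnttab_map_grant_ref op;

          gnttab_set_map_op(&op, (unsigned long)vaddr, GNTMAP_host_map,
                            ref, otherend);

          if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
                  return -EFAULT;
          if (op.status != GNTST_okay)
                  return -EINVAL;

          /* Deliberately no matching GNTTABOP_unmap_grant_ref on the
           * data path: no page table manipulation, no TLB flush. */
          slot->ref = ref;
          slot->handle = op.handle;
          slot->vaddr = vaddr;
          return 0;
  }

A real implementation would of course need to reclaim mappings eventually,
which is exactly where the lazy flushing / reclaim idea comes in.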
Presumably such a lazy flushing mechanism would also benefit blk somehow?
I'm not sure yet. Could persistent mapping (with the to-be-developed reclaim
/ MRU list mechanism) be useful here, so that we can unify the blk and net
drivers?

* Changes required to introduce zero-copy TX

1. SKB frag destructor series: to track the life cycle of SKB frags. This
   is not yet upstreamed.

2. A mechanism to negotiate the maximum number of slots the frontend can
   use: mapping requires that the backend's MAX_SKB_FRAGS >= the frontend's
   MAX_SKB_FRAGS. (A rough sketch of what I mean is at the end of this
   mail.)

3. Lazy flushing mechanism or persistent grants: ???

Wei.

* Note

In my previous tests I only ran iperf and didn't have the right rune to
generate large packets. iperf seems to increase its packet size as the test
goes on. In the copying case the packet size eventually reached 64K, while
in the mapping case something odd happened (I believe that must be due to a
bug in my hack :-/) -- the packet size stayed at the default (8K). Adding
'-l 131072' to the iperf rune makes sure the packets are always 64K.
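* Sketch for item 2 (max slots negotiation)

Just to make item 2 above a bit more concrete, something along these lines
could work, using the usual xenbus_printf / xenbus_scanf helpers. The
xenstore key name "max-tx-slots" and the demo_* function names are made up
for illustration; the real protocol would need proper discussion.

  #include <linux/kernel.h>
  #include <linux/errno.h>
  #include <xen/xenbus.h>

  /* Backend: advertise how many slots per packet it can handle. */
  static int demo_backend_advertise(struct xenbus_device *dev,
                                    unsigned int slots)
  {
          return xenbus_printf(XBT_NIL, dev->nodename,
                               "max-tx-slots", "%u", slots);
  }

  /* Frontend: never use more slots than the backend advertises,
   * falling back to the local limit if the key is absent. */
  static unsigned int demo_frontend_max_slots(struct xenbus_device *dev,
                                              unsigned int local_max)
  {
          unsigned int slots;

          if (xenbus_scanf(XBT_NIL, dev->otherend,
                           "max-tx-slots", "%u", &slots) != 1)
                  return local_max;

          return min(slots, local_max);
  }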
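* Sketch for the NAPI 'weight' knob

And the trivial illustration of the 'weight' parameter mentioned in the
batching section, again with made-up demo_* names -- this is just the
generic netif_napi_add() API, not how netfront / netback wires it up today.

  #include <linux/netdevice.h>

  /* Dummy poll function to make the example complete. */
  static int demo_poll(struct napi_struct *napi, int budget)
  {
          int done = 0;

          /* ... process up to 'budget' packets here, counting them ... */

          if (done < budget)
                  napi_complete(napi);
          return done;
  }

  static void demo_setup_napi(struct net_device *dev,
                              struct napi_struct *napi)
  {
          /* Default weight is 64; a larger budget lets each
           * napi_schedule process more packets before yielding
           * (recent kernels complain about values above 64). */
          netif_napi_add(dev, napi, demo_poll, 256);
  }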