From: Stefano Stabellini
Subject: Re: Interesting observation with network event notification and batching
Date: Mon, 1 Jul 2013 15:54:50 +0100
In-Reply-To: <20130701143919.GG7483@zion.uk.xensource.com>
References: <20130612101451.GF2765@zion.uk.xensource.com>
 <20130628161542.GF16643@zion.uk.xensource.com>
 <51D13456.1040609@oracle.com>
 <20130701085436.GA7483@zion.uk.xensource.com>
 <20130701143919.GG7483@zion.uk.xensource.com>
To: Wei Liu
Cc: ian.campbell@citrix.com, Stefano Stabellini, xen-devel@lists.xen.org,
 annie li, andrew.bennieston@citrix.com
List-Id: xen-devel@lists.xenproject.org

On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> > On Mon, 1 Jul 2013, Wei Liu wrote:
> > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > > >
> > > > On 2013-6-29 0:15, Wei Liu wrote:
> > > > >Hi all,
> > > > >
> > > > >After collecting more stats and comparing the copying / mapping cases,
> > > > >I now have some more interesting findings, which might contradict what
> > > > >I said before.
> > > > >
> > > > >I tuned the runes I used for benchmarking to make sure iperf and
> > > > >netperf generate large packets (~64K). Here are the runes I use:
> > > > >
> > > > >  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > > >  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > > >
> > > > >                         COPY        MAP
> > > > >iperf Tput:              6.5Gb/s     14Gb/s (was 2.5Gb/s)
> > > >
> > > > So with the default iperf settings, copy is about 7.9G, and map is
> > > > about 2.5G? How about the result of netperf without large packets?
> > > >
> > >
> > > First question, yes.
> > >
> > > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > > large packets the throughput is more or less the same.
> > >
> > > > >        PPI              2.90        1.07
> > > > >        SPI              37.75       13.69
> > > > >        PPN              2.90        1.07
> > > > >        SPN              37.75       13.69
> > > > >        tx_count         31808       174769
> > > >
> > > > Seems the interrupt count does not affect the performance at all with
> > > > -l 131072 -w 128k.
> > > >
> > >
> > > Right.
> > >
> > > > >        nr_napi_schedule 31805       174697
> > > > >        total_packets    92354       187408
> > > > >        total_reqs       1200793     2392614
> > > > >
> > > > >netperf Tput:            5.8Gb/s     10.5Gb/s
> > > > >        PPI              2.13        1.00
> > > > >        SPI              36.70       16.73
> > > > >        PPN              2.13        1.31
> > > > >        SPN              36.70       16.75
> > > > >        tx_count         57635       205599
> > > > >        nr_napi_schedule 57633       205311
> > > > >        total_packets    122800      270254
> > > > >        total_reqs       2115068     3439751
> > > > >
> > > > > PPI: packets processed per interrupt
> > > > > SPI: slots processed per interrupt
> > > > > PPN: packets processed per napi schedule
> > > > > SPN: slots processed per napi schedule
> > > > > tx_count: interrupt count
> > > > > total_reqs: total slots used during test
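For reference, the derived figures above appear to be simple ratios of the
raw counters. Taking the copying-mode iperf column as an example:

  PPI = total_packets / tx_count          = 92354   / 31808 ~= 2.90
  SPI = total_reqs    / tx_count          = 1200793 / 31808 ~= 37.75
  PPN = total_packets / nr_napi_schedule  = 92354   / 31805 ~= 2.90
  SPN = total_reqs    / nr_napi_schedule  = 1200793 / 31805 ~= 37.75

So a higher PPI / SPI just means that more packets / slots are amortised over
each interrupt or NAPI pass.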
> > > > >* Notification and batching
> > > > >
> > > > >Is notification and batching really a problem? I'm not so sure now. My
> > > > >first thought when I didn't measure PPI / PPN / SPI / SPN in the
> > > > >copying case was that "in that case netback *must* have better
> > > > >batching", which turned out not to be very true -- copying mode makes
> > > > >netback slower, however the batching gained is not huge.
> > > > >
> > > > >Ideally we still want to batch as much as possible. Possible ways
> > > > >include playing with the 'weight' parameter in NAPI. But as the
> > > > >figures show, batching seems not to be very important for throughput,
> > > > >at least for now. If the NAPI framework and netfront / netback are
> > > > >doing their jobs as designed we might not need to worry about this
> > > > >now.
> > > > >
> > > > >Andrew, do you have any thoughts on this? You found out that NAPI
> > > > >didn't scale well with multi-threaded iperf in DomU, do you have any
> > > > >handle on how that can happen?
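To make the 'weight' knob mentioned above concrete, here is a minimal,
generic NAPI poll skeleton; this is not the actual netfront / netback code,
and the mydev_* names are made up for illustration. In kernels of this era
the weight passed to netif_napi_add() is the per-poll budget, i.e. the upper
bound on how many packets one napi schedule may batch before yielding:

#include <linux/kernel.h>
#include <linux/netdevice.h>

struct mydev_priv {
        struct napi_struct napi;
        /* ... ring pointers, event channel, etc. ... */
};

/* Hypothetical helpers standing in for the real RX processing. */
bool mydev_rx_pending(struct mydev_priv *priv);
void mydev_process_one(struct mydev_priv *priv);
void mydev_enable_interrupts(struct mydev_priv *priv);

static int mydev_poll(struct napi_struct *napi, int budget)
{
        struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
        int work_done = 0;

        /* Consume at most 'budget' packets in this poll pass. */
        while (work_done < budget && mydev_rx_pending(priv)) {
                mydev_process_one(priv);
                work_done++;
        }

        /* Ran out of work before hitting the budget: leave polling mode
         * and re-enable the interrupt / event channel. */
        if (work_done < budget) {
                napi_complete(napi);
                mydev_enable_interrupts(priv);
        }

        return work_done;
}

static void mydev_init_napi(struct net_device *dev, struct mydev_priv *priv)
{
        /* The last argument is the NAPI weight, i.e. the per-poll budget;
         * 64 is the conventional default. Raising it allows larger batches
         * per napi schedule, at some cost in latency and fairness. */
        netif_napi_add(dev, &priv->napi, mydev_poll, 64);
}

Whether raising the weight actually helps depends on how much work is
pending on the ring when the poll runs.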
> > > > >* Thoughts on zero-copy TX
> > > > >
> > > > >With this hack we are able to achieve 10Gb/s single stream, which is
> > > > >good. But with the classic XenoLinux kernel, which has zero-copy TX,
> > > > >we weren't able to achieve this. I also developed another zero-copy
> > > > >netback prototype one year ago with Ian's out-of-tree skb frag
> > > > >destructor patch series. That prototype couldn't achieve 10Gb/s either
> > > > >(IIRC the performance was more or less the same as copying mode, about
> > > > >6~7Gb/s).
> > > > >
> > > > >My hack maps all necessary pages permanently and there is no unmap, so
> > > > >we skip lots of page table manipulation and TLB flushes. So my basic
> > > > >conclusion is that page table manipulation and TLB flushes do incur a
> > > > >heavy performance penalty.
> > > > >
> > > > >There is no way this hack can be upstreamed. If we're to re-introduce
> > > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > > >mechanism. I haven't thought this through. Presumably this mechanism
> > > > >would also benefit blk somehow? I'm not sure yet.
> > > > >
> > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > > >mechanism) be useful here? So that we can unify the blk and net
> > > > >drivers?
> > > > >
> > > > >* Changes required to introduce zero-copy TX
> > > > >
> > > > >1. SKB frag destructor series: to track the life cycle of SKB frags.
> > > > >This is not yet upstreamed.
> > > >
> > > > Are you mentioning this one?
> > > > http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
> > > >
> > >
> > > Yes. But I believe there have been several versions posted. The link you
> > > have is not the latest version.
> > >
> > > > >2. Mechanism to negotiate the max slots the frontend can use: mapping
> > > > >requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.
> > > > >
> > > > >3. Lazy flushing mechanism or persistent grants: ???
> > > >
> > > > I did some tests with persistent grants before, and they did not show
> > > > better performance than grant copy. But I was using the default params
> > > > of netperf and did not try large packet sizes. Your results remind me
> > > > that maybe persistent grants would get similar results with larger
> > > > packet sizes too.
> > > >
> > >
> > > "No better performance" -- that's because both mechanisms are copying?
> > > However I presume persistent grants can scale better? From an earlier
> > > email last week, I read that copying is done by the guest, so this
> > > mechanism scales much better than hypervisor copying in blk's case.
> >
> > Yes, I always expected persistent grants to be faster than gnttab_copy,
> > but I was very surprised by the difference in performance:
> >
> > http://marc.info/?l=xen-devel&m=137234605929944
> >
> > I think it's worth trying persistent grants on PV network, although it's
> > very unlikely that they are going to improve the throughput by 5 Gb/s.
>
> I think it can improve aggregated throughput, however it's not likely to
> improve single stream throughput.

You are probably right.
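For background on the map-versus-copy trade-off discussed above, here is a
minimal sketch of the two backend paths using the Linux grant-table helpers
of the 3.x era. The function names are made up for illustration, the exact
helper signatures vary across kernel versions and architectures, and all
error handling and netback context are omitted. Repeatedly mapping and
unmapping is what costs the page-table updates, and the unmap side also
needs TLB flushes; a "permanent" mapping pays the setup cost once and never
takes the unmap path, while grant copy avoids foreign mappings entirely at
the price of the copy itself:

#include <linux/errno.h>
#include <linux/mm.h>
#include <xen/grant_table.h>
#include <asm/xen/page.h>

/* Map one granted frontend page into the backend's address space. */
static int map_one_tx_page(struct page *page, grant_ref_t gref,
                           domid_t otherend, grant_handle_t *handle)
{
        struct gnttab_map_grant_ref op;
        unsigned long addr = (unsigned long)page_address(page);

        gnttab_set_map_op(&op, addr, GNTMAP_host_map, gref, otherend);
        if (gnttab_map_refs(&op, NULL, &page, 1) || op.status != GNTST_okay)
                return -EFAULT;

        *handle = op.handle;
        return 0;
}

static void unmap_one_tx_page(struct page *page, grant_handle_t handle)
{
        struct gnttab_unmap_grant_ref op;
        unsigned long addr = (unsigned long)page_address(page);

        /* PTE teardown and the TLB flush happen on this path. */
        gnttab_set_unmap_op(&op, addr, GNTMAP_host_map, handle);
        gnttab_unmap_refs(&op, NULL, &page, 1);
}

/* The copying scheme instead asks the hypervisor to copy the granted data
 * into a local page: no foreign mapping, no unmap, no TLB flush. */
static int copy_one_tx_chunk(struct page *local, grant_ref_t gref,
                             domid_t otherend, unsigned int len)
{
        struct gnttab_copy op = {
                .flags        = GNTCOPY_source_gref,
                .len          = len,
                .source.u.ref = gref,
                .source.domid = otherend,
                .dest.u.gmfn  = virt_to_mfn(page_address(local)),
                .dest.domid   = DOMID_SELF,
        };

        gnttab_batch_copy(&op, 1);
        return op.status == GNTST_okay ? 0 : -EFAULT;
}

Persistent grants are roughly the first path with the map done once and the
grants reused forever, with data then copied into or out of those pages
instead of being remapped per packet.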