From: Wei Liu
Subject: Re: Interesting observation with network event notification and batching
Date: Mon, 1 Jul 2013 09:54:36 +0100
Message-ID: <20130701085436.GA7483@zion.uk.xensource.com>
References: <20130612101451.GF2765@zion.uk.xensource.com> <20130628161542.GF16643@zion.uk.xensource.com> <51D13456.1040609@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
In-Reply-To: <51D13456.1040609@oracle.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: annie li
Cc: Wei Liu , ian.campbell@citrix.com, stefano.stabellini@eu.citrix.com, xen-devel@lists.xen.org, andrew.bennieston@citrix.com
List-Id: xen-devel@lists.xenproject.org

On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> 
> On 2013-6-29 0:15, Wei Liu wrote:
> >Hi all,
> >
> >After collecting more stats and comparing the copying / mapping cases, I
> >now have some more interesting findings, which might contradict what I
> >said before.
> >
> >I tuned the runes I used for benchmarking to make sure iperf and netperf
> >generate large packets (~64K). Here are the runes I use:
> >
> >  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k    (see note)
> >  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> >
> >                      COPY      MAP
> >iperf Tput:           6.5Gb/s   14Gb/s (was 2.5Gb/s)
> 
> So with the default iperf settings, copy is about 7.9G, and map is about
> 2.5G? How about the result of netperf without large packets?
> 

First question, yes. Second question, 5.8Gb/s. And I believe that for the
copying scheme without large packets the throughput is more or less the
same.

> >  PPI                 2.90      1.07
> >  SPI                 37.75     13.69
> >  PPN                 2.90      1.07
> >  SPN                 37.75     13.69
> >  tx_count            31808     174769
> 
> Seems the interrupt count does not affect performance at all with -l
> 131072 -w 128k.
> 

Right.

> >  nr_napi_schedule    31805     174697
> >  total_packets       92354     187408
> >  total_reqs          1200793   2392614
> >
> >netperf Tput:         5.8Gb/s   10.5Gb/s
> >  PPI                 2.13      1.00
> >  SPI                 36.70     16.73
> >  PPN                 2.13      1.31
> >  SPN                 36.70     16.75
> >  tx_count            57635     205599
> >  nr_napi_schedule    57633     205311
> >  total_packets       122800    270254
> >  total_reqs          2115068   3439751
> >
> >  PPI: packets processed per interrupt
> >  SPI: slots processed per interrupt
> >  PPN: packets processed per napi schedule
> >  SPN: slots processed per napi schedule
> >  tx_count: interrupt count
> >  total_reqs: total slots used during the test
> >
> >* Notification and batching
> >
> >Is notification and batching really a problem? I'm not so sure now. My
> >first thought, back when I hadn't measured PPI / PPN / SPI / SPN in the
> >copying case, was that "in that case netback *must* have better
> >batching", which turned out not to be very true -- copying mode makes
> >netback slower, but the batching gained is not huge.
> >
> >Ideally we still want to batch as much as possible. Possible ways include
> >playing with the 'weight' parameter in NAPI. But as the figures show,
> >batching seems not to be very important for throughput, at least for
> >now. If the NAPI framework and netfront / netback are doing their jobs
> >as designed we might not need to worry about this now.
> >
> >Andrew, do you have any thoughts on this? You found out that NAPI didn't
> >scale well with multi-threaded iperf in DomU; do you have any handle on
> >how that can happen?
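
For reference, "playing with the 'weight' parameter" above boils down to
the value netfront passes to netif_napi_add() when it registers its poll
routine. A rough, untested sketch of that knob (xennet_poll and the value
64 below are placeholders, not taken from the actual netfront code):

  #include <linux/netdevice.h>

  /* Illustrative only: the NAPI weight is the last argument to
   * netif_napi_add(); everything named here is a stand-in. */
  static int xennet_poll(struct napi_struct *napi, int budget)
  {
          int work_done = 0;

          /* Consume up to 'budget' TX/RX completion slots here ... */

          if (work_done < budget)
                  napi_complete(napi);
          return work_done;
  }

  static void xennet_register_napi(struct net_device *dev,
                                   struct napi_struct *napi)
  {
          /* 64 is the usual default; a larger weight lets a single napi
           * schedule chew through more slots before yielding, i.e. it
           * raises the achievable SPN. */
          netif_napi_add(dev, napi, xennet_poll, 64);
  }

Whether a larger weight would actually improve batching is exactly the
open question above.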
> >* Thoughts on zero-copy TX
> >
> >With this hack we are able to achieve 10Gb/s single stream, which is
> >good. But with the classic XenoLinux kernel, which has zero-copy TX, we
> >were not able to achieve this. I also developed another zero-copy
> >netback prototype one year ago with Ian's out-of-tree skb frag
> >destructor patch series. That prototype couldn't achieve 10Gb/s either
> >(IIRC the performance was more or less the same as copying mode, about
> >6~7Gb/s).
> >
> >My hack maps all the necessary pages permanently and never unmaps them,
> >so we skip lots of page table manipulation and TLB flushes. So my basic
> >conclusion is that page table manipulation and TLB flushes do incur a
> >heavy performance penalty.
> >
> >There is no way this hack can be upstreamed. If we're to re-introduce
> >zero-copy TX, we would need to implement some sort of lazy flushing
> >mechanism. I haven't thought this through. Presumably this mechanism
> >would also benefit blk somehow? I'm not sure yet.
> >
> >Could persistent mapping (with the to-be-developed reclaim / MRU list
> >mechanism) be useful here? So that we can unify the blk and net drivers?
> >
> >* Changes required to introduce zero-copy TX
> >
> >1. SKB frag destructor series: to track the life cycle of SKB frags.
> >This is not yet upstreamed.
> 
> Are you referring to this one:
> http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> 

Yes. But I believe there have been several versions posted. The link you
have is not the latest version.

> >
> >2. Mechanism to negotiate the max slots the frontend can use: mapping
> >requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.
> >
> >3. Lazy flushing mechanism or persistent grants: ???
> 
> I did some tests with persistent grants before; they did not show
> better performance than grant copy. But I was using the default
> params of netperf and had not tried large packet sizes. Your results
> remind me that maybe persistent grants would get similar results
> with larger packet sizes too.
> 

"No better performance" -- is that because both mechanisms are copying?
However, I presume persistent grants can scale better? From an earlier
email last week, I read that in blk's case the copying is done by the
guest, so that mechanism scales much better than hypervisor copying.

Wei.

> Thanks
> Annie
> 
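
P.S. For item 2 above (negotiating the max slots the frontend may use),
the sort of handshake I have in mind would look roughly like the sketch
below. The xenstore key "feature-max-tx-slots" and both helpers are made
up for illustration; nothing like this exists in the tree yet:

  #include <linux/errno.h>
  #include <linux/skbuff.h>
  #include <xen/xenbus.h>

  /* Backend: advertise how many slots per packet it is able to map. */
  static int backend_advertise_max_slots(struct xenbus_device *dev)
  {
          /* +1 for the linear area in addition to the frags. */
          return xenbus_printf(XBT_NIL, dev->nodename,
                               "feature-max-tx-slots", "%u",
                               (unsigned int)MAX_SKB_FRAGS + 1);
  }

  /* Frontend: only switch to mapping if the backend can cope with
   * everything this kernel's MAX_SKB_FRAGS may generate. */
  static int frontend_check_max_slots(struct xenbus_device *dev)
  {
          unsigned int backend_max = 0;

          if (xenbus_scanf(XBT_NIL, dev->otherend,
                           "feature-max-tx-slots", "%u", &backend_max) != 1)
                  backend_max = 0;        /* not advertised: old backend */

          if (backend_max < MAX_SKB_FRAGS + 1)
                  return -ENOSYS;         /* stick with grant copy */
          return 0;
  }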