From: Stefano Stabellini
Subject: Re: Interesting observation with network event notification and batching
Date: Mon, 1 Jul 2013 15:19:48 +0100
In-Reply-To: <51D13456.1040609@oracle.com>
To: annie li
Cc: Wei Liu, ian.campbell@citrix.com, stefano.stabellini@eu.citrix.com,
    xen-devel@lists.xen.org, andrew.bennieston@citrix.com
List-Id: xen-devel@lists.xenproject.org

Could you please use plain text emails in the future?

On Mon, 1 Jul 2013, annie li wrote:
> On 2013-6-29 0:15, Wei Liu wrote:
>
> Hi all,
>
> After collecting more stats and comparing the copying / mapping cases, I
> now have some more interesting findings, which might contradict what I
> said before.
>
> I tuned the runes I use for benchmarking to make sure iperf and netperf
> generate large packets (~64K). Here are the runes I use:
>
>   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k    (see note)
>   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>
>                           COPY        MAP
> iperf   Tput:             6.5Gb/s     14Gb/s (was 2.5Gb/s)
>
> So with the default iperf settings, copy is about 7.9G and map is about
> 2.5G? How about the netperf result without large packets?
>
>         PPI               2.90        1.07
>         SPI               37.75       13.69
>         PPN               2.90        1.07
>         SPN               37.75       13.69
>         tx_count          31808       174769
>
> It seems the interrupt count does not affect the performance at all with
> -l 131072 -w 128k.
>
>         nr_napi_schedule  31805       174697
>         total_packets     92354       187408
>         total_reqs        1200793     2392614
>
> netperf Tput:             5.8Gb/s     10.5Gb/s
>         PPI               2.13        1.00
>         SPI               36.70       16.73
>         PPN               2.13        1.31
>         SPN               36.70       16.75
>         tx_count          57635       205599
>         nr_napi_schedule  57633       205311
>         total_packets     122800      270254
>         total_reqs        2115068     3439751
>
> PPI: packets processed per interrupt
> SPI: slots processed per interrupt
> PPN: packets processed per NAPI schedule
> SPN: slots processed per NAPI schedule
> tx_count: interrupt count
> total_reqs: total slots used during the test
>
> * Notification and batching
>
> Are notification and batching really a problem? I'm not so sure now. My
> first thought, before I measured PPI / PPN / SPI / SPN in the copying
> case, was that "in that case netback *must* batch better", which turned
> out not to be quite true -- copying mode makes netback slower, but the
> batching gained is not huge.
>
> Ideally we still want to batch as much as possible. One possibility is
> to play with the 'weight' parameter in NAPI. But as the figures show,
> batching does not seem to be very important for throughput, at least
> for now. If the NAPI framework and netfront / netback are doing their
> jobs as designed, we might not need to worry about this for the moment.
>
> Andrew, do you have any thoughts on this? You found that NAPI didn't
> scale well with multi-threaded iperf in DomU; do you have any idea how
> that can happen?
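As a point of reference for the 'weight' discussion above: in the 3.x-era
kernels the weight is simply the last argument to netif_napi_add(), and it
caps the budget handed to the poll callback. The sketch below is the generic
driver shape, not the actual netfront/netback code; pending_work(),
consume_one_packet(), my_register_napi() and MY_NAPI_WEIGHT are placeholders
made up for illustration.

    /*
     * Illustrative only: how NAPI's 'weight' bounds per-poll batching.
     * Generic driver shape with the 3.x-era netif_napi_add() signature;
     * not the real netfront/netback code.
     */
    #include <linux/netdevice.h>

    #define MY_NAPI_WEIGHT 64       /* common default; raising it allows bigger batches */

    /* Placeholders standing in for the driver's real ring handling. */
    static bool pending_work(struct napi_struct *napi)        { return false; }
    static void consume_one_packet(struct napi_struct *napi)  { }

    static int my_poll(struct napi_struct *napi, int budget)
    {
            int work_done = 0;

            /*
             * 'budget' is capped by the weight passed to netif_napi_add().
             * A bigger weight lets more packets (and hence more slots) be
             * handled per NAPI schedule, i.e. higher PPN/SPN.
             */
            while (work_done < budget && pending_work(napi)) {
                    consume_one_packet(napi);
                    work_done++;
            }

            if (work_done < budget) {
                    /*
                     * Out of work: stop polling and re-enable the interrupt
                     * (for netfront/netback, the event channel).
                     */
                    napi_complete(napi);
            }

            return work_done;
    }

    static void my_register_napi(struct net_device *dev, struct napi_struct *napi)
    {
            netif_napi_add(dev, napi, my_poll, MY_NAPI_WEIGHT);
    }

Note that a larger weight only raises the ceiling on PPN/SPN; if the ring has
little queued work per schedule the batches stay small, which would be
consistent with the figures above.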
> * Thoughts on zero-copy TX
>
> With this hack we are able to achieve 10Gb/s with a single stream, which
> is good. But with the classic XenoLinux kernel, which has zero-copy TX,
> we were not able to achieve this. I also developed another zero-copy
> netback prototype one year ago with Ian's out-of-tree skb frag
> destructor patch series. That prototype couldn't achieve 10Gb/s either
> (IIRC the performance was more or less the same as copying mode, about
> 6~7Gb/s).
>
> My hack maps all the necessary pages permanently; there is no unmap, so
> we skip lots of page table manipulation and TLB flushes. My basic
> conclusion is therefore that page table manipulation and TLB flushes do
> incur a heavy performance penalty.
>
> There is no way this hack can be upstreamed. If we are to re-introduce
> zero-copy TX, we would need to implement some sort of lazy flushing
> mechanism. I haven't thought this through. Presumably this mechanism
> would also benefit blk somehow? I'm not sure yet.
>
> Could persistent mapping (with the to-be-developed reclaim / MRU list
> mechanism) be useful here, so that we can unify the blk and net drivers?
>
> * Changes required to introduce zero-copy TX
>
> 1. SKB frag destructor series: to track the life cycle of SKB frags.
>    This is not yet upstreamed.
>
> Are you mentioning this one:
> http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
>
> 2. Mechanism to negotiate the max slots the frontend can use: mapping
>    requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.
>
> 3. Lazy flushing mechanism or persistent grants: ???
>
> I did some tests with persistent grants before, and they did not show
> better performance than grant copy. But I was using the default netperf
> parameters and had not tried large packet sizes. Your results remind me
> that maybe persistent grants would show similar results with larger
> packet sizes too.
>
> Thanks
> Annie
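Regarding point 2 in Wei's list, the max-slots negotiation could presumably
follow the usual xenstore feature-flag handshake. A minimal sketch, assuming
a made-up key name ("feature-max-tx-slots"), a made-up example value and
made-up helper names; only xenbus_printf(), xenbus_scanf() and XBT_NIL are
real xenbus interfaces.

    /*
     * Illustrative only: negotiating a per-packet slot limit over xenstore.
     * Key name, value and helpers are invented for this sketch.
     */
    #include <xen/xenbus.h>

    #define MY_DEFAULT_TX_SLOTS 18          /* example value only */

    /* Backend side: advertise how many slots per packet it is willing to map. */
    static int backend_advertise_slots(struct xenbus_device *dev)
    {
            return xenbus_printf(XBT_NIL, dev->nodename,
                                 "feature-max-tx-slots", "%u", MY_DEFAULT_TX_SLOTS);
    }

    /* Frontend side: read the backend's limit and clamp itself to it. */
    static unsigned int frontend_read_slots(struct xenbus_device *dev)
    {
            unsigned int max_slots;

            if (xenbus_scanf(XBT_NIL, dev->otherend,
                             "feature-max-tx-slots", "%u", &max_slots) != 1)
                    max_slots = MY_DEFAULT_TX_SLOTS;    /* key absent: assume legacy limit */

            return max_slots;
    }

If the backend does not advertise the key, the frontend simply falls back to
the legacy slot limit, so older backends would keep working unchanged.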