From: Wei Liu <wei.liu2@citrix.com>
To: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>,
	ian.campbell@citrix.com, xen-devel@lists.xen.org,
	annie li <annie.li@oracle.com>,
	andrew.bennieston@citrix.com
Subject: Re: Interesting observation with network event notification and batching
Date: Mon, 1 Jul 2013 15:39:19 +0100
Message-ID: <20130701143919.GG7483@zion.uk.xensource.com>
In-Reply-To: <alpine.DEB.2.02.1307011522460.4525@kaball.uk.xensource.com>

On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> On Mon, 1 Jul 2013, Wei Liu wrote:
> > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > > 
> > > On 2013-6-29 0:15, Wei Liu wrote:
> > > >Hi all,
> > > >
> > > >After collecting more stats and comparing the copying / mapping
> > > >cases, I now have some more interesting findings, which might
> > > >contradict what I said before.
> > > >
> > > >I tuned the runes I used for benchmarking to make sure iperf and
> > > >netperf generate large packets (~64K). Here are the runes I use:
> > > >
> > > >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > >
> > > >                           COPY                    MAP
> > > >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> > > 
> > > So with the default iperf settings, copy is about 7.9G and map is
> > > about 2.5G? How about the netperf results without large packets?
> > > 
> > 
> > First question, yes.
> > 
> > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > large packets the throughput is more or less the same.
> > 
> > > >          PPI               2.90                  1.07
> > > >          SPI               37.75                 13.69
> > > >          PPN               2.90                  1.07
> > > >          SPN               37.75                 13.69
> > > >          tx_count           31808                174769
> > > 
> > > It seems the interrupt count does not affect the performance at all
> > > with -l 131072 -w 128k.
> > > 
> > 
> > Right.
> > 
> > > >          nr_napi_schedule   31805                174697
> > > >          total_packets      92354                187408
> > > >          total_reqs         1200793              2392614
> > > >
> > > >netperf  Tput:            5.8Gb/s             10.5Gb/s
> > > >          PPI               2.13                   1.00
> > > >          SPI               36.70                  16.73
> > > >          PPN               2.13                   1.31
> > > >          SPN               36.70                  16.75
> > > >          tx_count           57635                205599
> > > >          nr_napi_schedule   57633                205311
> > > >          total_packets      122800               270254
> > > >          total_reqs         2115068              3439751
> > > >
> > > >   PPI: packets processed per interrupt
> > > >   SPI: slots processed per interrupt
> > > >   PPN: packets processed per napi schedule
> > > >   SPN: slots processed per napi schedule
> > > >   tx_count: interrupt count
> > > >   total_reqs: total slots used during test
> > > >
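> > > >For reference, the derived numbers are just the raw counters divided
> > > >out. Using the iperf / COPY column as a worked example:
> > > >
> > > >   PPI = total_packets / tx_count         =   92354 / 31808 =  2.90
> > > >   SPI = total_reqs    / tx_count         = 1200793 / 31808 = 37.75
> > > >   PPN = total_packets / nr_napi_schedule =   92354 / 31805 =  2.90
> > > >   SPN = total_reqs    / nr_napi_schedule = 1200793 / 31805 = 37.75
> > > >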
> > > >* Notification and batching
> > > >
> > > >Are notification and batching really a problem? I'm not so sure now.
> > > >My first thought, before I measured PPI / PPN / SPI / SPN in the
> > > >copying case, was that "in that case netback *must* have better
> > > >batching", which turned out not to be true -- copying mode does make
> > > >netback slower, but the batching gained is not huge.
> > > >
> > > >Ideally we still want to batch as much as possible. One possible way
> > > >is to play with the 'weight' parameter in NAPI. But as the figures
> > > >show, batching does not seem to be very important for throughput, at
> > > >least for now. If the NAPI framework and netfront / netback are doing
> > > >their jobs as designed, we might not need to worry about this now.
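> > > >
> > > >For instance, the 'weight' (the per-poll packet budget, 64 by
> > > >default) could be raised when registering the vif's NAPI instance,
> > > >so each poll can consume more slots before yielding. A minimal
> > > >sketch only; XENVIF_NAPI_WEIGHT and xenvif_poll are illustrative
> > > >names, not necessarily what would land in the driver:
> > > >
> > > >    /* Sketch only: give the vif's poll routine a larger per-poll
> > > >     * budget than the stock NAPI weight of 64. */
> > > >    #define XENVIF_NAPI_WEIGHT 256
> > > >
> > > >    netif_napi_add(vif->dev, &vif->napi, xenvif_poll,
> > > >                   XENVIF_NAPI_WEIGHT);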
> > > >
> > > >Andrew, do you have any thoughts on this? You found that NAPI didn't
> > > >scale well with multi-threaded iperf in DomU; do you have any idea
> > > >how that can happen?
> > > >
> > > >* Thoughts on zero-copy TX
> > > >
> > > >With this hack we are able to achieve 10Gb/s with a single stream,
> > > >which is good. But with the classic XenoLinux kernel, which has
> > > >zero-copy TX, we weren't able to achieve this. I also developed
> > > >another zero-copy netback prototype one year ago with Ian's
> > > >out-of-tree skb frag destructor patch series. That prototype couldn't
> > > >achieve 10Gb/s either (IIRC the performance was more or less the same
> > > >as copying mode, about 6~7Gb/s).
> > > >
> > > >My hack maps all the necessary pages permanently and never unmaps
> > > >them, so we skip lots of page table manipulation and TLB flushes. My
> > > >basic conclusion is that page table manipulation and TLB flushes
> > > >incur a heavy performance penalty.
> > > >
> > > >There is no way this hack can be upstreamed. If we're to re-introduce
> > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > >mechanism. I haven't thought this through yet. Presumably this
> > > >mechanism would also benefit blk somehow? I'm not sure yet.
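> > > >
> > > >Roughly what I mean by lazy flushing, as a sketch only (all names
> > > >below are made up, and the exact gnttab_unmap_refs prototype may
> > > >differ): park the unmap ops on a queue and pay for one batched unmap
> > > >(and hence one TLB flush) for the whole lot instead of one per frag.
> > > >
> > > >    #define LAZY_UNMAP_BATCH 64
> > > >
> > > >    struct lazy_unmap {
> > > >        struct gnttab_unmap_grant_ref ops[LAZY_UNMAP_BATCH];
> > > >        struct page *pages[LAZY_UNMAP_BATCH];
> > > >        unsigned int nr;
> > > >    };
> > > >
> > > >    static void lazy_unmap_flush(struct lazy_unmap *q)
> > > >    {
> > > >        if (!q->nr)
> > > >            return;
> > > >        /* one hypercall / one flush for the whole batch */
> > > >        gnttab_unmap_refs(q->ops, NULL, q->pages, q->nr);
> > > >        q->nr = 0;
> > > >    }
> > > >
> > > >    static void lazy_unmap_add(struct lazy_unmap *q,
> > > >                               struct page *page,
> > > >                               grant_handle_t handle)
> > > >    {
> > > >        gnttab_set_unmap_op(&q->ops[q->nr],
> > > >                            (unsigned long)page_address(page),
> > > >                            GNTMAP_host_map, handle);
> > > >        q->pages[q->nr] = page;
> > > >        if (++q->nr == LAZY_UNMAP_BATCH)
> > > >            lazy_unmap_flush(q);
> > > >    }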
> > > >
> > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > >mechanism) be useful here, so that we can unify the blk and net
> > > >drivers?
> > > >
> > > >* Changes required to introduce zero-copy TX
> > > >
> > > >1. The SKB frag destructor series: to track the life cycle of SKB
> > > >frags. This is not yet upstreamed.
> > > 
> > > Are you referring to this one: http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > > 
> > 
> > Yes. But I believe there have been several versions posted. The link
> > you have is not the latest version.
> > 
> > > >
> > > >2. A mechanism to negotiate the maximum number of slots the frontend
> > > >can use: mapping requires backend's MAX_SKB_FRAGS >= frontend's
> > > >MAX_SKB_FRAGS (see the sketch after this list).
> > > >
> > > >3. Lazy flushing mechanism or persistent grants: ???
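> > > >
> > > >For (2), the obvious route is a xenstore key written by the backend
> > > >and read by the frontend at connect time. A sketch only; the key
> > > >name "feature-max-tx-slots" is made up, not an existing protocol
> > > >feature:
> > > >
> > > >    /* backend: advertise how many slots per packet it will accept */
> > > >    int err = xenbus_printf(XBT_NIL, dev->nodename,
> > > >                            "feature-max-tx-slots", "%u",
> > > >                            (unsigned int)MAX_SKB_FRAGS + 1);
> > > >
> > > >    /* frontend: read the limit and clamp its own usage */
> > > >    unsigned int max_slots;
> > > >    if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-max-tx-slots",
> > > >                     "%u", &max_slots) != 1)
> > > >        max_slots = MAX_SKB_FRAGS;  /* old backend: be conservative */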
> > > 
> > > I did some tests with persistent grants before; they did not show
> > > better performance than grant copy. But I was using the default
> > > params of netperf and had not tried large packet sizes. Your results
> > > remind me that maybe persistent grants would show similar results
> > > with larger packet sizes too.
> > > 
> > 
> > "No better performance" -- that's because both mechanisms are copying?
> > However, I presume persistent grants can scale better? In an earlier
> > email last week I read that the copying is done by the guest, so that
> > mechanism scales much better than hypervisor copying in blk's case.
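> >
> > For contrast, in the current copying scheme the *backend* asks the
> > hypervisor to move the data, one batch of gnttab_copy ops per packet,
> > roughly like this (illustrative fragment; txreq and otherend_domid
> > are placeholders):
> >
> >     /* backend TX path with grant copy: the hypervisor does the copy */
> >     struct gnttab_copy op = {
> >         .flags         = GNTCOPY_source_gref,
> >         .source.u.ref  = txreq.gref,
> >         .source.domid  = otherend_domid,
> >         .source.offset = txreq.offset,
> >         .dest.u.gmfn   = virt_to_mfn(page_address(page)),
> >         .dest.domid    = DOMID_SELF,
> >         .dest.offset   = 0,
> >         .len           = txreq.size,
> >     };
> >     gnttab_batch_copy(&op, 1);  /* hypercall: copy happens in Xen */
> >
> > With persistent grants the frontend would instead memcpy into a
> > buffer the backend keeps mapped for the lifetime of the connection,
> > so there is no per-packet grant operation and the copy runs on the
> > guest's own vcpu.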
> 
> Yes, I always expected persistent grants to be faster than
> gnttab_copy, but I was very surprised by the difference in performance:
> 
> http://marc.info/?l=xen-devel&m=137234605929944
> 
> I think it's worth trying persistent grants on PV network, although it's
> very unlikely that they are going to improve the throughput by 5 Gb/s.
> 

I think it can improve aggregate throughput; however, it's not likely to
improve single-stream throughput.

> Also, once we have both PV block and network using persistent grants,
> we might run into the grant table limit; see this email:
> 
> http://marc.info/?l=xen-devel&m=137183474618974

Yes, indeed.
