From: Stefano Stabellini
Subject: Re: Interesting observation with network event notification and batching
Date: Mon, 1 Jul 2013 15:54:50 +0100
In-Reply-To: <20130701143919.GG7483@zion.uk.xensource.com>
References: <20130612101451.GF2765@zion.uk.xensource.com>
 <20130628161542.GF16643@zion.uk.xensource.com>
 <51D13456.1040609@oracle.com>
 <20130701085436.GA7483@zion.uk.xensource.com>
 <20130701143919.GG7483@zion.uk.xensource.com>
To: Wei Liu
Cc: ian.campbell@citrix.com, Stefano Stabellini, xen-devel@lists.xen.org,
 annie li, andrew.bennieston@citrix.com
List-Id: xen-devel@lists.xenproject.org

On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> > On Mon, 1 Jul 2013, Wei Liu wrote:
> > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > > >
> > > > On 2013-6-29 0:15, Wei Liu wrote:
> > > > >Hi all,
> > > > >
> > > > >After collecting more stats and comparing the copying / mapping cases,
> > > > >I now have some more interesting findings, which might contradict what
> > > > >I said before.
> > > > >
> > > > >I tuned the runes I used for benchmarking to make sure iperf and
> > > > >netperf generate large packets (~64K). Here are the runes I use:
> > > > >
> > > > >  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > > >  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > > >
> > > > >                         COPY        MAP
> > > > >iperf Tput:              6.5Gb/s     14Gb/s (was 2.5Gb/s)
> > > >
> > > > So with the default iperf settings, copy is about 7.9G, and map is
> > > > about 2.5G? How about the result of netperf without large packets?
> > > >
> > >
> > > First question, yes.
> > >
> > > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > > large packets the throughput is more or less the same.
> > >
> > > > >        PPI              2.90        1.07
> > > > >        SPI              37.75       13.69
> > > > >        PPN              2.90        1.07
> > > > >        SPN              37.75       13.69
> > > > >        tx_count         31808       174769
> > > >
> > > > Seems the interrupt count does not affect the performance at all with
> > > > -l 131072 -w 128k.
> > > >
> > >
> > > Right.
> > >
> > > > >        nr_napi_schedule 31805       174697
> > > > >        total_packets    92354       187408
> > > > >        total_reqs       1200793     2392614
> > > > >
> > > > >netperf Tput:            5.8Gb/s     10.5Gb/s
> > > > >        PPI              2.13        1.00
> > > > >        SPI              36.70       16.73
> > > > >        PPN              2.13        1.31
> > > > >        SPN              36.70       16.75
> > > > >        tx_count         57635       205599
> > > > >        nr_napi_schedule 57633       205311
> > > > >        total_packets    122800      270254
> > > > >        total_reqs       2115068     3439751
> > > > >
> > > > > PPI: packets processed per interrupt
> > > > > SPI: slots processed per interrupt
> > > > > PPN: packets processed per napi schedule
> > > > > SPN: slots processed per napi schedule
> > > > > tx_count: interrupt count
> > > > > total_reqs: total slots used during test
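For reference, the derived figures above appear to be simple ratios of the
raw counters. Taking the copying-mode iperf column as an example:

  PPI = total_packets / tx_count          = 92354   / 31808 ~= 2.90
  SPI = total_reqs    / tx_count          = 1200793 / 31808 ~= 37.75
  PPN = total_packets / nr_napi_schedule  = 92354   / 31805 ~= 2.90
  SPN = total_reqs    / nr_napi_schedule  = 1200793 / 31805 ~= 37.75

So a higher PPI / SPI just means that more packets / slots are amortised over
each interrupt or NAPI pass.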
> > > > >* Notification and batching
> > > > >
> > > > >Is notification and batching really a problem? I'm not so sure now. My
> > > > >first thought when I didn't measure PPI / PPN / SPI / SPN in the
> > > > >copying case was that "in that case netback *must* have better
> > > > >batching", which turned out not to be very true -- copying mode makes
> > > > >netback slower, however the batching gained is not huge.
> > > > >
> > > > >Ideally we still want to batch as much as possible. Possible ways
> > > > >include playing with the 'weight' parameter in NAPI. But as the
> > > > >figures show, batching seems not to be very important for throughput,
> > > > >at least for now. If the NAPI framework and netfront / netback are
> > > > >doing their jobs as designed we might not need to worry about this
> > > > >now.
> > > > >
> > > > >Andrew, do you have any thoughts on this? You found out that NAPI
> > > > >didn't scale well with multi-threaded iperf in DomU, do you have any
> > > > >handle on how that can happen?
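To make the 'weight' knob mentioned above concrete, here is a minimal,
generic NAPI poll skeleton; this is not the actual netfront / netback code,
and the mydev_* names are made up for illustration. In kernels of this era
the weight passed to netif_napi_add() is the per-poll budget, i.e. the upper
bound on how many packets one napi schedule may batch before yielding:

#include <linux/kernel.h>
#include <linux/netdevice.h>

struct mydev_priv {
        struct napi_struct napi;
        /* ... ring pointers, event channel, etc. ... */
};

/* Hypothetical helpers standing in for the real RX processing. */
bool mydev_rx_pending(struct mydev_priv *priv);
void mydev_process_one(struct mydev_priv *priv);
void mydev_enable_interrupts(struct mydev_priv *priv);

static int mydev_poll(struct napi_struct *napi, int budget)
{
        struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
        int work_done = 0;

        /* Consume at most 'budget' packets in this poll pass. */
        while (work_done < budget && mydev_rx_pending(priv)) {
                mydev_process_one(priv);
                work_done++;
        }

        /* Ran out of work before hitting the budget: leave polling mode
         * and re-enable the interrupt / event channel. */
        if (work_done < budget) {
                napi_complete(napi);
                mydev_enable_interrupts(priv);
        }

        return work_done;
}

static void mydev_init_napi(struct net_device *dev, struct mydev_priv *priv)
{
        /* The last argument is the NAPI weight, i.e. the per-poll budget;
         * 64 is the conventional default. Raising it allows larger batches
         * per napi schedule, at some cost in latency and fairness. */
        netif_napi_add(dev, &priv->napi, mydev_poll, 64);
}

Whether raising the weight actually helps depends on how much work is
pending on the ring when the poll runs.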
> > > > >* Thoughts on zero-copy TX
> > > > >
> > > > >With this hack we are able to achieve 10Gb/s single stream, which is
> > > > >good. But with the classic XenoLinux kernel, which has zero-copy TX,
> > > > >we weren't able to achieve this. I also developed another zero-copy
> > > > >netback prototype one year ago with Ian's out-of-tree skb frag
> > > > >destructor patch series. That prototype couldn't achieve 10Gb/s either
> > > > >(IIRC the performance was more or less the same as copying mode, about
> > > > >6~7Gb/s).
> > > > >
> > > > >My hack maps all necessary pages permanently and there is no unmap, so
> > > > >we skip lots of page table manipulation and TLB flushes. So my basic
> > > > >conclusion is that page table manipulation and TLB flushes do incur a
> > > > >heavy performance penalty.
> > > > >
> > > > >There is no way this hack can be upstreamed. If we're to re-introduce
> > > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > > >mechanism. I haven't thought this through. Presumably this mechanism
> > > > >would also benefit blk somehow? I'm not sure yet.
> > > > >
> > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > > >mechanism) be useful here? So that we can unify the blk and net
> > > > >drivers?
> > > > >
> > > > >* Changes required to introduce zero-copy TX
> > > > >
> > > > >1. SKB frag destructor series: to track the life cycle of SKB frags.
> > > > >This is not yet upstreamed.
> > > >
> > > > Are you mentioning this one?
> > > > http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
> > > >
> > >
> > > Yes. But I believe there have been several versions posted. The link you
> > > have is not the latest version.
> > >
> > > > >2. Mechanism to negotiate the max slots the frontend can use: mapping
> > > > >requires the backend's MAX_SKB_FRAGS >= the frontend's MAX_SKB_FRAGS.
> > > > >
> > > > >3. Lazy flushing mechanism or persistent grants: ???
> > > >
> > > > I did some tests with persistent grants before, and they did not show
> > > > better performance than grant copy. But I was using the default params
> > > > of netperf and did not try large packet sizes. Your results remind me
> > > > that maybe persistent grants would get similar results with larger
> > > > packet sizes too.
> > > >
> > >
> > > "No better performance" -- that's because both mechanisms are copying?
> > > However I presume persistent grants can scale better? From an earlier
> > > email last week, I read that copying is done by the guest, so this
> > > mechanism scales much better than hypervisor copying in blk's case.
> >
> > Yes, I always expected persistent grants to be faster than gnttab_copy,
> > but I was very surprised by the difference in performance:
> >
> > http://marc.info/?l=xen-devel&m=137234605929944
> >
> > I think it's worth trying persistent grants on PV network, although it's
> > very unlikely that they are going to improve the throughput by 5 Gb/s.
>
> I think it can improve aggregated throughput, however it's not likely to
> improve single stream throughput.

You are probably right.
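For background on the map-versus-copy trade-off discussed above, here is a
minimal sketch of the two backend paths using the Linux grant-table helpers
of the 3.x era. The function names are made up for illustration, the exact
helper signatures vary across kernel versions and architectures, and all
error handling and netback context are omitted. Repeatedly mapping and
unmapping is what costs the page-table updates, and the unmap side also
needs TLB flushes; a "permanent" mapping pays the setup cost once and never
takes the unmap path, while grant copy avoids foreign mappings entirely at
the price of the copy itself:

#include <linux/errno.h>
#include <linux/mm.h>
#include <xen/grant_table.h>
#include <asm/xen/page.h>

/* Map one granted frontend page into the backend's address space. */
static int map_one_tx_page(struct page *page, grant_ref_t gref,
                           domid_t otherend, grant_handle_t *handle)
{
        struct gnttab_map_grant_ref op;
        unsigned long addr = (unsigned long)page_address(page);

        gnttab_set_map_op(&op, addr, GNTMAP_host_map, gref, otherend);
        if (gnttab_map_refs(&op, NULL, &page, 1) || op.status != GNTST_okay)
                return -EFAULT;

        *handle = op.handle;
        return 0;
}

static void unmap_one_tx_page(struct page *page, grant_handle_t handle)
{
        struct gnttab_unmap_grant_ref op;
        unsigned long addr = (unsigned long)page_address(page);

        /* PTE teardown and the TLB flush happen on this path. */
        gnttab_set_unmap_op(&op, addr, GNTMAP_host_map, handle);
        gnttab_unmap_refs(&op, NULL, &page, 1);
}

/* The copying scheme instead asks the hypervisor to copy the granted data
 * into a local page: no foreign mapping, no unmap, no TLB flush. */
static int copy_one_tx_chunk(struct page *local, grant_ref_t gref,
                             domid_t otherend, unsigned int len)
{
        struct gnttab_copy op = {
                .flags        = GNTCOPY_source_gref,
                .len          = len,
                .source.u.ref = gref,
                .source.domid = otherend,
                .dest.u.gmfn  = virt_to_mfn(page_address(local)),
                .dest.domid   = DOMID_SELF,
        };

        gnttab_batch_copy(&op, 1);
        return op.status == GNTST_okay ? 0 : -EFAULT;
}

Persistent grants are roughly the first path with the map done once and the
grants reused forever, with data then copied into or out of those pages
instead of being remapped per packet.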