From: Wei Liu
Subject: Re: Interesting observation with network event notification and batching
Date: Fri, 28 Jun 2013 17:15:42 +0100
Message-ID: <20130628161542.GF16643@zion.uk.xensource.com>
In-Reply-To: <20130612101451.GF2765@zion.uk.xensource.com>
To: xen-devel@lists.xen.org
Cc: wei.liu2@citrix.com, ian.campbell@citrix.com, stefano.stabellini@eu.citrix.com, annie.li@oracle.com, andrew.bennieston@citrix.com

Hi all,

After collecting more stats and comparing the copying and mapping cases, I
now have some more interesting findings, which might contradict what I said
before.

I tuned the runes I use for benchmarking to make sure iperf and netperf
generate large packets (~64K). Here are the runes I use:

  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k    (see note)
  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072

                     COPY       MAP
  iperf Tput:        6.5Gb/s    14Gb/s (was 2.5Gb/s)
  PPI                2.90       1.07
  SPI                37.75      13.69
  PPN                2.90       1.07
  SPN                37.75      13.69
  tx_count           31808      174769
  nr_napi_schedule   31805      174697
  total_packets      92354      187408
  total_reqs         1200793    2392614

                     COPY       MAP
  netperf Tput:      5.8Gb/s    10.5Gb/s
  PPI                2.13       1.00
  SPI                36.70      16.73
  PPN                2.13       1.31
  SPN                36.70      16.75
  tx_count           57635      205599
  nr_napi_schedule   57633      205311
  total_packets      122800     270254
  total_reqs         2115068    3439751

  PPI: packets processed per interrupt
  SPI: slots processed per interrupt
  PPN: packets processed per NAPI schedule
  SPN: slots processed per NAPI schedule
  tx_count: interrupt count
  nr_napi_schedule: NAPI schedule count
  total_packets: total packets processed during the test
  total_reqs: total slots used during the test

* Notification and batching

Are notification and batching really a problem? I'm not so sure now. Before
I measured PPI / PPN / SPI / SPN in the copying case, my first thought was
that "in that case netback *must* have better batching", which turned out
not to be quite true -- copying mode does make netback slower, but the extra
batching it gains is not huge.

Ideally we still want to batch as much as possible. One possible knob is the
'weight' parameter in NAPI (a trivial illustration is at the end of this
mail). But as the figures show, batching does not seem to be very important
for throughput, at least for now. If the NAPI framework and netfront /
netback are doing their jobs as designed, we might not need to worry about
this for the moment.

Andrew, do you have any thoughts on this? You found that NAPI didn't scale
well with multi-threaded iperf in DomU -- do you have any idea how that can
happen?

* Thoughts on zero-copy TX

With this hack we are able to achieve 10Gb/s with a single stream, which is
good. But the classic XenoLinux kernel, which has zero-copy TX, was not able
to achieve this. I also developed another zero-copy netback prototype a year
ago on top of Ian's out-of-tree skb frag destructor patch series. That
prototype couldn't reach 10Gb/s either (IIRC its performance was more or
less the same as copying mode, about 6~7Gb/s).

My hack maps all necessary pages permanently and never unmaps them, so we
skip lots of page table manipulation and TLB flushes. My basic conclusion is
therefore that page table manipulation and TLB flushes do incur a heavy
performance penalty. There is no way this hack can be upstreamed, though. If
we're to re-introduce zero-copy TX, we would need to implement some sort of
lazy flushing mechanism. I haven't thought this through.
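For illustration, below is roughly what the hack does on the map side. This
is only a minimal sketch with made-up names (persist_map, struct
persist_slot), not the actual patch; the point is simply that there is no
unmap, and therefore no page table teardown or TLB flush, on the data path.

  /*
   * Sketch only, NOT the actual hack.  Map a frontend grant once and
   * keep it mapped forever.
   */
  #include <linux/errno.h>
  #include <xen/grant_table.h>
  #include <xen/interface/grant_table.h>
  #include <asm/xen/hypercall.h>

  struct persist_slot {
          grant_ref_t ref;
          grant_handle_t handle;
          void *vaddr;            /* backend-side mapping, kept around */
  };

  static int persist_map(domid_t otherend, grant_ref_t ref, void *vaddr,
                         struct persist_slot *slot)
  {
          struct gnttab_map_grant_ref op;

          gnttab_set_map_op(&op, (unsigned long)vaddr, GNTMAP_host_map,
                            ref, otherend);

          if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
                  return -EFAULT;
          if (op.status != GNTST_okay)
                  return -EINVAL;

          /* Deliberately no matching GNTTABOP_unmap_grant_ref on the
           * data path: no page table manipulation, no TLB flush. */
          slot->ref = ref;
          slot->handle = op.handle;
          slot->vaddr = vaddr;
          return 0;
  }

A real implementation would of course need to reclaim mappings eventually,
which is exactly where the lazy flushing / reclaim idea comes in.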
Presumably such a lazy flushing mechanism would also benefit blk somehow?
I'm not sure yet. Could persistent mapping (with the to-be-developed reclaim
/ MRU list mechanism) be useful here, so that we can unify the blk and net
drivers?

* Changes required to introduce zero-copy TX

1. SKB frag destructor series: to track the life cycle of SKB frags. This
   is not yet upstreamed.

2. A mechanism to negotiate the maximum number of slots the frontend can
   use: mapping requires that the backend's MAX_SKB_FRAGS >= the frontend's
   MAX_SKB_FRAGS. (A rough sketch of what I mean is at the end of this
   mail.)

3. Lazy flushing mechanism or persistent grants: ???

Wei.

* Note

In my previous tests I only ran iperf and didn't have the right rune to
generate large packets. iperf seems to increase its packet size as the test
goes on. In the copying case the packet size eventually reached 64K, while
in the mapping case something odd happened (I believe that must be due to a
bug in my hack :-/) -- the packet size stayed at the default (8K). Adding
'-l 131072' to the iperf rune makes sure the packets are always 64K.
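* Sketch for item 2 (max slots negotiation)

Just to make item 2 above a bit more concrete, something along these lines
could work, using the usual xenbus_printf / xenbus_scanf helpers. The
xenstore key name "max-tx-slots" and the demo_* function names are made up
for illustration; the real protocol would need proper discussion.

  #include <linux/kernel.h>
  #include <linux/errno.h>
  #include <xen/xenbus.h>

  /* Backend: advertise how many slots per packet it can handle. */
  static int demo_backend_advertise(struct xenbus_device *dev,
                                    unsigned int slots)
  {
          return xenbus_printf(XBT_NIL, dev->nodename,
                               "max-tx-slots", "%u", slots);
  }

  /* Frontend: never use more slots than the backend advertises,
   * falling back to the local limit if the key is absent. */
  static unsigned int demo_frontend_max_slots(struct xenbus_device *dev,
                                              unsigned int local_max)
  {
          unsigned int slots;

          if (xenbus_scanf(XBT_NIL, dev->otherend,
                           "max-tx-slots", "%u", &slots) != 1)
                  return local_max;

          return min(slots, local_max);
  }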
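* Sketch for the NAPI 'weight' knob

And the trivial illustration of the 'weight' parameter mentioned in the
batching section, again with made-up demo_* names -- this is just the
generic netif_napi_add() API, not how netfront / netback wires it up today.

  #include <linux/netdevice.h>

  /* Dummy poll function to make the example complete. */
  static int demo_poll(struct napi_struct *napi, int budget)
  {
          int done = 0;

          /* ... process up to 'budget' packets here, counting them ... */

          if (done < budget)
                  napi_complete(napi);
          return done;
  }

  static void demo_setup_napi(struct net_device *dev,
                              struct napi_struct *napi)
  {
          /* Default weight is 64; a larger budget lets each
           * napi_schedule process more packets before yielding
           * (recent kernels complain about values above 64). */
          netif_napi_add(dev, napi, demo_poll, 256);
  }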