From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jesper Dangaard Brouer <brouer@redhat.com>
Subject: Re: [RFC PATCH 5/5] Add sample for adding simple drop program to link
Date: Wed, 6 Apr 2016 22:01:00 +0200
Message-ID: <20160406220100.0df04925@redhat.com>
References: <1459560118-5582-1-git-send-email-bblanco@plumgrid.com>
	<1459560118-5582-6-git-send-email-bblanco@plumgrid.com>
	<20160406214848.7568235b@redhat.com>
In-Reply-To: <20160406214848.7568235b@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
To: Brenden Blanco <bblanco@plumgrid.com>
Cc: davem@davemloft.net, netdev@vger.kernel.org, tom@herbertland.com,
	alexei.starovoitov@gmail.com, Or Gerlitz, daniel@iogearbox.net,
	john.fastabend@gmail.com, brouer@redhat.com

On Wed, 6 Apr 2016 21:48:48 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I'm testing with this program and these patches, after getting past
> the challenge of compiling the samples/bpf files ;-)
>
> On Fri, 1 Apr 2016 18:21:58 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>
> > Add a sample program that only drops packets at the
> > BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> > observed single core rate is ~14.6Mpps.
>
> On my i7-4790K CPU @ 4.00GHz I'm seeing 9.7Mpps (single flow/cpu).
> (generator: pktgen_sample03_burst_single_flow.sh)
>
>  # ./netdrvx1 $(</sys/class/net/$DEV/ifindex)
> sh: /sys/kernel/debug/tracing/kprobe_events: No such file or directory
> Success: Loaded file ./netdrvx1_kern.o
> proto 17:    9776320 drops/s
>
> These numbers are quite impressive. Compared to sending packets to a
> local socket that drops them: 1.7Mpps. Compared to dropping with
> iptables in the "raw" table: 3.7Mpps.
>
> If I do multiple flows, via ./pktgen_sample05_flow_per_thread.sh,
> then I hit this strange 14.5Mpps limit (proto 17: 14505558 drops/s),
> and the 4x RX CPUs are starting to NOT use 100% in softirq; they have
> some cycles attributed to %idle. (I verified the generator is sending
> at 24Mpps.)
>
> > Other tests were run, for instance without the dropcnt increment or
> > without reading from the packet header; the packet rate was mostly
> > unchanged.
>
> If I change the program to not touch packet data (don't call
> load_byte()), then the performance increases to 14.6Mpps (single
> flow/cpu), and the RX CPU is mostly idle... mlx4_en_process_rx_cq()
> and page alloc/free functions taking the time.
>
>  $ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/$DEV/ifindex)
> proto 17:   14597724 drops/s
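For readers without the patch in front of them: the program under test
is roughly the sketch below, reconstructed from the dropcnt/load_byte()
description in this thread. The SEC() name and the PHYS_DEV_DROP return
code are placeholders, not the literal patch contents; the authoritative
version is netdrvx1_kern.c from patch 5/5.

/* Sketch of a netdrvx1_kern.c-style drop-all program (samples/bpf
 * conventions).  PHYS_DEV_DROP is a placeholder; the real verdict
 * name/value is defined by this RFC's uapi changes. */
#include <stddef.h>
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include "bpf_helpers.h"

#define PHYS_DEV_DROP 0		/* placeholder drop verdict */

/* Per-CPU drop counters, indexed by IP protocol (proto 17 == UDP). */
struct bpf_map_def SEC("maps") dropcnt = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(long),
	.max_entries = 256,
};

SEC("phys_dev")
int drop_prog(void *ctx)
{
	/* The only packet-data access: this load is what shows up as
	 * sk_load_byte_positive_offset() in the perf profiles below. */
	u32 proto = load_byte(ctx, ETH_HLEN + offsetof(struct iphdr, protocol));
	long *cnt;

	/* Seen as percpu_array_map_lookup_elem() in the profiles. */
	cnt = bpf_map_lookup_elem(&dropcnt, &proto);
	if (cnt)
		*cnt += 1;

	return PHYS_DEV_DROP;	/* drop every packet */
}

char _license[] SEC("license") = "GPL";

The "don't touch packet data" variant discussed above is the same
program with the load_byte() call removed.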
>
> >   ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> > Running... ctrl^C to stop
> > Device: eth4@0
> > Result: OK: 6486875(c6485849+d1026) usec, 23689465 (60byte,0frags)
> >   3651906pps 1752Mb/sec (1752914880bps) errors: 0
> > Device: eth4@1
> > Result: OK: 6486874(c6485656+d1217) usec, 23689489 (60byte,0frags)
> >   3651911pps 1752Mb/sec (1752917280bps) errors: 0
> > Device: eth4@2
> > Result: OK: 6486851(c6485730+d1120) usec, 23687853 (60byte,0frags)
> >   3651672pps 1752Mb/sec (1752802560bps) errors: 0
> > Device: eth4@3
> > Result: OK: 6486879(c6485807+d1071) usec, 23688954 (60byte,0frags)
> >   3651825pps 1752Mb/sec (1752876000bps) errors: 0
> >
> > perf report --no-children:
> >  18.36%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_process_rx_cq
> >  15.98%  swapper      [kernel.vmlinux]  [k] poll_idle
> >  12.71%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_alloc_frags
> >   6.87%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_free_frag
> >   4.20%  ksoftirqd/1  [kernel.vmlinux]  [k] get_page_from_freelist
> >   4.09%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
> >   3.32%  ksoftirqd/1  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> >   2.39%  ksoftirqd/1  [mdio]            [k] 0x00000000000074cd
> >   2.23%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
> >   2.20%  ksoftirqd/1  [kernel.vmlinux]  [k] free_pages_prepare
> >   2.08%  ksoftirqd/1  [mlx4_en]         [k] mlx4_call_bpf
> >   1.57%  ksoftirqd/1  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
> >   1.35%  ksoftirqd/1  [mdio]            [k] 0x00000000000074fa
> >   1.09%  ksoftirqd/1  [kernel.vmlinux]  [k] free_one_page
> >   1.02%  ksoftirqd/1  [kernel.vmlinux]  [k] bpf_map_lookup_elem
> >   0.90%  ksoftirqd/1  [kernel.vmlinux]  [k] __alloc_pages_nodemask
> >   0.88%  swapper      [kernel.vmlinux]  [k] intel_idle
> >   0.82%  ksoftirqd/1  [mdio]            [k] 0x00000000000074be
> >   0.80%  swapper      [mlx4_en]         [k] mlx4_en_free_frag
>
> My picture (single flow/cpu) looks a little bit different:
>
>  + 64.33%  ksoftirqd/7  [kernel.vmlinux]  [k] __bpf_prog_run
>  +  9.60%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_alloc_frags
>  +  7.71%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_process_rx_cq
>  +  5.47%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_free_frag
>  +  1.68%  ksoftirqd/7  [kernel.vmlinux]  [k] get_page_from_freelist
>  +  1.52%  ksoftirqd/7  [mlx4_en]         [k] mlx4_call_bpf
>  +  1.02%  ksoftirqd/7  [kernel.vmlinux]  [k] free_pages_prepare
>  +  0.72%  ksoftirqd/7  [mlx4_en]         [k] mlx4_alloc_pages.isra.20
>  +  0.70%  ksoftirqd/7  [kernel.vmlinux]  [k] __rcu_read_unlock
>  +  0.65%  ksoftirqd/7  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>
> On my i7-4790K CPU, I don't have DDIO, thus I assume this high cost in
> __bpf_prog_run is due to a cache-miss on the packet data.
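(The "proto 17: N drops/s" lines in these runs come from the sample's
user-space loader, which sums the per-CPU dropcnt values once per second
and prints the delta. A rough sketch, assuming the usual samples/bpf
helpers of this era, bpf_load.h and libbpf.h; the attach-to-device step
is elided because its API is exactly what this RFC introduces:)

/* netdrvx1.c-style user-space side (sketch): load the object file,
 * attach the program to the given ifindex (elided), then poll the
 * per-CPU dropcnt map and print per-protocol drop rates. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "bpf_load.h"	/* load_bpf_file(), map_fd[], prog_fd[] */
#include "libbpf.h"	/* bpf_lookup_elem() */

#define MAX_PROTO 256

static void poll_stats(int interval)
{
	unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
	long values[nr_cpus], prev[MAX_PROTO] = { 0 };
	unsigned int proto, i;

	for (;;) {
		sleep(interval);
		for (proto = 0; proto < MAX_PROTO; proto++) {
			long sum = 0;

			/* A per-CPU map lookup fills one value per CPU. */
			if (bpf_lookup_elem(map_fd[0], &proto, values))
				continue;
			for (i = 0; i < nr_cpus; i++)
				sum += values[i];
			if (sum > prev[proto])
				printf("proto %u: %10ld drops/s\n", proto,
				       (sum - prev[proto]) / interval);
			prev[proto] = sum;
		}
	}
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s IFINDEX\n", argv[0]);
		return 1;
	}
	if (load_bpf_file("netdrvx1_kern.o"))
		return 1;
	/* ... attach prog_fd[0] to ifindex atoi(argv[1]) here ... */
	poll_stats(1);
	return 0;
}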
Before someone else points out the obvious... I forgot to enable the
JIT. Enable it:

 # echo 1 > /proc/sys/net/core/bpf_jit_enable

Performance increased to 10.8Mpps (proto 17: 10819446 drops/s).

Samples: 51K of event 'cycles', Event count (approx.): 56775706510
  Overhead  Command      Shared Object     Symbol
+   55.90%  ksoftirqd/7  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
+   10.71%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_alloc_frags
+    8.26%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_process_rx_cq
+    5.94%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_free_frag
+    2.04%  ksoftirqd/7  [kernel.vmlinux]  [k] get_page_from_freelist
+    2.03%  ksoftirqd/7  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
+    1.42%  ksoftirqd/7  [mlx4_en]         [k] mlx4_call_bpf
+    1.04%  ksoftirqd/7  [kernel.vmlinux]  [k] free_pages_prepare
+    1.03%  ksoftirqd/7  [kernel.vmlinux]  [k] __rcu_read_unlock
+    0.97%  ksoftirqd/7  [mlx4_en]         [k] mlx4_alloc_pages.isra.20
+    0.95%  ksoftirqd/7  [devlink]         [k] 0x0000000000005f87
+    0.58%  ksoftirqd/7  [devlink]         [k] 0x0000000000005f8f
+    0.49%  ksoftirqd/7  [kernel.vmlinux]  [k] __free_pages_ok
+    0.47%  ksoftirqd/7  [kernel.vmlinux]  [k] __rcu_read_lock
+    0.46%  ksoftirqd/7  [kernel.vmlinux]  [k] free_one_page
+    0.38%  ksoftirqd/7  [kernel.vmlinux]  [k] net_rx_action
+    0.36%  ksoftirqd/7  [kernel.vmlinux]  [k] bpf_map_lookup_elem
+    0.36%  ksoftirqd/7  [kernel.vmlinux]  [k] __mod_zone_page_state
+    0.34%  ksoftirqd/7  [kernel.vmlinux]  [k] __alloc_pages_nodemask
+    0.32%  ksoftirqd/7  [kernel.vmlinux]  [k] _raw_spin_lock
+    0.31%  ksoftirqd/7  [devlink]         [k] 0x0000000000005f0a
+    0.29%  ksoftirqd/7  [kernel.vmlinux]  [k] next_zones_zonelist

This is very likely a cache-miss in sk_load_byte_positive_offset().

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer