From: "Paweł Staszewski" <pstaszewski@itcare.pl>
To: netdev@vger.kernel.org
Subject: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
Date: Fri, 29 Nov 2019 23:00:01 +0100
Message-ID: <81ad4acf-c9b4-b2e8-d6b1-7e1245bce8a5@itcare.pl>

As always, each year I need to summarize network performance for
routing applications, i.e. a Linux router on the native kernel stack
(without XDP/DPDK/VPP etc.) :)

HW setup:

Server (Supermicro SYS-1019P-WTR)

1x Intel 6146

2x Mellanox ConnectX-5 (100G), installed in two different x16 PCIe
gen 3.1 slots

6x 8GB DDR4-2666 (this really matters, since 100G is about 12.5 GB/s of
memory bandwidth in one direction - see the quick arithmetic below)
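
A quick back-of-the-envelope check of that claim (just arithmetic,
nothing measured here):

  echo 'scale=1; 100/8'   | bc     # 12.5 GB/s of payload in one direction
  echo 'scale=1; 2*100/8' | bc     # 25.0 GB/s when RX and TX both run at line rate

Compare that with the STREAM numbers further down.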


And here it is:

perf top at 72 Gbit/s RX and 72 Gbit/s TX (at the same time):

    PerfTop:   91202 irqs/sec  kernel:99.7%  exact: 100.0% [4000Hz cycles:ppp],  (all, 24 CPUs)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 


      7.56%  [kernel]       [k] __dev_queue_xmit
      5.27%  [kernel]       [k] build_skb
      4.41%  [kernel]       [k] rr_transmit
      4.17%  [kernel]       [k] fib_table_lookup
      3.83%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
      3.30%  [kernel]       [k] mlx5e_sq_xmit
      3.14%  [kernel]       [k] __netif_receive_skb_core
      2.48%  [kernel]       [k] netif_skb_features
      2.36%  [kernel]       [k] _raw_spin_trylock
      2.27%  [kernel]       [k] dev_hard_start_xmit
      2.26%  [kernel]       [k] dev_gro_receive
      2.20%  [kernel]       [k] mlx5e_handle_rx_cqe_mpwrq
      1.92%  [kernel]       [k] mlx5_eq_comp_int
      1.91%  [kernel]       [k] mlx5e_poll_tx_cq
      1.74%  [kernel]       [k] tcp_gro_receive
      1.68%  [kernel]       [k] memcpy_erms
      1.64%  [kernel]       [k] kmem_cache_free_bulk
      1.57%  [kernel]       [k] inet_gro_receive
      1.55%  [kernel]       [k] netdev_pick_tx
      1.52%  [kernel]       [k] ip_forward
      1.45%  [kernel]       [k] team_xmit
      1.40%  [kernel]       [k] vlan_do_receive
      1.37%  [kernel]       [k] team_handle_frame
      1.36%  [kernel]       [k] __build_skb
      1.33%  [kernel]       [k] ipt_do_table
      1.33%  [kernel]       [k] mlx5e_poll_rx_cq
      1.28%  [kernel]       [k] ip_finish_output2
      1.26%  [kernel]       [k] vlan_passthru_hard_header
      1.20%  [kernel]       [k] netdev_core_pick_tx
      0.93%  [kernel]       [k] ip_rcv_core.isra.22.constprop.27
      0.87%  [kernel]       [k] validate_xmit_skb.isra.148
      0.87%  [kernel]       [k] ip_route_input_rcu
      0.78%  [kernel]       [k] kmem_cache_alloc
      0.77%  [kernel]       [k] mlx5e_handle_rx_dim
      0.71%  [kernel]       [k] iommu_need_mapping
      0.69%  [kernel]       [k] tasklet_action_common.isra.21
      0.66%  [kernel]       [k] mlx5e_xmit
      0.65%  [kernel]       [k] mlx5e_post_rx_mpwqes
      0.63%  [kernel]       [k] _raw_spin_lock
      0.61%  [kernel]       [k] ip_sublist_rcv
      0.57%  [kernel]       [k] skb_release_data
      0.53%  [kernel]       [k] __local_bh_enable_ip
      0.53%  [kernel]       [k] tcp4_gro_receive
      0.51%  [kernel]       [k] pfifo_fast_dequeue
      0.51%  [kernel]       [k] page_frag_free
      0.50%  [kernel]       [k] kmem_cache_free
      0.47%  [kernel]       [k] dma_direct_map_page
      0.45%  [kernel]       [k] native_irq_return_iret
      0.44%  [kernel]       [k] __slab_free.isra.89
      0.43%  [kernel]       [k] skb_gro_receive
      0.43%  [kernel]       [k] napi_gro_receive
      0.43%  [kernel]       [k] __do_softirq
      0.41%  [kernel]       [k] sch_direct_xmit
      0.41%  [kernel]       [k] ip_rcv_finish_core.isra.19
      0.40%  [kernel]       [k] skb_network_protocol
      0.40%  [kernel]       [k] __get_xps_queue_idx
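
The profile above is a plain perf top run, something along these lines
(options reconstructed from the PerfTop banner, not pasted from shell
history):

  perf top -F 4000 -e cycles:ppp     # 4000 Hz, precise cycles, all CPUs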


I'm using team (2x 100G LAG) - that is why there is some load here:

      4.41%  [kernel]       [k] rr_transmit
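
For reference, the LAG is an ordinary team device with the round-robin
runner (rr_transmit is the roundrobin mode's xmit path); roughly this
kind of setup, with interface names and config shown only as an example:

  teamd -d -t team0 -c '{"runner": {"name": "roundrobin"}}'
  ip link set enp179s0f0 down && teamdctl team0 port add enp179s0f0
  ip link set enp179s0f1 down && teamdctl team0 port add enp179s0f1
  ip link set team0 up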



No discards on interfaces:

ethtool -S enp179s0f0 | grep disc
      rx_discards_phy: 0
      tx_discards_phy: 0

ethtool -S enp179s0f1 | grep disc
      rx_discards_phy: 0
      tx_discards_phy: 0
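
Something along these lines can keep an eye on both ports during the
test (just an illustrative one-liner):

  watch -n 1 'for i in enp179s0f0 enp179s0f1; do ethtool -S $i | grep disc; done'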

STREAM memory bandwidth test run while pushing 72G/72G of traffic:

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           38948.8     0.004368     0.004108     0.004533
Scale:          37914.6     0.004473     0.004220     0.004802
Add:            43134.6     0.005801     0.005564     0.006086
Triad:          42934.1     0.005696     0.005590     0.005901
-------------------------------------------------------------
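
That is the standard STREAM benchmark (stream.c) run while the traffic
was flowing; roughly this kind of build and run, with the array size
chosen to be much larger than the caches (the exact flags here are an
example, not necessarily what was used on this box):

  gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
  OMP_NUM_THREADS=24 ./stream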


And some links to screenshots:

Softirqs

https://pasteboard.co/IIZkGrw.png

And bandwidth / CPU / pps graphs:

https://pasteboard.co/IIZl6XP.png


Currently it looks like the biggest problem for 100G is CPU->mem->NIC
bandwidth, or the NIC doorbell / page cache handling on RX processing.
What I can see is that if I run iperf on this host I can TX a full 100G,
but I can't RX 100G when I flood this host from a packet generator (it
starts to drop packets at around 82 Gbit/s) - and this is not a pps
problem, it is a bandwidth problem.

For example, I can flood RX with 14 Mpps of 64B packets without NIC
discards, but I can't flood it with 1000B frames at the same pps,
because when it reaches 82 Gbit/s the NICs start to report discards.
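
The numbers line up with a bandwidth limit rather than a pps limit
(rough payload-only arithmetic, ignoring Ethernet overhead):

  echo '64*8*14000000/10^9'   | bc   # ~7 Gbit/s at 14 Mpps with 64B frames
  echo '1000*8*10250000/10^9' | bc   # ~82 Gbit/s at only ~10.25 Mpps with 1000B frames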


Thanks



-- 
Paweł Staszewski

