Netdev Archive on lore.kernel.org
* Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
@ 2019-11-29 22:00 Paweł Staszewski
  2019-11-29 22:13 ` Paweł Staszewski
  2019-12-01 16:05 ` David Ahern
  0 siblings, 2 replies; 6+ messages in thread
From: Paweł Staszewski @ 2019-11-29 22:00 UTC (permalink / raw)
  To: netdev

As always - each year I summarize network performance for
routing applications, i.e. a Linux router on the native Linux kernel (without
XDP/DPDK/VPP etc.) :)

HW setup:

Server (Supermicro SYS-1019P-WTR)

1x Intel 6146

2x Mellanox ConnectX-5 (100G) (installed in two different x16 PCIe
gen 3.1 slots)

6x 8GB DDR4-2666 (it really matters, because 100G is about 12.5 GB/s of
memory bandwidth in one direction)
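As a side note, the arithmetic behind that 12.5 GB/s figure can be sketched like this (the two-touch line is an assumption about DMA write plus read on a forwarding path, not a measured value):

```python
# Line rate -> implied memory bandwidth, in GB/s.
# 100 Gbit/s / 8 bits per byte = 12.5 GB/s in one direction.
def mem_bw_gbytes(gbit_per_s, touches=1):
    # "touches" = how many times each byte crosses the memory bus
    # (e.g. 2 for RX DMA write + TX DMA read when forwarding).
    return gbit_per_s / 8 * touches

print(mem_bw_gbytes(100))     # 12.5
print(mem_bw_gbytes(100, 2))  # 25.0
```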


And here it is:

perf top at 72 Gbit/s RX and 72 Gbit/s TX (at the same time)

    PerfTop:   91202 irqs/sec  kernel:99.7%  exact: 100.0% [4000Hz 
cycles:ppp],  (all, 24 CPUs)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 


      7.56%  [kernel]       [k] __dev_queue_xmit
      5.27%  [kernel]       [k] build_skb
      4.41%  [kernel]       [k] rr_transmit
      4.17%  [kernel]       [k] fib_table_lookup
      3.83%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
      3.30%  [kernel]       [k] mlx5e_sq_xmit
      3.14%  [kernel]       [k] __netif_receive_skb_core
      2.48%  [kernel]       [k] netif_skb_features
      2.36%  [kernel]       [k] _raw_spin_trylock
      2.27%  [kernel]       [k] dev_hard_start_xmit
      2.26%  [kernel]       [k] dev_gro_receive
      2.20%  [kernel]       [k] mlx5e_handle_rx_cqe_mpwrq
      1.92%  [kernel]       [k] mlx5_eq_comp_int
      1.91%  [kernel]       [k] mlx5e_poll_tx_cq
      1.74%  [kernel]       [k] tcp_gro_receive
      1.68%  [kernel]       [k] memcpy_erms
      1.64%  [kernel]       [k] kmem_cache_free_bulk
      1.57%  [kernel]       [k] inet_gro_receive
      1.55%  [kernel]       [k] netdev_pick_tx
      1.52%  [kernel]       [k] ip_forward
      1.45%  [kernel]       [k] team_xmit
      1.40%  [kernel]       [k] vlan_do_receive
      1.37%  [kernel]       [k] team_handle_frame
      1.36%  [kernel]       [k] __build_skb
      1.33%  [kernel]       [k] ipt_do_table
      1.33%  [kernel]       [k] mlx5e_poll_rx_cq
      1.28%  [kernel]       [k] ip_finish_output2
      1.26%  [kernel]       [k] vlan_passthru_hard_header
      1.20%  [kernel]       [k] netdev_core_pick_tx
      0.93%  [kernel]       [k] ip_rcv_core.isra.22.constprop.27
      0.87%  [kernel]       [k] validate_xmit_skb.isra.148
      0.87%  [kernel]       [k] ip_route_input_rcu
      0.78%  [kernel]       [k] kmem_cache_alloc
      0.77%  [kernel]       [k] mlx5e_handle_rx_dim
      0.71%  [kernel]       [k] iommu_need_mapping
      0.69%  [kernel]       [k] tasklet_action_common.isra.21
      0.66%  [kernel]       [k] mlx5e_xmit
      0.65%  [kernel]       [k] mlx5e_post_rx_mpwqes
      0.63%  [kernel]       [k] _raw_spin_lock
      0.61%  [kernel]       [k] ip_sublist_rcv
      0.57%  [kernel]       [k] skb_release_data
      0.53%  [kernel]       [k] __local_bh_enable_ip
      0.53%  [kernel]       [k] tcp4_gro_receive
      0.51%  [kernel]       [k] pfifo_fast_dequeue
      0.51%  [kernel]       [k] page_frag_free
      0.50%  [kernel]       [k] kmem_cache_free
      0.47%  [kernel]       [k] dma_direct_map_page
      0.45%  [kernel]       [k] native_irq_return_iret
      0.44%  [kernel]       [k] __slab_free.isra.89
      0.43%  [kernel]       [k] skb_gro_receive
      0.43%  [kernel]       [k] napi_gro_receive
      0.43%  [kernel]       [k] __do_softirq
      0.41%  [kernel]       [k] sch_direct_xmit
      0.41%  [kernel]       [k] ip_rcv_finish_core.isra.19
      0.40%  [kernel]       [k] skb_network_protocol
      0.40%  [kernel]       [k] __get_xps_queue_idx


I'm using team (2x 100G LAG) - that is why there is some load here:

      4.41%  [kernel]       [k] rr_transmit



No discards on interfaces:

ethtool -S enp179s0f0 | grep disc
      rx_discards_phy: 0
      tx_discards_phy: 0

ethtool -S enp179s0f1 | grep disc
      rx_discards_phy: 0
      tx_discards_phy: 0

STREAM memory bandwidth test while carrying 72G/72G traffic:

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           38948.8     0.004368     0.004108     0.004533
Scale:          37914.6     0.004473     0.004220     0.004802
Add:            43134.6     0.005801     0.005564     0.006086
Triad:          42934.1     0.005696     0.005590     0.005901
-------------------------------------------------------------
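A rough way to compare those STREAM numbers with the forwarding load - the per-byte "touches" count here is a guess (RX DMA write, CPU read, TX DMA read), so treat this only as a ballpark sketch:

```python
# MB/s of memory bandwidth needed to forward a given line rate.
def needed_mb_s(gbit_each_way, touches=3):
    # 1 Gbit/s = 125 MB/s; each byte may cross the bus several times.
    return gbit_each_way * 125 * touches

need = needed_mb_s(72)      # 27000 MB/s for 72 Gbit/s one way
stream_copy = 38948.8       # Best Rate Copy measured above
print(stream_copy - need)   # roughly 12000 MB/s of headroom - not a lot
```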


And some links to screenshots:

Softirqs

https://pasteboard.co/IIZkGrw.png

And bandwidth / CPU / pps graphs:

https://pasteboard.co/IIZl6XP.png


Currently it looks like the biggest problem for 100G is CPU->mem->NIC
bandwidth, or the NIC doorbell / page cache at RX processing - because what
I can see is that if I run iperf on this host I can TX a full 100G, but I
can't RX 100G when I flood this host from a packet generator (it starts to
drop packets at around 82 Gbit/s) - so this is not a pps problem but a
bandwidth problem.

For example, I can flood RX with 14 Mpps of 64 B packets without NIC
discards, but I can't flood it with 1000 B frames at the same pps - because
when it reaches 82 Gbit/s the NICs start to report discards.
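For context, the wire-rate arithmetic behind those two cases, assuming standard Ethernet per-frame overhead (8 B preamble/SFD + 12 B inter-frame gap):

```python
# On-wire throughput for a given packet rate and frame size.
def wire_gbps(mpps, frame_bytes, per_frame_overhead=20):
    return mpps * 1e6 * (frame_bytes + per_frame_overhead) * 8 / 1e9

print(wire_gbps(14, 64))    # ~9.4 Gbit/s  - well under the RX limit
print(wire_gbps(14, 1000))  # ~114 Gbit/s - far past the ~82 Gbit/s limit
```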


Thanks



-- 
Paweł Staszewski



* Re: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
  2019-11-29 22:00 Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance Paweł Staszewski
@ 2019-11-29 22:13 ` Paweł Staszewski
  2019-12-01 16:05 ` David Ahern
  1 sibling, 0 replies; 6+ messages in thread
From: Paweł Staszewski @ 2019-11-29 22:13 UTC (permalink / raw)
  To: netdev


On 29.11.2019 at 23:00, Paweł Staszewski wrote:
> As always - each year i need to summarize network performance for
> routing applications like linux router on native Linux kernel (without
> xdp/dpdk/vpp etc) :)
>
> [... full message quoted above ...]
>
Forgot to add: this is a forwarding scenario - the router routes packets
from one 100G interface to another 100G interface and vice versa (full BGP
feed x4 from 4 different upstreams) - 700k+ flows.





* Re: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
  2019-11-29 22:00 Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance Paweł Staszewski
  2019-11-29 22:13 ` Paweł Staszewski
@ 2019-12-01 16:05 ` David Ahern
  2019-12-02 10:09   ` Paweł Staszewski
  1 sibling, 1 reply; 6+ messages in thread
From: David Ahern @ 2019-12-01 16:05 UTC (permalink / raw)
  To: Paweł Staszewski, netdev

On 11/29/19 4:00 PM, Paweł Staszewski wrote:
> As always - each year i need to summarize network performance for
> routing applications like linux router on native Linux kernel (without
> xdp/dpdk/vpp etc) :)
> 

Do you keep past profiles? How does this profile (and traffic rates)
compare to older kernels - e.g., 5.0 or 4.19?


> [... HW setup and perf profile quoted above ...]



* Re: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
  2019-12-01 16:05 ` David Ahern
@ 2019-12-02 10:09   ` Paweł Staszewski
  2019-12-02 10:53     ` Paolo Abeni
  0 siblings, 1 reply; 6+ messages in thread
From: Paweł Staszewski @ 2019-12-02 10:09 UTC (permalink / raw)
  To: David Ahern, netdev, Jesper Dangaard Brouer


On 01.12.2019 at 17:05, David Ahern wrote:
> On 11/29/19 4:00 PM, Paweł Staszewski wrote:
>> As always - each year i need to summarize network performance for
>> routing applications like linux router on native Linux kernel (without
>> xdp/dpdk/vpp etc) :)
>>
> Do you keep past profiles? How does this profile (and traffic rates)
> compare to older kernels - e.g., 5.0 or 4.19?
>
>
Yes - for 4.19:

Max bandwidth was about 40-42 Gbit/s RX / 40-42 Gbit/s TX of
forwarded (routed) traffic.

And after the "order-0 pages" patches - max was 50 Gbit/s RX + 50 Gbit/s TX
(forwarding - bandwidth max).

(The current kernel has almost doubled this.)

And also the old perf top (from kernel 4.19) - before the "order-0 pages" patch:

    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz 
cycles],  (all, 56 CPUs)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 


     26.78%  [kernel]       [k] queued_spin_lock_slowpath
      9.09%  [kernel]       [k] mlx5e_skb_from_cqe_linear
      4.94%  [kernel]       [k] mlx5e_sq_xmit
      3.63%  [kernel]       [k] memcpy_erms
      3.30%  [kernel]       [k] fib_table_lookup
      3.26%  [kernel]       [k] build_skb
      2.41%  [kernel]       [k] mlx5e_poll_tx_cq
      2.11%  [kernel]       [k] get_page_from_freelist
      1.51%  [kernel]       [k] vlan_do_receive
      1.51%  [kernel]       [k] _raw_spin_lock
      1.43%  [kernel]       [k] __dev_queue_xmit
      1.41%  [kernel]       [k] dev_gro_receive
      1.34%  [kernel]       [k] mlx5e_poll_rx_cq
      1.26%  [kernel]       [k] tcp_gro_receive
      1.21%  [kernel]       [k] free_one_page
      1.13%  [kernel]       [k] swiotlb_map_page
      1.13%  [kernel]       [k] mlx5e_post_rx_wqes
      1.05%  [kernel]       [k] pfifo_fast_dequeue
      1.05%  [kernel]       [k] mlx5e_handle_rx_cqe
      1.03%  [kernel]       [k] ip_finish_output2
      1.02%  [kernel]       [k] ipt_do_table
      0.96%  [kernel]       [k] inet_gro_receive
      0.91%  [kernel]       [k] mlx5_eq_int
      0.88%  [kernel]       [k] __slab_free.isra.79
      0.86%  [kernel]       [k] __build_skb
      0.84%  [kernel]       [k] page_frag_free
      0.76%  [kernel]       [k] skb_release_data
      0.75%  [kernel]       [k] __netif_receive_skb_core
      0.75%  [kernel]       [k] irq_entries_start
      0.71%  [kernel]       [k] ip_route_input_rcu
      0.65%  [kernel]       [k] vlan_dev_hard_start_xmit
      0.56%  [kernel]       [k] ip_forward
      0.56%  [kernel]       [k] __memcpy
      0.52%  [kernel]       [k] kmem_cache_alloc
      0.52%  [kernel]       [k] kmem_cache_free_bulk
      0.49%  [kernel]       [k] mlx5e_page_release
      0.47%  [kernel]       [k] netif_skb_features
      0.47%  [kernel]       [k] mlx5e_build_rx_skb
      0.47%  [kernel]       [k] dev_hard_start_xmit
      0.43%  [kernel]       [k] __page_pool_put_page
      0.43%  [kernel]       [k] __netif_schedule
      0.43%  [kernel]       [k] mlx5e_xmit
      0.41%  [kernel]       [k] __qdisc_run
      0.41%  [kernel]       [k] validate_xmit_skb.isra.142
      0.41%  [kernel]       [k] swiotlb_unmap_page
      0.40%  [kernel]       [k] inet_lookup_ifaddr_rcu
      0.34%  [kernel]       [k] ip_rcv_core.isra.20.constprop.25
      0.34%  [kernel]       [k] tcp4_gro_receive
      0.29%  [kernel]       [k] _raw_spin_lock_irqsave
      0.29%  [kernel]       [k] napi_consume_skb
      0.29%  [kernel]       [k] skb_gro_receive
      0.29%  [kernel]       [k] ___slab_alloc.isra.80
      0.27%  [kernel]       [k] eth_type_trans
      0.26%  [kernel]       [k] __free_pages_ok
      0.26%  [kernel]       [k] __get_xps_queue_idx
      0.24%  [kernel]       [k] _raw_spin_trylock
      0.23%  [kernel]       [k] __local_bh_enable_ip
      0.22%  [kernel]       [k] pfifo_fast_enqueue
      0.21%  [kernel]       [k] tasklet_action_common.isra.21
      0.21%  [kernel]       [k] sch_direct_xmit
      0.21%  [kernel]       [k] skb_network_protocol
      0.21%  [kernel]       [k] kmem_cache_free
      0.20%  [kernel]       [k] netdev_pick_tx
      0.18%  [kernel]       [k] napi_gro_complete
      0.18%  [kernel]       [k] __sched_text_start
      0.18%  [kernel]       [k] mlx5e_xdp_handle
      0.17%  [kernel]       [k] ip_finish_output
      0.16%  [kernel]       [k] napi_gro_flush
      0.16%  [kernel]       [k] vlan_passthru_hard_header
      0.16%  [kernel]       [k] skb_segment
      0.15%  [kernel]       [k] __alloc_pages_nodemask
      0.15%  [kernel]       [k] mlx5e_features_check
      0.15%  [kernel]       [k] mlx5e_napi_poll
      0.15%  [kernel]       [k] napi_gro_receive
      0.14%  [kernel]       [k] fib_validate_source
      0.14%  [kernel]       [k] _raw_spin_lock_irq
      0.14%  [kernel]       [k] inet_gro_complete
      0.14%  [kernel]       [k] get_partial_node.isra.78
      0.13%  [kernel]       [k] napi_complete_done
      0.13%  [kernel]       [k] ip_rcv_finish_core.isra.17
      0.13%  [kernel]       [k] cmd_exec


After the "order-0 pages" patch:

    PerfTop:  104692 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz 
cycles],  (all, 56 CPUs)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 


      9.06%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
      6.43%  [kernel]       [k] tasklet_action_common.isra.21
      5.68%  [kernel]       [k] fib_table_lookup
      4.89%  [kernel]       [k] irq_entries_start
      4.53%  [kernel]       [k] mlx5_eq_int
      4.10%  [kernel]       [k] build_skb
      3.39%  [kernel]       [k] mlx5e_poll_tx_cq
      3.38%  [kernel]       [k] mlx5e_sq_xmit
      2.73%  [kernel]       [k] mlx5e_poll_rx_cq
      2.18%  [kernel]       [k] __dev_queue_xmit
      2.13%  [kernel]       [k] vlan_do_receive
      2.12%  [kernel]       [k] mlx5e_handle_rx_cqe_mpwrq
      2.00%  [kernel]       [k] ip_finish_output2
      1.87%  [kernel]       [k] mlx5e_post_rx_mpwqes
      1.86%  [kernel]       [k] memcpy_erms
      1.85%  [kernel]       [k] ipt_do_table
      1.70%  [kernel]       [k] dev_gro_receive
      1.39%  [kernel]       [k] __netif_receive_skb_core
      1.31%  [kernel]       [k] inet_gro_receive
      1.21%  [kernel]       [k] ip_route_input_rcu
      1.21%  [kernel]       [k] tcp_gro_receive
      1.13%  [kernel]       [k] _raw_spin_lock
      1.08%  [kernel]       [k] __build_skb
      1.06%  [kernel]       [k] kmem_cache_free_bulk
      1.05%  [kernel]       [k] __softirqentry_text_start
      1.03%  [kernel]       [k] vlan_dev_hard_start_xmit
      0.98%  [kernel]       [k] pfifo_fast_dequeue
      0.95%  [kernel]       [k] mlx5e_xmit
      0.95%  [kernel]       [k] page_frag_free
      0.88%  [kernel]       [k] ip_forward
      0.81%  [kernel]       [k] dev_hard_start_xmit
      0.78%  [kernel]       [k] rcu_irq_exit
      0.77%  [kernel]       [k] netif_skb_features
      0.72%  [kernel]       [k] napi_complete_done
      0.72%  [kernel]       [k] kmem_cache_alloc
      0.68%  [kernel]       [k] validate_xmit_skb.isra.142
      0.66%  [kernel]       [k] ip_rcv_core.isra.20.constprop.25
      0.58%  [kernel]       [k] swiotlb_map_page
      0.57%  [kernel]       [k] __qdisc_run
      0.56%  [kernel]       [k] tasklet_action
      0.54%  [kernel]       [k] __get_xps_queue_idx
      0.54%  [kernel]       [k] inet_lookup_ifaddr_rcu
      0.50%  [kernel]       [k] tcp4_gro_receive
      0.49%  [kernel]       [k] skb_release_data
      0.47%  [kernel]       [k] eth_type_trans
      0.40%  [kernel]       [k] sch_direct_xmit
      0.40%  [kernel]       [k] net_rx_action
      0.39%  [kernel]       [k] __local_bh_enable_ip


>> [... HW setup and current perf profile quoted above ...]

-- 
Paweł Staszewski



* Re: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
  2019-12-02 10:09   ` Paweł Staszewski
@ 2019-12-02 10:53     ` Paolo Abeni
  2019-12-02 16:23       ` Paweł Staszewski
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Abeni @ 2019-12-02 10:53 UTC (permalink / raw)
  To: Paweł Staszewski, David Ahern, netdev, Jesper Dangaard Brouer

On Mon, 2019-12-02 at 11:09 +0100, Paweł Staszewski wrote:
> W dniu 01.12.2019 o 17:05, David Ahern pisze:
> > On 11/29/19 4:00 PM, Paweł Staszewski wrote:
> > > As always - each year i need to summarize network performance for
> > > routing applications like linux router on native Linux kernel (without
> > > xdp/dpdk/vpp etc) :)
> > > 
> > Do you keep past profiles? How does this profile (and traffic rates)
> > compare to older kernels - e.g., 5.0 or 4.19?
> > 
> > 
> Yes - so for 4.19:
> 
> Max bandwidth was about 40-42Gbit/s RX / 40-42Gbit/s TX of 
> forwarded(routed) traffic
> 
> And after "order-0 pages" patches - max was 50Gbit/s RX + 50Gbit/s TX 
> (forwarding - bandwidth max)
> 
> (current kernel almost doubled this)

Looks like we are on the right track ;)

[...]
> After "order-0 pages" patch
> 
>     PerfTop:  104692 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz 
> cycles],  (all, 56 CPUs)
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
> 
> 
>       9.06%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
>       6.43%  [kernel]       [k] tasklet_action_common.isra.21
>       5.68%  [kernel]       [k] fib_table_lookup
>       4.89%  [kernel]       [k] irq_entries_start
>       4.53%  [kernel]       [k] mlx5_eq_int
>       4.10%  [kernel]       [k] build_skb
>       3.39%  [kernel]       [k] mlx5e_poll_tx_cq
>       3.38%  [kernel]       [k] mlx5e_sq_xmit
>       2.73%  [kernel]       [k] mlx5e_poll_rx_cq

Compared to the current kernel perf figures, it looks like most of the
gains come from driver changes.

[... current perf figures follow ...]
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> 
>       7.56%  [kernel]       [k] __dev_queue_xmit

This is a bit surprising to me. I guess this is due to
'__dev_queue_xmit()' being called twice per packet (team, NIC) and to
the retpoline overhead.

>       1.74%  [kernel]       [k] tcp_gro_receive

If the reference use-case is with a quite large number of concurrent
flows, I guess you can try disabling GRO
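(For reference, a minimal sketch of what that would look like - the interface names are the ones from this setup, and GRO can be re-enabled the same way with "on":)

```shell
# Turn GRO off on the physical slave ports of the team device
ethtool -K enp179s0f0 gro off
ethtool -K enp179s0f1 gro off
# Confirm the new setting
ethtool -k enp179s0f0 | grep generic-receive-offload
```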

Cheers,

Paolo



* Re: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network performance
  2019-12-02 10:53     ` Paolo Abeni
@ 2019-12-02 16:23       ` Paweł Staszewski
  0 siblings, 0 replies; 6+ messages in thread
From: Paweł Staszewski @ 2019-12-02 16:23 UTC (permalink / raw)
  To: Paolo Abeni, David Ahern, netdev, Jesper Dangaard Brouer


On 02.12.2019 at 11:53, Paolo Abeni wrote:
> On Mon, 2019-12-02 at 11:09 +0100, Paweł Staszewski wrote:
>> [... earlier exchange quoted above ...]
> Looks like we are on the right track ;)
>
> [...]
>> After "order-0 pages" patch
>>
>> [... perf top figures quoted above ...]
> Compared to the current kernel perf figures, it looks like most of the
> gains come from driver changes.
>
> [... current perf figures follow ...]
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>
>>        7.56%  [kernel]       [k] __dev_queue_xmit
> This is a bit surprising to me. I guess this is due to
> '__dev_queue_xmit()' being called twice per packet (team, NIC) and to
> the retpoline overhead.
>
>>        1.74%  [kernel]       [k] tcp_gro_receive
> If the reference use-case is with a quite large number of concurrent
> flows, I guess you can try disabling GRO

Disabling GRO with teamed interfaces is not good: after disabling GRO on
the physical interfaces, CPU load is about 10% higher on all cores.

And an observation:

With GRO enabled on the interfaces - team0 packets per second:

   iface                  Rx                     Tx                    Total
==============================================================================
   team0:       5952483.50 KB/s        6028436.50 KB/s       11980919.00 KB/s
------------------------------------------------------------------------------

And softnetstats:

CPU        total/sec   dropped/sec   squeezed/sec   collision/sec   rx_rps/sec   flow_limit/sec
CPU:00       1014977             0             35               0            0                0
CPU:01       1074461             0             30               0            0                0
CPU:02       1020460             0             34               0            0                0
CPU:03       1077624             0             34               0            0                0
CPU:04       1005102             0             32               0            0                0
CPU:05       1097107             0             46               0            0                0
CPU:06        997877             0             24               0            0                0
CPU:07       1056216             0             34               0            0                0
CPU:08        856567             0             34               0            0                0
CPU:09        862527             0             23               0            0                0
CPU:10        876107             0             34               0            0                0
CPU:11        759275             0             27               0            0                0
CPU:12        817307             0             27               0            0                0
CPU:13        868073             0             21               0            0                0
CPU:14        837783             0             34               0            0                0
CPU:15        817946             0             27               0            0                0
CPU:16        785500             0             25               0            0                0
CPU:17        851276             0             28               0            0                0
CPU:18        843888             0             29               0            0                0
CPU:19        924840             0             34               0            0                0
CPU:20        884879             0             37               0            0                0
CPU:21        841461             0             28               0            0                0
CPU:22        819436             0             32               0            0                0
CPU:23        872843             0             32               0            0                0

Summed:     21863531             0            740               0            0                0


With GRO disabled on the interfaces - team0 packets per second:

   iface                  Rx                     Tx                    Total
==============================================================================
   team0:       5952483.50 KB/s        6028436.50 KB/s       11980919.00 KB/s
------------------------------------------------------------------------------

And softnet stats:

CPU        total/sec   dropped/sec   squeezed/sec   collision/sec   rx_rps/sec   flow_limit/sec
CPU:00        625288             0             23               0            0                0
CPU:01        605239             0             24               0            0                0
CPU:02        644965             0             26               0            0                0
CPU:03        620264             0             30               0            0                0
CPU:04        603416             0             25               0            0                0
CPU:05        597838             0             23               0            0                0
CPU:06        580028             0             22               0            0                0
CPU:07        604274             0             23               0            0                0
CPU:08        556119             0             26               0            0                0
CPU:09        494997             0             23               0            0                0
CPU:10        514759             0             23               0            0                0
CPU:11        500333             0             22               0            0                0
CPU:12        497956             0             23               0            0                0
CPU:13        535194             0             14               0            0                0
CPU:14        504304             0             24               0            0                0
CPU:15        489015             0             18               0            0                0
CPU:16        487249             0             24               0            0                0
CPU:17        472023             0             23               0            0                0
CPU:18        539454             0             24               0            0                0
CPU:19        499901             0             19               0            0                0
CPU:20        479945             0             26               0            0                0
CPU:21        486800             0             29               0            0                0
CPU:22        466916             0             26               0            0                0
CPU:23        559730             0             34               0            0                0

Summed:     12966008             0            573               0            0                0

Maybe without team it would be better.
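(Side note: the per-CPU counters above come from /proc/net/softnet_stat, which holds cumulative hex counters - the per-second figures are derived from successive samples. A minimal parser sketch, assuming the usual field layout of processed/dropped/time_squeeze:)

```python
# Parse /proc/net/softnet_stat: one row of hex fields per CPU.
# Field 0 = packets processed, 1 = dropped, 2 = time_squeeze ("squeezed").
def parse_softnet(text):
    rows = []
    for cpu, line in enumerate(text.strip().splitlines()):
        f = [int(x, 16) for x in line.split()]
        rows.append({"cpu": cpu, "total": f[0],
                     "dropped": f[1], "squeezed": f[2]})
    return rows

# Fabricated two-CPU sample (real files have more columns; extras are kept
# but unused here)
sample = ("000f7cf1 00000000 00000023 0 0 0 0 0 0 0 0\n"
          "0010663d 00000000 0000001e 0 0 0 0 0 0 0 0\n")
for r in parse_softnet(sample):
    print(r)
```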

>
> Cheers,
>
> Paolo
>
-- 
Paweł Staszewski


