From: Paweł Staszewski
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
Date: Wed, 31 Oct 2018 23:45:04 +0100
To: Eric Dumazet, netdev

On 31.10.2018 at 23:20, Paweł Staszewski wrote:
>
> On 31.10.2018 at 23:09, Eric Dumazet wrote:
>>
>> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>>> Hi
>>>
>>> So maybe someone will be interested in how the Linux kernel handles normal
>>> traffic (not pktgen :) ).
>>>
>>>
>>> Server HW configuration:
>>>
>>> CPU:  Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>
>>> NICs: 2x 100G Mellanox ConnectX-4 (connected to PCIe x16, 8 GT/s)
>>>
>>>
>>> Server software:
>>>
>>> FRR - as routing daemon
>>>
>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the
>>> local NUMA node)
>>>
>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the
>>> local NUMA node)
>>>
>>>
>>> Maximum traffic that the server can handle:
>>>
>>> Bandwidth:
>>>
>>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>   input: /proc/net/dev type: rate
>>>   \        iface                Rx              Tx           Total
>>> ==============================================================================
>>>        enp175s0f1:      28.51 Gb/s      37.24 Gb/s      65.74 Gb/s
>>>        enp175s0f0:      38.07 Gb/s      28.44 Gb/s      66.51 Gb/s
>>> ------------------------------------------------------------------------------
>>>             total:      66.58 Gb/s      65.67 Gb/s     132.25 Gb/s
>>>
>>>
>>> Packets per second:
>>>
>>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>   input: /proc/net/dev type: rate
>>>   -        iface                Rx                 Tx                Total
>>> ==============================================================================
>>>        enp175s0f1:   5248589.00 P/s    3486617.75 P/s    8735207.00 P/s
>>>        enp175s0f0:   3557944.25 P/s    5232516.00 P/s    8790460.00 P/s
>>> ------------------------------------------------------------------------------
>>>             total:   8806533.00 P/s    8719134.00 P/s   17525668.00 P/s
>>>
>>>
>>> After reaching those limits the NICs on the upstream side (the one with more
>>> RX traffic) start to drop packets.
>>>
>>> I just don't understand why the server can't handle more bandwidth
>>> (~40 Gbit/s is the point where all CPUs are at 100% utilization), even
>>> though the pps on the RX side keeps increasing.
>>>
>>> I was thinking I might have reached some PCIe x16 limit - but x16 at
>>> 8 GT/s is ~126 Gbit/s - and also when testing with pktgen I can reach much
>>> more bandwidth and pps (about 4x more compared to normal internet traffic).
>>>
>>> I am wondering if there is something that can be improved here.
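A quick back-of-the-envelope check on the "x16 8GT is 126Gbit" figure quoted
above - this assumes PCIe 3.0 (8 GT/s per lane, 128b/130b encoding); the
af:00.0 address below is only an example for the ConnectX-4 in this box,
adjust for your own slot:

  # raw PCIe 3.0 x16 throughput, before TLP/DLLP/protocol overhead:
  #   16 lanes * 8 GT/s * (128/130) = ~126 Gbit/s per direction
  # verify the slot actually negotiated x16 at 8 GT/s:
  lspci -s af:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
  # LnkSta should report "Speed 8GT/s, Width x16"; a downtrained link
  # (e.g. x8 or 5GT/s) would halve or quarter the available bandwidth.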
>>>
>>>
>>> Some more information / counters / stats and perf top below.
>>>
>>> Perf top flame graph:
>>>
>>> https://uploadfiles.io/7zo6u
>>>
>>>
>>> System configuration (long):
>>>
>>> cat /sys/devices/system/node/node1/cpulist
>>> 14-27,42-55
>>> cat /sys/class/net/enp175s0f0/device/numa_node
>>> 1
>>> cat /sys/class/net/enp175s0f1/device/numa_node
>>> 1
>>>
>>>
>>> ip -s -d link ls dev enp175s0f0
>>> 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0
>>>     addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>     RX: bytes            packets       errors  dropped  overrun  mcast
>>>     184142375840858      141347715974  2       2806325  0        85050528
>>>     TX: bytes            packets       errors  dropped  carrier  collsns
>>>     99270697277430       172227994003  0       0        0        0
>>>
>>> ip -s -d link ls dev enp175s0f1
>>> 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
>>>     addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>     RX: bytes            packets       errors  dropped  overrun  mcast
>>>     99686284170801       173507590134  61      669685   0        100304421
>>>     TX: bytes            packets       errors  dropped  carrier  collsns
>>>     184435107970545      142383178304  0       0        0        0
>>>
>>>
>>> ./softnet.sh
>>> cpu      total    dropped   squeezed  collision        rps  flow_limit
>>>
>>>
>>>    PerfTop:  108490 irqs/sec  kernel:99.6%  exact: 0.0% [4000Hz cycles],  (all, 56 CPUs)
>>> ------------------------------------------------------------------------------
>>>
>>>     26.78%  [kernel]       [k] queued_spin_lock_slowpath
>>
>> This is highly suspect.
>>
>> A call graph (perf record -a -g sleep 1; perf report --stdio) would
>> tell what is going on.
>
> perf report:
> https://ufile.io/rqp0h
>
>> With that many TX/RX queues, I would expect you to not use RPS/RFS,
>> and to have a 1/1 RX/TX mapping,
>> so I do not know what could cause spinlock contention.
>
> And yes, there is no RPS/RFS - just a 1/1 RX/TX mapping with IRQ affinity
> pinned to the local NUMA node's CPUs for the network controller, 28 RX+TX
> queues per NIC.
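For completeness, a minimal sketch of how that mapping can be checked and
applied (assuming the mlx5 completion IRQs are the ones matching "mlx5" in
/proc/interrupts, and that node 1's CPU list is 14-27,42-55 as shown above;
queue counts and CPU numbers are specific to this box):

  # 28 combined channels per port
  ethtool -l enp175s0f0
  ethtool -l enp175s0f1

  # pin the mlx5 IRQs round-robin onto the NUMA-node-1 CPUs
  # (this grep catches both ports plus async IRQs; narrow it per PCI
  #  address, e.g. "grep 0000:af:00.0", for a strict per-port 1/1 split)
  cpus=(14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55)
  i=0
  for irq in $(grep mlx5 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
      echo "${cpus[$((i % ${#cpus[@]}))]}" > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done

  # confirm RPS is off (all-zero masks) on the RX queues
  grep . /sys/class/net/enp175s0f0/queues/rx-*/rps_cpus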