From: Paweł Staszewski
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
Date: Wed, 31 Oct 2018 23:45:04 +0100
To: Eric Dumazet, netdev

On 31.10.2018 at 23:20, Paweł Staszewski wrote:
>
> On 31.10.2018 at 23:09, Eric Dumazet wrote:
>>
>> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>>> Hi
>>>
>>> So maybe someone will be interested in how the Linux kernel handles normal
>>> traffic (not pktgen :) ).
>>>
>>>
>>> Server HW configuration:
>>>
>>> CPU:  Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>
>>> NICs: 2x 100G Mellanox ConnectX-4 (connected to PCIe x16, 8 GT/s)
>>>
>>>
>>> Server software:
>>>
>>> FRR - as routing daemon
>>>
>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the
>>> local NUMA node)
>>>
>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the
>>> local NUMA node)
>>>
>>>
>>> Maximum traffic that the server can handle:
>>>
>>> Bandwidth:
>>>
>>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>   input: /proc/net/dev type: rate
>>>   \        iface                Rx              Tx           Total
>>> ==============================================================================
>>>        enp175s0f1:      28.51 Gb/s      37.24 Gb/s      65.74 Gb/s
>>>        enp175s0f0:      38.07 Gb/s      28.44 Gb/s      66.51 Gb/s
>>> ------------------------------------------------------------------------------
>>>             total:      66.58 Gb/s      65.67 Gb/s     132.25 Gb/s
>>>
>>>
>>> Packets per second:
>>>
>>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>   input: /proc/net/dev type: rate
>>>   -        iface                Rx                 Tx                Total
>>> ==============================================================================
>>>        enp175s0f1:   5248589.00 P/s    3486617.75 P/s    8735207.00 P/s
>>>        enp175s0f0:   3557944.25 P/s    5232516.00 P/s    8790460.00 P/s
>>> ------------------------------------------------------------------------------
>>>             total:   8806533.00 P/s    8719134.00 P/s   17525668.00 P/s
>>>
>>>
>>> After reaching those limits the NICs on the upstream side (the one with more
>>> RX traffic) start to drop packets.
>>>
>>> I just don't understand why the server can't handle more bandwidth
>>> (~40 Gbit/s is the point where all CPUs are at 100% utilization), even
>>> though the pps on the RX side keeps increasing.
>>>
>>> I was thinking I might have reached some PCIe x16 limit - but x16 at
>>> 8 GT/s is ~126 Gbit/s - and also when testing with pktgen I can reach much
>>> more bandwidth and pps (about 4x more compared to normal internet traffic).
>>>
>>> I am wondering if there is something that can be improved here.
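A quick back-of-the-envelope check on the "x16 8GT is 126Gbit" figure quoted
above - this assumes PCIe 3.0 (8 GT/s per lane, 128b/130b encoding); the
af:00.0 address below is only an example for the ConnectX-4 in this box,
adjust for your own slot:

  # raw PCIe 3.0 x16 throughput, before TLP/DLLP/protocol overhead:
  #   16 lanes * 8 GT/s * (128/130) = ~126 Gbit/s per direction
  # verify the slot actually negotiated x16 at 8 GT/s:
  lspci -s af:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
  # LnkSta should report "Speed 8GT/s, Width x16"; a downtrained link
  # (e.g. x8 or 5GT/s) would halve or quarter the available bandwidth.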
>>>
>>>
>>> Some more information / counters / stats and perf top below.
>>>
>>> Perf top flame graph:
>>>
>>> https://uploadfiles.io/7zo6u
>>>
>>>
>>> System configuration (long):
>>>
>>> cat /sys/devices/system/node/node1/cpulist
>>> 14-27,42-55
>>> cat /sys/class/net/enp175s0f0/device/numa_node
>>> 1
>>> cat /sys/class/net/enp175s0f1/device/numa_node
>>> 1
>>>
>>>
>>> ip -s -d link ls dev enp175s0f0
>>> 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0
>>>     addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>     RX: bytes            packets       errors  dropped  overrun  mcast
>>>     184142375840858      141347715974  2       2806325  0        85050528
>>>     TX: bytes            packets       errors  dropped  carrier  collsns
>>>     99270697277430       172227994003  0       0        0        0
>>>
>>> ip -s -d link ls dev enp175s0f1
>>> 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
>>>     addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>     RX: bytes            packets       errors  dropped  overrun  mcast
>>>     99686284170801       173507590134  61      669685   0        100304421
>>>     TX: bytes            packets       errors  dropped  carrier  collsns
>>>     184435107970545      142383178304  0       0        0        0
>>>
>>>
>>> ./softnet.sh
>>> cpu      total    dropped   squeezed  collision        rps  flow_limit
>>>
>>>
>>>    PerfTop:  108490 irqs/sec  kernel:99.6%  exact: 0.0% [4000Hz cycles],  (all, 56 CPUs)
>>> ------------------------------------------------------------------------------
>>>
>>>     26.78%  [kernel]       [k] queued_spin_lock_slowpath
>>
>> This is highly suspect.
>>
>> A call graph (perf record -a -g sleep 1; perf report --stdio) would
>> tell what is going on.
>
> perf report:
> https://ufile.io/rqp0h
>
>> With that many TX/RX queues, I would expect you to not use RPS/RFS,
>> and to have a 1/1 RX/TX mapping,
>> so I do not know what could cause spinlock contention.
>
> And yes, there is no RPS/RFS - just a 1/1 RX/TX mapping with IRQ affinity
> pinned to the local NUMA node's CPUs for the network controller, 28 RX+TX
> queues per NIC.
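For completeness, a minimal sketch of how that mapping can be checked and
applied (assuming the mlx5 completion IRQs are the ones matching "mlx5" in
/proc/interrupts, and that node 1's CPU list is 14-27,42-55 as shown above;
queue counts and CPU numbers are specific to this box):

  # 28 combined channels per port
  ethtool -l enp175s0f0
  ethtool -l enp175s0f1

  # pin the mlx5 IRQs round-robin onto the NUMA-node-1 CPUs
  # (this grep catches both ports plus async IRQs; narrow it per PCI
  #  address, e.g. "grep 0000:af:00.0", for a strict per-port 1/1 split)
  cpus=(14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55)
  i=0
  for irq in $(grep mlx5 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
      echo "${cpus[$((i % ${#cpus[@]}))]}" > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done

  # confirm RPS is off (all-zero masks) on the RX queues
  grep . /sys/class/net/enp175s0f0/queues/rx-*/rps_cpus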