From: Paweł Staszewski
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
Date: Thu, 1 Nov 2018 11:34:35 +0100
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, netdev, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
 Mel Gorman, Aaron Lu

On 01.11.2018 at 10:22, Jesper Dangaard Brouer wrote:
> On Wed, 31 Oct 2018 23:20:01 +0100
> Paweł Staszewski wrote:
>
>> On 31.10.2018 at 23:09, Eric Dumazet wrote:
>>> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>>>> Hi
>>>>
>>>> So maybe someone will be interested in how the Linux kernel handles
>>>> normal traffic (not pktgen :) )
>
> Pawel, is this live production traffic?

Yes, I moved the server from the testlab to production to check (risking a
little - but this is traffic switched to a backup router :) )

> I know Yoel (Cc) is very interested to know the real-life limitation of
> Linux as a router, especially with VLANs like you use.

So yes, this is real-life traffic, real users - normal mixed internet
traffic being forwarded (including ddos-es :) )

>>>> Server HW configuration:
>>>>
>>>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>>
>>>> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>>>>
>>>> Server software:
>>>>
>>>> FRR - as routing daemon
>>>>
>>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local numa node)
>>>>
>>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local numa node)
>>>>
>>>> Maximum traffic that the server can handle:
>>>>
>>>> Bandwidth
>>>>
>>>>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>   input: /proc/net/dev type: rate
>>>>   \         iface                   Rx                   Tx                Total
>>>> ==============================================================================
>>>>        enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
>>>>        enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
>>>> ------------------------------------------------------------------------------
>>>>             total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
>>>>
>
> Actually a rather impressive number for a Linux router.
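
(Side note on the "28 RSS queues bound to the local numa node" part above:
the binding was done roughly along the lines of the sketch below - this is
only an illustration, not the exact script used on this box. It assumes 28
combined channels and pins each NIC IRQ to one of the node 1 cores
14-27,42-55; the /proc/interrupts naming differs per driver, for mlx5 the
IRQs are usually named mlx5_comp* rather than after the netdev, so the
pattern has to be adjusted.)

#!/bin/bash
# illustrative sketch - pin the NIC IRQs to the NUMA-local cores
# (stop irqbalance first, or it will move them back)
dev=enp175s0f0
ethtool -L "$dev" combined 28
cpus=(14 15 16 17 18 19 20 21 22 23 24 25 26 27
      42 43 44 45 46 47 48 49 50 51 52 53 54 55)
i=0
# adjust the pattern to however the driver names its IRQs
for irq in $(awk -F: "/$dev/ {print \$1}" /proc/interrupts); do
    echo "${cpus[i % ${#cpus[@]}]}" > "/proc/irq/$irq/smp_affinity_list"
    i=$((i + 1))
done
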
>>>> Packets per second:
>>>>
>>>>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>   input: /proc/net/dev type: rate
>>>>   -         iface                   Rx                   Tx                Total
>>>> ==============================================================================
>>>>        enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>>>>        enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
>>>> ------------------------------------------------------------------------------
>>>>             total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
>>>>
>
> Average packet size:
> (28.51*10^9/8)/5248589 = 678.99 bytes
> (38.07*10^9/8)/3557944 = 1337.49 bytes
>
>>>> After reaching those limits the nics on the upstream side (more RX
>>>> traffic) start to drop packets
>>>>
>>>> I just don't understand why the server can't handle more bandwidth
>>>> (~40Gbit/s is the limit where all cpus are at 100% util) - while pps on
>>>> the RX side keeps increasing.
>>>>
>>>> I was thinking that maybe I reached some pcie x16 limit - but x16 8GT
>>>> is 126Gbit - and also when testing with pktgen I can reach more bw
>>>> and pps (like 4x more compared to normal internet traffic)
>>>>
>>>> And wondering if there is something that can be improved here.
>>>>
>>>> Some more information / counters / stats and perf top below:
>>>>
>>>> Perf top flame graph:
>>>>
>>>> https://uploadfiles.io/7zo6u
>
> Thanks a lot for the flame graph!
>
>>>> System configuration (long):
>>>>
>>>> cat /sys/devices/system/node/node1/cpulist
>>>> 14-27,42-55
>>>> cat /sys/class/net/enp175s0f0/device/numa_node
>>>> 1
>>>> cat /sys/class/net/enp175s0f1/device/numa_node
>>>> 1
>
> Hint: grep can give you nicer output than cat:
>
> $ grep -H . /sys/class/net/*/device/numa_node

Sure:

grep -H . /sys/class/net/*/device/numa_node
/sys/class/net/enp175s0f0/device/numa_node:1
/sys/class/net/enp175s0f1/device/numa_node:1

>
>>>> ip -s -d link ls dev enp175s0f0
>>>> 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>>     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64
>>>>     numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes            packets        errors  dropped  overrun  mcast
>>>>     184142375840858      141347715974   2       2806325  0        85050528
>>>>     TX: bytes            packets        errors  dropped  carrier  collsns
>>>>     99270697277430       172227994003   0       0        0        0
>>>>
>>>> ip -s -d link ls dev enp175s0f1
>>>> 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>>     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64
>>>>     numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes            packets        errors  dropped  overrun  mcast
>>>>     99686284170801       173507590134   61      669685   0        100304421
>>>>     TX: bytes            packets        errors  dropped  carrier  collsns
>>>>     184435107970545      142383178304   0       0        0        0
>
> You have increased the default (1000) qlen to 8192, why?
I was checking whether a higher txqueuelen would change anything, but there
was no change between the 1000, 4096 and 8192 settings.

And yes, I do not use any traffic shaping there like hfsc/htb etc. - just the
default qdisc: mq root with pfifo_fast per tx queue:

tc qdisc show dev enp175s0f1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :38 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :37 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :36 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
...
...

And the vlans are noqueue:

tc -s -d qdisc show dev vlan1521
qdisc noqueue 0: root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

What is weird is that no qdisc counters are increasing there, even though
there is traffic in/out on those vlans:

ip -s -d link ls dev vlan1521
87: vlan1521@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 1521 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes            packets      errors  dropped  overrun  mcast
    562964218394         1639370761   0       0        0        0
    TX: bytes            packets      errors  dropped  carrier  collsns
    1417648713052        618271312    0       0        0        0

> What default qdisc do you run?... looking through your very detailed main
> email report (I do love the details you give!). You run
> pfifo_fast_dequeue, thus this 8192 qlen is actually having an effect.
>
> I would like to know if, and how much, qdisc_dequeue bulking is happening
> in this setup. Can you run:
>
> perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
>
> The perf-stat-hist is from Brendan Gregg's git-tree:
> https://github.com/brendangregg/perf-tools
> https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist

 ./perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
Tracing qdisc:qdisc_dequeue, power-of-2, max 8192, until Ctrl-C...
^C
             Range          : Count    Distribution
                  -> -1     : 0        |                                      |
             0 -> 0         : 43768349 |######################################|
             1 -> 1         : 43895249 |######################################|
             2 -> 3         : 352      |#                                     |
             4 -> 7         : 228      |#                                     |
             8 -> 15        : 135      |#                                     |
            16 -> 31        : 73       |#                                     |
            32 -> 63        : 7        |#                                     |
            64 -> 127       : 0        |                                      |
           128 -> 255       : 0        |                                      |
           256 -> 511       : 0        |                                      |
           512 -> 1023      : 0        |                                      |
          1024 -> 2047      : 0        |                                      |
          2048 -> 4095      : 0        |                                      |
          4096 -> 8191      : 0        |                                      |
          8192 ->           : 0        |                                      |

>>>> ./softnet.sh
>>>> cpu      total    dropped   squeezed  collision        rps flow_limit
>>>>
>>>>
>>>>    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
>>>> ------------------------------------------------------------------------------------------
>>>>
>>>>     26.78%  [kernel]       [k] queued_spin_lock_slowpath
>>>
>>> This is highly suspect.
>>>
>
> I agree! -- 26.78% spent in queued_spin_lock_slowpath. Hint: if you see
> _raw_spin_lock then it is likely not a contended lock, but if you see
> queued_spin_lock_slowpath in a perf report your workload is likely in
> trouble.
>
>>> A call graph (perf record -a -g sleep 1; perf report --stdio)
>>> would tell what is going on.
>
>> perf report:
>> https://ufile.io/rqp0h
>
> Thanks for the output (my 30" screen is just large enough to see the
> full output). Together with the flame graph, it is clear that this
> lock contention happens in the page allocator code.
>
> Section copied out:
>
> mlx5e_poll_tx_cq
> |
> --16.34%--napi_consume_skb
>           |
>           |--12.65%--__free_pages_ok
>           |          |
>           |           --11.86%--free_one_page
>           |                     |
>           |                     |--10.10%--queued_spin_lock_slowpath
>           |                     |
>           |                      --0.65%--_raw_spin_lock
>           |
>           |--1.55%--page_frag_free
>           |
>            --1.44%--skb_release_data
>
> Let me explain what (I think) happens. The mlx5 driver RX-page recycle
> mechanism is not effective in this workload, and pages have to go
> through the page allocator. The lock contention happens during the mlx5
> DMA TX completion cycle. And the page allocator cannot keep up at
> these speeds.
>
> One solution is to extend the page allocator with a bulk free API. (This
> has been on my TODO list for a long time, but I don't have a
> micro-benchmark that tricks the driver page-recycle into failing.) It
> should fit nicely, as I can see that kmem_cache_free_bulk() does get
> activated (bulk freeing SKBs), which means that the DMA TX completion does
> have a bulk of packets.
>
> We can (and should) also improve the page recycle scheme in the driver.
> After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
> page_pool, and we will (attempt to) generalize this, for both high-end
> mlx5 and more low-end ARM64 boards (macchiatobin and espressobin).
>
> The MM people are working in parallel to improve the performance of
> order-0 page returns.
> Thus, the explicit page bulk free API might
> actually become less important. I actually think (Cc.) Aaron has a
> patchset he would like you to test, which removes the (zone->)lock
> you hit in free_one_page().
>

Ok - thank you Jesper.
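
PS: To check whether the mlx5 RX-page recycle really is what fails here, one
thing I can watch under load is the driver's page-cache software counters.
The sketch below assumes the rx_cache_* statistics that recent mlx5 drivers
expose via ethtool -S - the exact counter names may differ per kernel/driver
version, so take this as an illustration rather than the exact commands:

# page-cache hit/miss counters of the mlx5 RX page recycler
ethtool -S enp175s0f0 | grep -E 'rx_cache_(reuse|full|empty|busy|waive)'
ethtool -S enp175s0f1 | grep -E 'rx_cache_(reuse|full|empty|busy|waive)'

# and re-check how much of the TX-completion path ends up in the
# page allocator / zone lock
perf record -a -g -- sleep 10
perf report --stdio --no-children | grep -B2 -A4 free_one_page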