From: Paweł Staszewski
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
Date: Thu, 1 Nov 2018 11:34:35 +0100
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, netdev, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
 Mel Gorman, Aaron Lu

On 01.11.2018 at 10:22, Jesper Dangaard Brouer wrote:
> On Wed, 31 Oct 2018 23:20:01 +0100
> Paweł Staszewski wrote:
>
>> On 31.10.2018 at 23:09, Eric Dumazet wrote:
>>> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>>>> Hi
>>>>
>>>> So maybe someone will be interested in how the Linux kernel handles
>>>> normal traffic (not pktgen :) )
>
> Pawel, is this live production traffic?

Yes, I moved the server from the testlab to production to check (risking a
little - but this is traffic switched to a backup router :) )

> I know Yoel (Cc) is very interested to know the real-life limitation of
> Linux as a router, especially with VLANs like you use.

So yes, this is real-life traffic, real users - normal mixed internet
traffic being forwarded (including ddos-es :) )

>>>> Server HW configuration:
>>>>
>>>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>>
>>>> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>>>>
>>>> Server software:
>>>>
>>>> FRR - as routing daemon
>>>>
>>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local numa node)
>>>>
>>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local numa node)
>>>>
>>>> Maximum traffic that the server can handle:
>>>>
>>>> Bandwidth
>>>>
>>>>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>   input: /proc/net/dev type: rate
>>>>   \         iface                   Rx                   Tx                Total
>>>> ==============================================================================
>>>>        enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
>>>>        enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
>>>> ------------------------------------------------------------------------------
>>>>             total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
>>>>
>
> Actually a rather impressive number for a Linux router.
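
(Side note on the "28 RSS queues bound to the local numa node" part above:
the binding was done roughly along the lines of the sketch below - this is
only an illustration, not the exact script used on this box. It assumes 28
combined channels and pins each NIC IRQ to one of the node 1 cores
14-27,42-55; the /proc/interrupts naming differs per driver, for mlx5 the
IRQs are usually named mlx5_comp* rather than after the netdev, so the
pattern has to be adjusted.)

#!/bin/bash
# illustrative sketch - pin the NIC IRQs to the NUMA-local cores
# (stop irqbalance first, or it will move them back)
dev=enp175s0f0
ethtool -L "$dev" combined 28
cpus=(14 15 16 17 18 19 20 21 22 23 24 25 26 27
      42 43 44 45 46 47 48 49 50 51 52 53 54 55)
i=0
# adjust the pattern to however the driver names its IRQs
for irq in $(awk -F: "/$dev/ {print \$1}" /proc/interrupts); do
    echo "${cpus[i % ${#cpus[@]}]}" > "/proc/irq/$irq/smp_affinity_list"
    i=$((i + 1))
done
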
>>>> Packets per second:
>>>>
>>>>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>   input: /proc/net/dev type: rate
>>>>   -         iface                   Rx                   Tx                Total
>>>> ==============================================================================
>>>>        enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>>>>        enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
>>>> ------------------------------------------------------------------------------
>>>>             total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
>>>>
>
> Average packet size:
> (28.51*10^9/8)/5248589 = 678.99 bytes
> (38.07*10^9/8)/3557944 = 1337.49 bytes
>
>>>> After reaching those limits the nics on the upstream side (more RX
>>>> traffic) start to drop packets
>>>>
>>>> I just don't understand why the server can't handle more bandwidth
>>>> (~40Gbit/s is the limit where all cpus are at 100% util) - while pps on
>>>> the RX side keeps increasing.
>>>>
>>>> I was thinking that maybe I reached some pcie x16 limit - but x16 8GT
>>>> is 126Gbit - and also when testing with pktgen I can reach more bw
>>>> and pps (like 4x more compared to normal internet traffic)
>>>>
>>>> And wondering if there is something that can be improved here.
>>>>
>>>> Some more information / counters / stats and perf top below:
>>>>
>>>> Perf top flame graph:
>>>>
>>>> https://uploadfiles.io/7zo6u
>
> Thanks a lot for the flame graph!
>
>>>> System configuration (long):
>>>>
>>>> cat /sys/devices/system/node/node1/cpulist
>>>> 14-27,42-55
>>>> cat /sys/class/net/enp175s0f0/device/numa_node
>>>> 1
>>>> cat /sys/class/net/enp175s0f1/device/numa_node
>>>> 1
>
> Hint: grep can give you nicer output than cat:
>
> $ grep -H . /sys/class/net/*/device/numa_node

Sure:

grep -H . /sys/class/net/*/device/numa_node
/sys/class/net/enp175s0f0/device/numa_node:1
/sys/class/net/enp175s0f1/device/numa_node:1

>
>>>> ip -s -d link ls dev enp175s0f0
>>>> 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>>     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64
>>>>     numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes            packets        errors  dropped  overrun  mcast
>>>>     184142375840858      141347715974   2       2806325  0        85050528
>>>>     TX: bytes            packets        errors  dropped  carrier  collsns
>>>>     99270697277430       172227994003   0       0        0        0
>>>>
>>>> ip -s -d link ls dev enp175s0f1
>>>> 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>>     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64
>>>>     numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>>     RX: bytes            packets        errors  dropped  overrun  mcast
>>>>     99686284170801       173507590134   61      669685   0        100304421
>>>>     TX: bytes            packets        errors  dropped  carrier  collsns
>>>>     184435107970545      142383178304   0       0        0        0
>
> You have increased the default (1000) qlen to 8192, why?
I was checking whether a higher txqueuelen would change anything, but there
was no change between the 1000, 4096 and 8192 settings.

And yes, I do not use any traffic shaping there like hfsc/htb etc. - just the
default qdisc: mq root with pfifo_fast per tx queue:

tc qdisc show dev enp175s0f1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :38 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :37 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :36 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
...
...

And the vlans are noqueue:

tc -s -d qdisc show dev vlan1521
qdisc noqueue 0: root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

What is weird is that no qdisc counters are increasing there, even though
there is traffic in/out on those vlans:

ip -s -d link ls dev vlan1521
87: vlan1521@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 1521 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes            packets      errors  dropped  overrun  mcast
    562964218394         1639370761   0       0        0        0
    TX: bytes            packets      errors  dropped  carrier  collsns
    1417648713052        618271312    0       0        0        0

> What default qdisc do you run?... looking through your very detailed main
> email report (I do love the details you give!). You run
> pfifo_fast_dequeue, thus this 8192 qlen is actually having an effect.
>
> I would like to know if, and how much, qdisc_dequeue bulking is happening
> in this setup. Can you run:
>
> perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
>
> The perf-stat-hist is from Brendan Gregg's git-tree:
> https://github.com/brendangregg/perf-tools
> https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist

 ./perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
Tracing qdisc:qdisc_dequeue, power-of-2, max 8192, until Ctrl-C...
^C
             Range          : Count    Distribution
                  -> -1     : 0        |                                      |
             0 -> 0         : 43768349 |######################################|
             1 -> 1         : 43895249 |######################################|
             2 -> 3         : 352      |#                                     |
             4 -> 7         : 228      |#                                     |
             8 -> 15        : 135      |#                                     |
            16 -> 31        : 73       |#                                     |
            32 -> 63        : 7        |#                                     |
            64 -> 127       : 0        |                                      |
           128 -> 255       : 0        |                                      |
           256 -> 511       : 0        |                                      |
           512 -> 1023      : 0        |                                      |
          1024 -> 2047      : 0        |                                      |
          2048 -> 4095      : 0        |                                      |
          4096 -> 8191      : 0        |                                      |
          8192 ->           : 0        |                                      |

>>>> ./softnet.sh
>>>> cpu      total    dropped   squeezed  collision        rps flow_limit
>>>>
>>>>
>>>>    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
>>>> ------------------------------------------------------------------------------------------
>>>>
>>>>     26.78%  [kernel]       [k] queued_spin_lock_slowpath
>>>
>>> This is highly suspect.
>>>
>
> I agree! -- 26.78% spent in queued_spin_lock_slowpath. Hint: if you see
> _raw_spin_lock then it is likely not a contended lock, but if you see
> queued_spin_lock_slowpath in a perf report your workload is likely in
> trouble.
>
>>> A call graph (perf record -a -g sleep 1; perf report --stdio)
>>> would tell what is going on.
>
>> perf report:
>> https://ufile.io/rqp0h
>
> Thanks for the output (my 30" screen is just large enough to see the
> full output). Together with the flame graph, it is clear that this
> lock contention happens in the page allocator code.
>
> Section copied out:
>
> mlx5e_poll_tx_cq
> |
> --16.34%--napi_consume_skb
>           |
>           |--12.65%--__free_pages_ok
>           |          |
>           |           --11.86%--free_one_page
>           |                     |
>           |                     |--10.10%--queued_spin_lock_slowpath
>           |                     |
>           |                      --0.65%--_raw_spin_lock
>           |
>           |--1.55%--page_frag_free
>           |
>            --1.44%--skb_release_data
>
> Let me explain what (I think) happens. The mlx5 driver RX-page recycle
> mechanism is not effective in this workload, and pages have to go
> through the page allocator. The lock contention happens during the mlx5
> DMA TX completion cycle. And the page allocator cannot keep up at
> these speeds.
>
> One solution is to extend the page allocator with a bulk free API. (This
> has been on my TODO list for a long time, but I don't have a
> micro-benchmark that tricks the driver page-recycle into failing.) It
> should fit nicely, as I can see that kmem_cache_free_bulk() does get
> activated (bulk freeing SKBs), which means that the DMA TX completion does
> have a bulk of packets.
>
> We can (and should) also improve the page recycle scheme in the driver.
> After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
> page_pool, and we will (attempt to) generalize this, for both high-end
> mlx5 and more low-end ARM64 boards (macchiatobin and espressobin).
>
> The MM people are working in parallel to improve the performance of
> order-0 page returns.
> Thus, the explicit page bulk free API might
> actually become less important. I actually think (Cc.) Aaron has a
> patchset he would like you to test, which removes the (zone->)lock
> you hit in free_one_page().
>

Ok - thank you Jesper.
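
PS: To check whether the mlx5 RX-page recycle really is what fails here, one
thing I can watch under load is the driver's page-cache software counters.
The sketch below assumes the rx_cache_* statistics that recent mlx5 drivers
expose via ethtool -S - the exact counter names may differ per kernel/driver
version, so take this as an illustration rather than the exact commands:

# page-cache hit/miss counters of the mlx5 RX page recycler
ethtool -S enp175s0f0 | grep -E 'rx_cache_(reuse|full|empty|busy|waive)'
ethtool -S enp175s0f1 | grep -E 'rx_cache_(reuse|full|empty|busy|waive)'

# and re-check how much of the TX-completion path ends up in the
# page allocator / zone lock
perf record -a -g -- sleep 10
perf report --stdio --no-children | grep -B2 -A4 free_one_page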