xdp-newbies.vger.kernel.org archive mirror
* Re: Multi-core scalability problems
       [not found] <VI1PR04MB3104C1D86BDC113F4AC0CF4A9E050@VI1PR04MB3104.eurprd04.prod.outlook.com>
@ 2020-10-14  8:35 ` Federico Parola
  0 siblings, 0 replies; 14+ messages in thread
From: Federico Parola @ 2020-10-14  8:35 UTC (permalink / raw)
  To: xdp-newbies

On 14/10/20 08:56, Federico Parola wrote:
> Thanks for your help!
>
> On 13/10/20 18:44, Toke Høiland-Jørgensen wrote:
>> Federico Parola<fede.parola@hotmail.it> writes:
>>
>>> Hello,
>>> I'm testing the performance of XDP when dropping packets using multiple
>>> cores and I'm getting unexpected results.
>>> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel
>>> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled),
>>> running Ubuntu server 18.04 with kernel 5.8.12.
>>> I'm using the xdp_rxq_info program from the kernel tree samples to drop
>>> packets.
>>> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps.
>>> Packets are uniformly distributed in different flows (different src
>>> port) and I use flow direction rules on the rx NIC to send these flows
>>> to different queues/cores.
>>> Here are my results:
>>>
>>> 1 FLOW:
>>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>>> XDP stats       CPU     pps         issue-pps
>>> XDP-RX CPU      0       17784270    0
>>> XDP-RX CPU      total   17784270
>>>
>>> RXQ stats       RXQ:CPU pps         issue-pps
>>> rx_queue_index    0:0   17784270    0
>>> rx_queue_index    0:sum 17784270
>>> ---
>>>
>>> 2 FLOWS:
>>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>>> XDP stats       CPU     pps         issue-pps
>>> XDP-RX CPU      0       7016363     0
>>> XDP-RX CPU      1       7017291     0
>>> XDP-RX CPU      total   14033655
>>>
>>> RXQ stats       RXQ:CPU pps         issue-pps
>>> rx_queue_index    0:0   7016366     0
>>> rx_queue_index    0:sum 7016366
>>> rx_queue_index    1:1   7017294     0
>>> rx_queue_index    1:sum 7017294
>>> ---
>>>
>>> 4 FLOWS:
>>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>>> XDP stats       CPU     pps         issue-pps
>>> XDP-RX CPU      0       2359478     0
>>> XDP-RX CPU      1       2358508     0
>>> XDP-RX CPU      2       2357042     0
>>> XDP-RX CPU      3       2355396     0
>>> XDP-RX CPU      total   9430425
>>>
>>> RXQ stats       RXQ:CPU pps         issue-pps
>>> rx_queue_index    0:0   2359474     0
>>> rx_queue_index    0:sum 2359474
>>> rx_queue_index    1:1   2358504     0
>>> rx_queue_index    1:sum 2358504
>>> rx_queue_index    2:2   2357040     0
>>> rx_queue_index    2:sum 2357040
>>> rx_queue_index    3:3   2355392     0
>>> rx_queue_index    3:sum 2355392
>>>
>>> I don't understand why the overall performance decreases with the number
>>> of cores; according to [1] I would expect it to increase until reaching
>>> a maximum value. Is there any parameter I should tune to overcome the
>>> problem?
>> Yeah, this does look a bit odd. My immediate thought is that maybe your
>> RXQs are not pinned to the cores correctly? There is nothing in
>> xdp_rxq_info that ensures this, you have to configure the IRQ affinity
>> manually. If you don't do this, I suppose the processing could be
>> bouncing around on different CPUs leading to cache line contention when
>> updating the stats map.
>>
>> You can try to look at what the actual CPU load is on each core -
>> 'mpstat -P ALL -n 1' is my goto for this.
>>
>> -Toke
>>
> I forgot to mention, I have manually configured the IRQ affinity to 
> map every queue on a different core, and running your command confirms 
> that one core per queue/flow is used.
>
>
> On 13/10/20 18:41, Jesper Dangaard Brouer wrote:
>> This is what I see with i40e:
>>
>> Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      1       8,411,547   0
>> XDP-RX CPU      2       2,804,016   0
>> XDP-RX CPU      3       2,803,600   0
>> XDP-RX CPU      4       5,608,380   0
>> XDP-RX CPU      5       13,999,125  0
>> XDP-RX CPU      total   33,626,671
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:3   2,803,600   0
>> rx_queue_index    0:sum 2,803,600
>> rx_queue_index    1:1   8,411,540   0
>> rx_queue_index    1:sum 8,411,540
>> rx_queue_index    2:2   2,804,015   0
>> rx_queue_index    2:sum 2,804,015
>> rx_queue_index    3:5   8,399,326   0
>> rx_queue_index    3:sum 8,399,326
>> rx_queue_index    4:4   5,608,372   0
>> rx_queue_index    4:sum 5,608,372
>> rx_queue_index    5:5   5,599,809   0
>> rx_queue_index    5:sum 5,599,809
>>
>> That is strange, as my results above show that it does scale in my
>> testlab on the same i40e NIC (Intel Corporation Ethernet Controller XL710
>> for 40GbE QSFP+ (rev 02)).
>>
>> Can you try to use this[2] tool:
>>   ethtool_stats.pl --dev enp101s0f0
>>
>> And notice if there are any strange counters.
>>
>>
>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl 
>>
>> My best guess is that you have Ethernet flow-control enabled.
>> Some ethtool counter might show if that is the case.
>>
> Here are the results of the tool:
>
>
> 1 FLOW:
>
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
>
> ---
>
>
> 4 FLOWS:
>
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
>
>
> So if I understand correctly, the field port.rx_dropped represents packets 
> dropped due to a lack of buffers on the NIC, while rx_dropped represents 
> packets dropped because the upper layers aren't able to process them, am I 
> right?
>
> It seems that the problem is in the NIC.
>
>
> Federico
>
I was able to scale up to 4 cores by reducing the size of the rx ring 
from 512 to 128 with

sudo ethtool -G enp101s0f0 rx 128

(Why does reducing help? I would have expected an increase to help.)

However, the problem persists when exceeding 4 flows/cores, and a further 
reduction of the ring size doesn't help.
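
(For reference, the ring-size change above amounts to the following sequence
-- a sketch; ethtool -g shows the preset maximums and the values currently
in effect:)

 # show the preset maximums and the current ring sizes
 ethtool -g enp101s0f0
 # shrink the RX ring, then re-check that the new value took effect
 sudo ethtool -G enp101s0f0 rx 128
 ethtool -g enp101s0f0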


4 FLOWS:

Running XDP on dev:enp101s0f0 (ifindex:4) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       9841972     0
XDP-RX CPU      1       9842098     0
XDP-RX CPU      2       9842010     0
XDP-RX CPU      3       9842301     0
XDP-RX CPU      total   39368383

---

6 FLOWS:

Running XDP on dev:enp101s0f0 (ifindex:4) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       4470754     0
XDP-RX CPU      1       4470224     0
XDP-RX CPU      2       4468194     0
XDP-RX CPU      3       4470562     0
XDP-RX CPU      4       4470316     0
XDP-RX CPU      5       4467888     0
XDP-RX CPU      total   26817942


Federico


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-24 13:57                   ` Federico Parola
@ 2020-10-26  8:14                     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-26  8:14 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Sat, 24 Oct 2020 15:57:50 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> > On Mon, 19 Oct 2020 17:23:18 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:  
>  >>
>  >> [...]
>  >>
> >> Hi Jesper, sorry for the late reply, these are the cache refs/misses for
> >> 4 flows and different rx ring sizes:
> >>
> >> RX 512 (9.4 Mpps dropped):
> >> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     23771011  cache-references                                (+-  0.04% )
> >>      8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
> >>
> >> RX 128 (39.4 Mpps dropped):
> >> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     68177470  cache-references                               ( +-  0.01% )
> >>        23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
> >>
> >> Reducing the size of the rx ring brings to a huge decrease in cache
> >> misses, is this the effect of DDIO turning on?  
> > 
> > Yes, exactly.
> > 
> > It is very high that 37.296 % of all cache refs is being cache-misses.
> > The number of cache-misses 8,865,698 is close to your reported 9.4
> > Mpps. Thus, seems to correlate with the idea that this is DDIO-missing
> > as you have a miss per packet.
> > 
> > I can see that you have selected a subset of the CPUs (0,1,2,13), it
> > important that this is the active CPUs.  I usually only select a
> > single/individual CPU to make sure I can reason about the numbers.
> > I've seen before that some CPUs get DDIO effect and others not, so
> > watch out for this.
> > 
> > If you add HW-counter -e instructions -e cycles to your perf stat
> > command, you will also see the instructions per cycle calculation.  You
> > should notice that the cache-miss also cause this number to be reduced,
> > as the CPUs stalls it cannot keep the CPU pipeline full/busy.
> > 
> > What kind of CPU are you using?
> > Specifically cache-sizes (use dmidecode look for "Cache Information")
> >   
> I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

Is this a NUMA system?
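
(A quick way to check that -- a sketch, with the device name from this
thread; numa_node often prints -1 when there is no NUMA affinity info:)

 cat /sys/class/net/enp101s0f0/device/numa_node
 lscpu | grep -i numa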

The numbers you report are for all cores together.  Looking at [1] and
[2], I can see this is a 14-core CPU. According to [3] the cache is:

Level 1 cache size:
	14 x 32 KB 8-way set associative instruction caches
	14 x 32 KB 8-way set associative data caches

Level 2 cache size:
 	14 x 1 MB 16-way set associative caches

Level 3 cache size
	19.25 MB 11-way set associative non-inclusive shared cache

One thing that catches my eye is the "non-inclusive" cache, and that [4]
states "rearchitected cache hierarchy designed for server workloads".



[1] https://en.wikichip.org/wiki/intel/xeon_gold/5120
[2] https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html
[3] https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%205120.html
[4] https://en.wikichip.org/wiki/intel/xeon_gold

> > The performance drop is a little too large 39.4 Mpps -> 9.4 Mpps.
> > 
> > If I were you, I would measure the speed of the memory, via using the
> > tool lmbench-3.0 command 'lat_mem_rd'.
> > 
> >   /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> > 
> > The output is the nanosec latency of accessing increasing sizes of	
> > memory.  The jumps/increases in latency should be fairly clear and
> > shows the latency of the different cache levels.  For my CPU E5-1650 v4
> > @ 3.60GHz with 15MB L2 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
> > (I could not find a tool that tells me the cost of accessing main-memory,
> > but maybe it is the 17.569ns, as the tool measurement jump from 12MB
> > (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
> > accurate L3 measurement.)
> >   
> I ran the benchmark; I can see two well-distinct jumps (L1 and L2 cache, I 
> guess) of 1.543ns and 5.400ns, but then the latency grows gradually:

I guess you left out some numbers below for the 1.543ns measurement you
mention in the text.  There is a plateau at 5.508ns, and another
plateau at 8.629ns, which could be L3?

> 0.25000 5.400
> 0.37500 5.508
> 0.50000 5.508
> 0.75000 6.603
> 1.00000 8.247
> 1.50000 8.616
> 2.00000 8.747
> 3.00000 8.629
> 4.00000 8.629
> 6.00000 8.676
> 8.00000 8.800
> 12.00000 9.119
> 16.00000 10.840
> 24.00000 16.650
> 32.00000 19.888
> 48.00000 21.582
> 64.00000 22.519
> 96.00000 23.473
> 128.00000 24.125
> 192.00000 24.777
> 256.00000 25.124
> 384.00000 25.445
> 512.00000 25.642
> 768.00000 25.775
> 1024.00000 25.869
> 1536.00000 25.942
> I can't really tell where L3 cache and main memory start.

I guess the plateau around 25.445ns is the main memory speed. 

The latency difference is very large, but the performance drop is still
too large: 39.4 Mpps -> 9.4 Mpps.  Back-of-envelope calc: 8.629ns to
25.445ns is approx a factor 3 (25.445/8.629=2.948).  9.4 Mpps x factor
is 27.7 Mpps; 39.4 Mpps / factor is 13.36 Mpps.  Meaning it doesn't
add up to explain this difference.


> One thing I forgot to mention is that I experience the same performance 
> drop even without specifying the --readmem flag of the bpf sample 
> (no_touch mode); if I'm not wrong, without the flag the eBPF program 
> should not access the packet buffer and therefore DDIO should 
> have no effect.

I was going to ask you to swap between the --readmem flag and no_touch
mode, and then measure whether the perf-stat cache-misses stay the same.
It sounds like you already did this?

The DDIO/DCA is something the CPU chooses to do, based on proprietary
design by Intel.  Thus, it is hard to say why DDIO is acting like this.
E.g. still causing a cache-miss even when using no_touch mode.
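
(For reference, the two runs being compared would look roughly like this --
a sketch; the --dev/--action/--readmem spellings are taken from this
thread's output, so double-check them against the sample's usage text:)

 # no_touch mode: the XDP program does not read packet data
 sudo ./xdp_rxq_info --dev enp101s0f0 --action XDP_DROP
 # readmem mode: the XDP program touches the packet payload
 sudo ./xdp_rxq_info --dev enp101s0f0 --action XDP_DROP --readmem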

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-19 18:26                 ` Jesper Dangaard Brouer
@ 2020-10-24 13:57                   ` Federico Parola
  2020-10-26  8:14                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-24 13:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies



On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> On Mon, 19 Oct 2020 17:23:18 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
 >>
 >> [...]
 >>
>> Hi Jesper, sorry for the late reply, these are the cache refs/misses for
>> 4 flows and different rx ring sizes:
>>
>> RX 512 (9.4 Mpps dropped):
>> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>>     23771011  cache-references                                (+-  0.04% )
>>      8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
>>
>> RX 128 (39.4 Mpps dropped):
>> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>>     68177470  cache-references                               ( +-  0.01% )
>>        23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
>>
>> Reducing the size of the rx ring brings to a huge decrease in cache
>> misses, is this the effect of DDIO turning on?
> 
> Yes, exactly.
> 
> It is very high that 37.296 % of all cache refs is being cache-misses.
> The number of cache-misses 8,865,698 is close to your reported 9.4
> Mpps. Thus, seems to correlate with the idea that this is DDIO-missing
> as you have a miss per packet.
> 
> I can see that you have selected a subset of the CPUs (0,1,2,13), it
> important that this is the active CPUs.  I usually only select a
> single/individual CPU to make sure I can reason about the numbers.
> I've seen before that some CPUs get DDIO effect and others not, so
> watch out for this.
> 
> If you add HW-counter -e instructions -e cycles to your perf stat
> command, you will also see the instructions per cycle calculation.  You
> should notice that the cache-miss also cause this number to be reduced,
> as the CPUs stalls it cannot keep the CPU pipeline full/busy.
> 
> What kind of CPU are you using?
> Specifically cache-sizes (use dmidecode look for "Cache Information")
> 
I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

> The performance drop is a little too large 39.4 Mpps -> 9.4 Mpps.
> 
> If I were you, I would measure the speed of the memory, via using the
> tool lmbench-3.0 command 'lat_mem_rd'.
> 
>   /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> 
> The output is the nanosec latency of accessing increasing sizes of	
> memory.  The jumps/increases in latency should be fairly clear and
> shows the latency of the different cache levels.  For my CPU E5-1650 v4
> @ 3.60GHz with 15MB L2 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
> (I could not find a tool that tells me the cost of accessing main-memory,
> but maybe it is the 17.569ns, as the tool measurement jump from 12MB
> (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
> accurate L3 measurement.)
> 
I ran the benchmark; I can see two well-distinct jumps (L1 and L2 cache, I 
guess) of 1.543ns and 5.400ns, but then the latency grows gradually:
0.25000 5.400
0.37500 5.508
0.50000 5.508
0.75000 6.603
1.00000 8.247
1.50000 8.616
2.00000 8.747
3.00000 8.629
4.00000 8.629
6.00000 8.676
8.00000 8.800
12.00000 9.119
16.00000 10.840
24.00000 16.650
32.00000 19.888
48.00000 21.582
64.00000 22.519
96.00000 23.473
128.00000 24.125
192.00000 24.777
256.00000 25.124
384.00000 25.445
512.00000 25.642
768.00000 25.775
1024.00000 25.869
1536.00000 25.942
I can't really tell where L3 cache and main memory start.

One thing I forgot to mention is that I experience the same performance 
drop even without specifying the --readmem flag of the bpf sample 
(no_touch mode); if I'm not wrong, without the flag the eBPF program 
should not access the packet buffer and therefore DDIO should 
have no effect.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-19 15:23               ` Federico Parola
@ 2020-10-19 18:26                 ` Jesper Dangaard Brouer
  2020-10-24 13:57                   ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-19 18:26 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Mon, 19 Oct 2020 17:23:18 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> > On Thu, 15 Oct 2020 14:04:51 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >   
> >> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:  
> >>> On Wed, 14 Oct 2020 14:17:46 +0200
> >>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>      
> >>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:  
> >>>>> On Wed, 14 Oct 2020 08:56:43 +0200
> >>>>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>>>
> >>>>> [...]  
> >>>>>>> Can you try to use this[2] tool:
> >>>>>>>      ethtool_stats.pl --dev enp101s0f0
> >>>>>>>
> >>>>>>> And notice if there are any strange counters.
> >>>>>>>
> >>>>>>>
> >>>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl  
> > [...]
> >   
> >>>> The only solution I've found so far is to reduce the size of the rx ring
> >>>> as I mentioned in my former post. However I still see a decrease in
> >>>> performance when exceeding 4 cores.  
> >>>
> >>> What is happening when you are reducing the size of the rx ring is two
> >>> things. (1) i40e driver have reuse/recycle-pages trick that get less
> >>> efficient, but because you are dropping packets early you are not
> >>> affected. (2) the total size of L3 memory you need to touch is also
> >>> decreased.
> >>>
> >>> I think you are hitting case (2).  The Intel CPU have a cool feature
> >>> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> >>> deliver packet data into L3 cache memory (if NIC is directly PCIe
> >>> connected to CPU).  The CPU is in charge when this feature is enabled,
> >>> and it will try to avoid L3 trashing and disable it in certain cases.
> >>> When you reduce the size of the rx rings, then you are also needing
> >>> less L3 cache memory, to the CPU will allow this DDIO feature.
> >>>
> >>> You can use the 'perf stat' tool to check if this is happening, by
> >>> monitoring L3 (and L2) cache usage.  
> >>
> >> What events should I monitor? LLC-load-misses/LLC-loads?  
> > 
> > Looking at my own results from xdp-paper[1], it looks like that it
> > results in real 'cache-misses' (perf stat -e cache-misses).
> > 
> > E.g I ran:
> >   sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1
> > 
> > Notice how the 'insn per cycle' gets less efficient when we experience
> > these cache-misses.
> > 
> > Also how RX-size of queues affect XDP-redirect in [2].
> > 
> > 
> > [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> > [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
> >  
> Hi Jesper, sorry for the late reply, these are the cache refs/misses for 
> 4 flows and different rx ring sizes:
> 
> RX 512 (9.4 Mpps dropped):
> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>    23771011  cache-references                                (+-  0.04% )
>     8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
> 
> RX 128 (39.4 Mpps dropped):
> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>    68177470  cache-references                               ( +-  0.01% )
>       23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
> 
> Reducing the size of the rx ring brings a huge decrease in cache 
> misses; is this the effect of DDIO turning on?

Yes, exactly.

A cache-miss rate of 37.296 % of all cache refs is very high.
The number of cache-misses, 8,865,698, is close to your reported 9.4
Mpps. Thus, it seems to correlate with the idea that this is DDIO
missing, as you have a miss per packet.

I can see that you have selected a subset of the CPUs (0,1,2,13); it is
important that these are the active CPUs.  I usually only select a
single/individual CPU to make sure I can reason about the numbers.
I've seen before that some CPUs get the DDIO effect and others not, so
watch out for this.

If you add the HW counters -e instructions -e cycles to your perf stat
command, you will also see the instructions-per-cycle calculation.  You
should notice that the cache-misses also cause this number to be reduced,
as the CPU stalls and cannot keep its pipeline full/busy.
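
(Concretely, something along these lines -- a sketch; pick the CPU number
of one of the active RX queues:)

 sudo perf stat -C 0 -e cycles -e instructions -e cache-references -e cache-misses -r 10 sleep 1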

What kind of CPU are you using?
Specifically the cache sizes (use dmidecode and look for "Cache Information").
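
(For example -- dmidecode needs root; lscpu gives a rough summary without it:)

 sudo dmidecode --type cache | grep -A5 'Cache Information'
 lscpu | grep -i cache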

The performance drop is a little too large: 39.4 Mpps -> 9.4 Mpps.

If I were you, I would measure the speed of the memory, using the
lmbench-3.0 tool's 'lat_mem_rd' command.

 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128

The output is the nanosec latency of accessing increasing sizes of
memory.  The jumps/increases in latency should be fairly clear and
show the latency of the different cache levels.  For my CPU E5-1650 v4
@ 3.60GHz with 15MB L3 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
(I could not find a tool that tells me the cost of accessing main memory,
but maybe it is the 17.569ns, as the tool's measurement jumps from 12MB
(5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
accurate L3 measurement.)
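
(It can also help to pin the measurement to one of the RX cores so the
numbers reflect the memory that core actually sees -- a sketch, same binary
path as above, assuming lmbench is installed:)

 taskset -c 0 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128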

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-15 13:22             ` Jesper Dangaard Brouer
@ 2020-10-19 15:23               ` Federico Parola
  2020-10-19 18:26                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-19 15:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies

On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> On Thu, 15 Oct 2020 14:04:51 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
> 
>> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 14:17:46 +0200
>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>    
>>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>>>> On Wed, 14 Oct 2020 08:56:43 +0200
>>>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>>>
>>>>> [...]
>>>>>>> Can you try to use this[2] tool:
>>>>>>>      ethtool_stats.pl --dev enp101s0f0
>>>>>>>
>>>>>>> And notice if there are any strange counters.
>>>>>>>
>>>>>>>
>>>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> [...]
> 
>>>> The only solution I've found so far is to reduce the size of the rx ring
>>>> as I mentioned in my former post. However I still see a decrease in
>>>> performance when exceeding 4 cores.
>>>
>>> What is happening when you are reducing the size of the rx ring is two
>>> things. (1) i40e driver have reuse/recycle-pages trick that get less
>>> efficient, but because you are dropping packets early you are not
>>> affected. (2) the total size of L3 memory you need to touch is also
>>> decreased.
>>>
>>> I think you are hitting case (2).  The Intel CPU have a cool feature
>>> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
>>> deliver packet data into L3 cache memory (if NIC is directly PCIe
>>> connected to CPU).  The CPU is in charge when this feature is enabled,
>>> and it will try to avoid L3 trashing and disable it in certain cases.
>>> When you reduce the size of the rx rings, then you are also needing
>>> less L3 cache memory, to the CPU will allow this DDIO feature.
>>>
>>> You can use the 'perf stat' tool to check if this is happening, by
>>> monitoring L3 (and L2) cache usage.
>>
>> What events should I monitor? LLC-load-misses/LLC-loads?
> 
> Looking at my own results from xdp-paper[1], it looks like that it
> results in real 'cache-misses' (perf stat -e cache-misses).
> 
> E.g I ran:
>   sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1
> 
> Notice how the 'insn per cycle' gets less efficient when we experience
> these cache-misses.
> 
> Also how RX-size of queues affect XDP-redirect in [2].
> 
> 
> [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
>
Hi Jesper, sorry for the late reply, these are the cache refs/misses for 
4 flows and different rx ring sizes:

RX 512 (9.4 Mpps dropped):
Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
   23771011  cache-references                                (+-  0.04% )
    8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )

RX 128 (39.4 Mpps dropped):
Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
   68177470  cache-references                               ( +-  0.01% )
      23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )

Reducing the size of the rx ring brings a huge decrease in cache 
misses; is this the effect of DDIO turning on?


Federico

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-15 12:04           ` Federico Parola
@ 2020-10-15 13:22             ` Jesper Dangaard Brouer
  2020-10-19 15:23               ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-15 13:22 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Thu, 15 Oct 2020 14:04:51 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
> > On Wed, 14 Oct 2020 14:17:46 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >   
> >> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:  
> >>> On Wed, 14 Oct 2020 08:56:43 +0200
> >>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>
> >>> [...]  
> >>>>> Can you try to use this[2] tool:
> >>>>>     ethtool_stats.pl --dev enp101s0f0
> >>>>>
> >>>>> And notice if there are any strange counters.
> >>>>>
> >>>>>
> >>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
[...]

> >> The only solution I've found so far is to reduce the size of the rx ring
> >> as I mentioned in my former post. However I still see a decrease in
> >> performance when exceeding 4 cores.  
> > 
> > What is happening when you are reducing the size of the rx ring is two
> > things. (1) i40e driver have reuse/recycle-pages trick that get less
> > efficient, but because you are dropping packets early you are not
> > affected. (2) the total size of L3 memory you need to touch is also
> > decreased.
> > 
> > I think you are hitting case (2).  The Intel CPU have a cool feature
> > called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> > deliver packet data into L3 cache memory (if NIC is directly PCIe
> > connected to CPU).  The CPU is in charge when this feature is enabled,
> > and it will try to avoid L3 trashing and disable it in certain cases.
> > When you reduce the size of the rx rings, then you are also needing
> > less L3 cache memory, to the CPU will allow this DDIO feature.
> > 
> > You can use the 'perf stat' tool to check if this is happening, by
> > monitoring L3 (and L2) cache usage.  
> 
> What events should I monitor? LLC-load-misses/LLC-loads?

Looking at my own results from xdp-paper[1], it looks like it
results in real 'cache-misses' (perf stat -e cache-misses).

E.g. I ran:
 sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1

Notice how the 'insn per cycle' gets less efficient when we experience
these cache-misses.

Also note how the RX size of the queues affects XDP-redirect in [2].


[1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
[2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14 14:26         ` Jesper Dangaard Brouer
@ 2020-10-15 12:04           ` Federico Parola
  2020-10-15 13:22             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-15 12:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies



On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
> On Wed, 14 Oct 2020 14:17:46 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
> 
>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 08:56:43 +0200
>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>
>>> [...]
>>>>> Can you try to use this[2] tool:
>>>>>     ethtool_stats.pl --dev enp101s0f0
>>>>>
>>>>> And notice if there are any strange counters.
>>>>>
>>>>>
>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>> My best guess is that you have Ethernet flow-control enabled.
>>>>> Some ethtool counter might show if that is the case.
>>>>>      
>>>> Here are the results of the tool:
>>>>
>>>>
>>>> 1 FLOW:
>>>>
>>>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>>>> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
>>>> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
>>>> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
>>>> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
>>>> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
>>>> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
>>>> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
>>>>
>>>> ---
>>>>
>>>>
>>>> 4 FLOWS:
>>>>
>>>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>>>> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
>>>> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
>>>> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
>>>> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
>>>> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
>>>> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
>>>> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
>>>> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
>>>> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
>>>>
>>>>
>>>> So if I understand the field port.rx_dropped represents packets dropped
>>>> due to a lack of buffer on the NIC while rx_dropped represents packets
>>>> dropped because upper layers aren't able to process them, am I right?
>>>>
>>>> It seems that the problem is in the NIC.
>>> Yes, it seems that the problem is in the NIC hardware, or config of the
>>> NIC hardware.
>>>
>>> Look at the counter "port.fdir_sb_match":
>>> - 1 flow: 35,458,700 = port.fdir_sb_match /sec
>>> - 4 flow:  9,351,001 = port.fdir_sb_match /sec
>>>
>>> I think fdir_sb translates to Flow Director Sideband filter (in the
>>> driver code this is sometimes related to "ATR" (Application Targeted
>>> Routing)). (note: I've seen fdir_match before, but not the "sb"
>>> fdir_sb_match part). This is happening inside the NIC HW/FW that does
>>> filtering on flows and make sure same-flow goes to same RX-queue number
>>> to avoid OOO packets. This looks like the limiting factor in your setup.
>>>
>>> Have you installed any filters yourself?
>>>
>>> Try to disable Flow Director:
>>>
>>>    ethtool -K ethX ntuple <on|off>
>>>   
>> Yes, I'm using flow filters to manually steer traffic to different
>> queues/cores, however disabling ntuple doesn't solve the problem, the
>> port.fdir_sb_match value disappears but the number of packets dropped in
>> port.rx_dropped stays high.
> 
> Try to disable your flow filters.  There are indications that hardware
> cannot run these filters at these speeds.

There are no changes with flow filters disabled or enabled, except for 
the presence of the port.fdir_sb_match counter. Here are the results of 
ethtool for 4 flows:

FLOW FILTERS DISABLED:
Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:   2575765457 (  2,575,765,457) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:     30718177 (     30,718,177) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     40246552 (     40,246,552) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     40246558 (     40,246,558) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:    143008276 (    143,008,276) <= rx-10.bytes /sec
Ethtool(enp101s0f0) stat:      2383471 (      2,383,471) <= rx-10.packets /sec
Ethtool(enp101s0f0) stat:    142866811 (    142,866,811) <= rx-13.bytes /sec
Ethtool(enp101s0f0) stat:      2381114 (      2,381,114) <= rx-13.packets /sec
Ethtool(enp101s0f0) stat:    142924921 (    142,924,921) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat:      2382082 (      2,382,082) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat:    142918015 (    142,918,015) <= rx-6.bytes /sec
Ethtool(enp101s0f0) stat:      2381967 (      2,381,967) <= rx-6.packets /sec
Ethtool(enp101s0f0) stat:    571723262 (    571,723,262) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:      9528721 (      9,528,721) <= rx_packets /sec
Ethtool(enp101s0f0) stat:      9528674 (      9,528,674) <= rx_unicast /sec

FLOW FILTERS ENABLED:
Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:     15810008 (     15,810,008) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat:   2634909056 (  2,634,909,056) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:     31640574 (     31,640,574) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     41170436 (     41,170,436) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     41170327 (     41,170,327) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:    143016759 (    143,016,759) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat:      2383613 (      2,383,613) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat:    142921054 (    142,921,054) <= rx-1.bytes /sec
Ethtool(enp101s0f0) stat:      2382018 (      2,382,018) <= rx-1.packets /sec
Ethtool(enp101s0f0) stat:    142943103 (    142,943,103) <= rx-2.bytes /sec
Ethtool(enp101s0f0) stat:      2382385 (      2,382,385) <= rx-2.packets /sec
Ethtool(enp101s0f0) stat:    142907586 (    142,907,586) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat:      2381793 (      2,381,793) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat:    571775035 (    571,775,035) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:      9529584 (      9,529,584) <= rx_packets /sec
Ethtool(enp101s0f0) stat:      9529673 (      9,529,673) <= rx_unicast /sec

>> The only solution I've found so far is to reduce the size of the rx ring
>> as I mentioned in my former post. However I still see a decrease in
>> performance when exceeding 4 cores.
> 
> What is happening when you are reducing the size of the rx ring is two
> things. (1) i40e driver have reuse/recycle-pages trick that get less
> efficient, but because you are dropping packets early you are not
> affected. (2) the total size of L3 memory you need to touch is also
> decreased.
> 
> I think you are hitting case (2).  The Intel CPU have a cool feature
> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> deliver packet data into L3 cache memory (if NIC is directly PCIe
> connected to CPU).  The CPU is in charge when this feature is enabled,
> and it will try to avoid L3 trashing and disable it in certain cases.
> When you reduce the size of the rx rings, then you are also needing
> less L3 cache memory, to the CPU will allow this DDIO feature.
> 
> You can use the 'perf stat' tool to check if this is happening, by
> monitoring L3 (and L2) cache usage.

What events should I monitor? LLC-load-misses/LLC-loads?
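
(A perf invocation along those lines, as a sketch -- these are standard perf
event aliases, but their availability depends on the CPU/PMU:)

 sudo perf stat -C 0 -e LLC-loads,LLC-load-misses,cache-references,cache-misses -r 5 sleep 1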

Federico

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14 12:17       ` Federico Parola
@ 2020-10-14 14:26         ` Jesper Dangaard Brouer
  2020-10-15 12:04           ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-14 14:26 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Wed, 14 Oct 2020 14:17:46 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
> > On Wed, 14 Oct 2020 08:56:43 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >
> > [...]  
> >>> Can you try to use this[2] tool:
> >>>    ethtool_stats.pl --dev enp101s0f0
> >>>
> >>> And notice if there are any strange counters.
> >>>
> >>>
> >>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >>> My best guess is that you have Ethernet flow-control enabled.
> >>> Some ethtool counter might show if that is the case.
> >>>     
> >> Here are the results of the tool:
> >>
> >>
> >> 1 FLOW:
> >>
> >> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> >> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
> >> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
> >> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
> >> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
> >> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
> >> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
> >> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
> >>
> >> ---
> >>
> >>
> >> 4 FLOWS:
> >>
> >> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> >> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
> >> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
> >> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
> >> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
> >> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
> >> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
> >> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
> >> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
> >> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
> >>
> >>
> >> So if I understand the field port.rx_dropped represents packets dropped
> >> due to a lack of buffer on the NIC while rx_dropped represents packets
> >> dropped because upper layers aren't able to process them, am I right?
> >>
> >> It seems that the problem is in the NIC.  
> > Yes, it seems that the problem is in the NIC hardware, or config of the
> > NIC hardware.
> >
> > Look at the counter "port.fdir_sb_match":
> > - 1 flow: 35,458,700 = port.fdir_sb_match /sec
> > - 4 flow:  9,351,001 = port.fdir_sb_match /sec
> >
> > I think fdir_sb translates to Flow Director Sideband filter (in the
> > driver code this is sometimes related to "ATR" (Application Targeted
> > Routing)). (note: I've seen fdir_match before, but not the "sb"
> > fdir_sb_match part). This is happening inside the NIC HW/FW that does
> > filtering on flows and make sure same-flow goes to same RX-queue number
> > to avoid OOO packets. This looks like the limiting factor in your setup.
> >
> > Have you installed any filters yourself?
> >
> > Try to disable Flow Director:
> >
> >   ethtool -K ethX ntuple <on|off>
> >  
> Yes, I'm using flow filters to manually steer traffic to different 
> queues/cores, however disabling ntuple doesn't solve the problem, the 
> port.fdir_sb_match value disappears but the number of packets dropped in 
> port.rx_dropped stays high.

Try to disable your flow filters.  There are indications that hardware
cannot run these filters at these speeds.
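
(A sketch of how the installed rules can be listed and removed one by one;
the rule ID below is only an example, the real IDs come from the list output:)

 # list the ntuple / flow-director rules currently installed
 sudo ethtool -n enp101s0f0
 # delete a single rule by the ID shown in that list
 sudo ethtool -N enp101s0f0 delete 2045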


> The only solution I've found so far is to reduce the size of the rx ring 
> as I mentioned in my former post. However I still see a decrease in 
> performance when exceeding 4 cores.

What is happening when you are reducing the size of the rx ring is two
things. (1) The i40e driver has a reuse/recycle-pages trick that gets less
efficient, but because you are dropping packets early you are not
affected. (2) The total size of L3 memory you need to touch is also
decreased.

I think you are hitting case (2).  The Intel CPU has a cool feature
called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
deliver packet data into L3 cache memory (if the NIC is directly PCIe
connected to the CPU).  The CPU is in charge when this feature is enabled,
and it will try to avoid L3 thrashing and disable it in certain cases.
When you reduce the size of the rx rings, you also need
less L3 cache memory, so the CPU will allow this DDIO feature.

You can use the 'perf stat' tool to check if this is happening, by
monitoring L3 (and L2) cache usage.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14  9:15     ` Jesper Dangaard Brouer
@ 2020-10-14 12:17       ` Federico Parola
  2020-10-14 14:26         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-14 12:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, xdp-newbies

On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
> On Wed, 14 Oct 2020 08:56:43 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
>
> [...]
>>> Can you try to use this[2] tool:
>>>    ethtool_stats.pl --dev enp101s0f0
>>>
>>> And notice if there are any strange counters.
>>>
>>>
>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>> My best guess is that you have Ethernet flow-control enabled.
>>> Some ethtool counter might show if that is the case.
>>>   
>> Here are the results of the tool:
>>
>>
>> 1 FLOW:
>>
>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
>> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
>> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
>> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
>> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
>> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
>> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
>> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
>> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
>> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
>> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
>>
>> ---
>>
>>
>> 4 FLOWS:
>>
>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
>> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
>> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
>> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
>> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
>> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
>> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
>> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
>> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
>> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
>> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
>> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
>> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
>> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
>> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
>> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
>> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
>>
>>
>> So if I understand the field port.rx_dropped represents packets dropped
>> due to a lack of buffer on the NIC while rx_dropped represents packets
>> dropped because upper layers aren't able to process them, am I right?
>>
>> It seems that the problem is in the NIC.
> Yes, it seems that the problem is in the NIC hardware, or config of the
> NIC hardware.
>
> Look at the counter "port.fdir_sb_match":
> - 1 flow: 35,458,700 = port.fdir_sb_match /sec
> - 4 flow:  9,351,001 = port.fdir_sb_match /sec
>
> I think fdir_sb translates to Flow Director Sideband filter (in the
> driver code this is sometimes related to "ATR" (Application Targeted
> Routing)). (note: I've seen fdir_match before, but not the "sb"
> fdir_sb_match part). This is happening inside the NIC HW/FW that does
> filtering on flows and make sure same-flow goes to same RX-queue number
> to avoid OOO packets. This looks like the limiting factor in your setup.
>
> Have you installed any filters yourself?
>
> Try to disable Flow Director:
>
>   ethtool -K ethX ntuple <on|off>
>
Yes, I'm using flow filters to manually steer traffic to different 
queues/cores; however, disabling ntuple doesn't solve the problem: the 
port.fdir_sb_match value disappears but the number of packets dropped in 
port.rx_dropped stays high.
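
(The kind of rule involved, as a sketch -- the source port and queue number
here are just examples, not the exact rules used in this setup:)

 # steer UDP packets with source port 2000 to RX queue 0
 sudo ethtool -N enp101s0f0 flow-type udp4 src-port 2000 action 0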

The only solution I've found so far is to reduce the size of the rx ring 
as I mentioned in my former post. However I still see a decrease in 
performance when exceeding 4 cores.


Federico


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14  6:56   ` Federico Parola
@ 2020-10-14  9:15     ` Jesper Dangaard Brouer
  2020-10-14 12:17       ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-14  9:15 UTC (permalink / raw)
  To: Federico Parola; +Cc: Toke Høiland-Jørgensen, xdp-newbies, brouer


On Wed, 14 Oct 2020 08:56:43 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

[...]
> >
> > Can you try to use this[2] tool:
> >   ethtool_stats.pl --dev enp101s0f0
> >
> > And notice if there are any strange counters.
> >
> >
> > [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> > My best guess is that you have Ethernet flow-control enabled.
> > Some ethtool counter might show if that is the case.
> >  
> Here are the results of the tool:
> 
> 
> 1 FLOW:
> 
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
> 
> ---
> 
> 
> 4 FLOWS:
> 
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
> 
> 
> So if I understand the field port.rx_dropped represents packets dropped 
> due to a lack of buffer on the NIC while rx_dropped represents packets 
> dropped because upper layers aren't able to process them, am I right?
> 
> It seems that the problem is in the NIC.

Yes, it seems that the problem is in the NIC hardware, or config of the
NIC hardware.

Look at the counter "port.fdir_sb_match":
- 1 flow: 35,458,700 = port.fdir_sb_match /sec
- 4 flow:  9,351,001 = port.fdir_sb_match /sec

I think fdir_sb translates to Flow Director Sideband filter (in the
driver code this is sometimes related to "ATR" (Application Targeted
Routing)). (Note: I've seen fdir_match before, but not the "sb"
fdir_sb_match part.) This is happening inside the NIC HW/FW, which does
filtering on flows and makes sure the same flow goes to the same RX-queue
number to avoid out-of-order (OOO) packets. This looks like the limiting
factor in your setup.

Have you installed any filters yourself?

Try to disable Flow Director:

 ethtool -K ethX ntuple <on|off>
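
(With the interface from this thread that would be the following, plus a
quick check that the feature flag actually changed -- a sketch:)

 sudo ethtool -K enp101s0f0 ntuple off
 ethtool -k enp101s0f0 | grep ntuple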

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-13 16:44 ` Toke Høiland-Jørgensen
@ 2020-10-14  6:56   ` Federico Parola
  2020-10-14  9:15     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-14  6:56 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, brouer, xdp-newbies

Thanks for your help!

On 13/10/20 18:44, Toke Høiland-Jørgensen wrote:
> Federico Parola<fede.parola@hotmail.it>  writes:
>
>> Hello,
>> I'm testing the performance of XDP when dropping packets using multiple
>> cores and I'm getting unexpected results.
>> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel
>> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled),
>> running Ubuntu server 18.04 with kernel 5.8.12.
>> I'm using the xdp_rxq_info program from the kernel tree samples to drop
>> packets.
>> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps.
>> Packets are uniformly distributed in different flows (different src
>> port) and I use flow direction rules on the rx NIC to send these flows
>> to different queues/cores.
>> Here are my results:
>>
>> 1 FLOW:
>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      0       17784270    0
>> XDP-RX CPU      total   17784270
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:0   17784270    0
>> rx_queue_index    0:sum 17784270
>> ---
>>
>> 2 FLOWS:
>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      0       7016363     0
>> XDP-RX CPU      1       7017291     0
>> XDP-RX CPU      total   14033655
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:0   7016366     0
>> rx_queue_index    0:sum 7016366
>> rx_queue_index    1:1   7017294     0
>> rx_queue_index    1:sum 7017294
>> ---
>>
>> 4 FLOWS:
>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      0       2359478     0
>> XDP-RX CPU      1       2358508     0
>> XDP-RX CPU      2       2357042     0
>> XDP-RX CPU      3       2355396     0
>> XDP-RX CPU      total   9430425
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:0   2359474     0
>> rx_queue_index    0:sum 2359474
>> rx_queue_index    1:1   2358504     0
>> rx_queue_index    1:sum 2358504
>> rx_queue_index    2:2   2357040     0
>> rx_queue_index    2:sum 2357040
>> rx_queue_index    3:3   2355392     0
>> rx_queue_index    3:sum 2355392
>>
>> I don't understand why overall performance decreases with the number
>> of cores; according to [1] I would expect it to increase until reaching
>> a maximum value. Is there any parameter I should tune to overcome the
>> problem?
> Yeah, this does look a bit odd. My immediate thought is that maybe your
> RXQs are not pinned to the cores correctly? There is nothing in
> xdp_rxq_info that ensures this, you have to configure the IRQ affinity
> manually. If you don't do this, I suppose the processing could be
> bouncing around on different CPUs leading to cache line contention when
> updating the stats map.
>
> You can try to look at what the actual CPU load is on each core -
> 'mpstat -P ALL -n 1' is my goto for this.
>
> -Toke
>
I forgot to mention that I have manually configured the IRQ affinity to
map every queue to a different core, and running your command confirms
that one core per queue/flow is used.
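
(For completeness, this is roughly how the mapping can be double-checked;
the interface name is the one from my setup:)

  # IRQs used by the NIC queues, and the CPU list each one is pinned to
  for irq in $(grep enp101s0f0 /proc/interrupts | cut -d: -f1); do
      printf 'IRQ %s -> CPU ' "$irq"
      cat /proc/irq/$irq/smp_affinity_list
  done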


On 13/10/20 18:41, Jesper Dangaard Brouer wrote:
> This is what I see with i40e:
>
> Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      1       8,411,547   0
> XDP-RX CPU      2       2,804,016   0
> XDP-RX CPU      3       2,803,600   0
> XDP-RX CPU      4       5,608,380   0
> XDP-RX CPU      5       13,999,125  0
> XDP-RX CPU      total   33,626,671
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:3   2,803,600   0
> rx_queue_index    0:sum 2,803,600
> rx_queue_index    1:1   8,411,540   0
> rx_queue_index    1:sum 8,411,540
> rx_queue_index    2:2   2,804,015   0
> rx_queue_index    2:sum 2,804,015
> rx_queue_index    3:5   8,399,326   0
> rx_queue_index    3:sum 8,399,326
> rx_queue_index    4:4   5,608,372   0
> rx_queue_index    4:sum 5,608,372
> rx_queue_index    5:5   5,599,809   0
> rx_queue_index    5:sum 5,599,809
> That is strange, as my results above show that it does scale on my
> testlab on same NIC i40e (Intel Corporation Ethernet Controller XL710
> for 40GbE QSFP+ (rev 02)).
>
> Can you try to use this[2] tool:
>   ethtool_stats.pl --dev enp101s0f0
>
> And notice if there are any strange counters.
>
>
> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> My best guess is that you have Ethernet flow-control enabled.
> Some ethtool counter might show if that is the case.
>
Here are the results of the tool:


1 FLOW:

Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec

---


4 FLOWS:

Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec


So if I understand correctly, the field port.rx_dropped represents packets
dropped due to a lack of buffers on the NIC, while rx_dropped represents
packets dropped because the upper layers aren't able to process them, am I
right?
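
(A rough way to keep an eye on just those two counters, nothing more than
plain ethtool:)

  # Sample the NIC-level and driver-level drop counters every second
  watch -n 1 'ethtool -S enp101s0f0 | grep rx_dropped'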

It seems that the problem is in the NIC.


Federico



* Re: Multi-core scalability problems
  2020-10-13 13:49 Federico Parola
  2020-10-13 16:41 ` Jesper Dangaard Brouer
@ 2020-10-13 16:44 ` Toke Høiland-Jørgensen
  2020-10-14  6:56   ` Federico Parola
  1 sibling, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-10-13 16:44 UTC (permalink / raw)
  To: Federico Parola, xdp-newbies

Federico Parola <fede.parola@hotmail.it> writes:

> Hello,
> I'm testing the performance of XDP when dropping packets using multiple 
> cores and I'm getting unexpected results.
> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel 
> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled), 
> running Ubuntu server 18.04 with kernel 5.8.12.
> I'm using the xdp_rxq_info program from the kernel tree samples to drop 
> packets.
> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps. 
> Packets are uniformly distributed in different flows (different src 
> port) and I use flow direction rules on the rx NIC to send these flows 
> to different queues/cores.
> Here are my results:
>
> 1 FLOW:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       17784270    0
> XDP-RX CPU      total   17784270
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   17784270    0
> rx_queue_index    0:sum 17784270
> ---
>
> 2 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       7016363     0
> XDP-RX CPU      1       7017291     0
> XDP-RX CPU      total   14033655
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   7016366     0
> rx_queue_index    0:sum 7016366
> rx_queue_index    1:1   7017294     0
> rx_queue_index    1:sum 7017294
> ---
>
> 4 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       2359478     0
> XDP-RX CPU      1       2358508     0
> XDP-RX CPU      2       2357042     0
> XDP-RX CPU      3       2355396     0
> XDP-RX CPU      total   9430425
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   2359474     0
> rx_queue_index    0:sum 2359474
> rx_queue_index    1:1   2358504     0
> rx_queue_index    1:sum 2358504
> rx_queue_index    2:2   2357040     0
> rx_queue_index    2:sum 2357040
> rx_queue_index    3:3   2355392     0
> rx_queue_index    3:sum 2355392
>
> I don't understand why overall performance decreases with the number
> of cores; according to [1] I would expect it to increase until reaching
> a maximum value. Is there any parameter I should tune to overcome the
> problem?

Yeah, this does look a bit odd. My immediate thought is that maybe your
RXQs are not pinned to the cores correctly? There is nothing in
xdp_rxq_info that ensures this, you have to configure the IRQ affinity
manually. If you don't do this, I suppose the processing could be
bouncing around on different CPUs leading to cache line contention when
updating the stats map.

You can try to look at what the actual CPU load is on each core -
'mpstat -P ALL -n 1' is my goto for this.
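
If you do need to pin the IRQs by hand, something along these lines
usually does it (just a sketch; it assumes the i40e IRQ names contain the
interface name and that irqbalance is stopped so it won't move them back):

  # Pin each RX/TX queue IRQ of the NIC to its own CPU (0, 1, 2, ...), as root
  i=0
  for irq in $(grep enp101s0f0 /proc/interrupts | cut -d: -f1); do
      echo $i > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done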

-Toke



* Re: Multi-core scalability problems
  2020-10-13 13:49 Federico Parola
@ 2020-10-13 16:41 ` Jesper Dangaard Brouer
  2020-10-13 16:44 ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-13 16:41 UTC (permalink / raw)
  To: Federico Parola; +Cc: brouer, xdp-newbies

On Tue, 13 Oct 2020 15:49:03 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> Hello,
> I'm testing the performance of XDP when dropping packets using multiple 
> cores and I'm getting unexpected results.
> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel 
> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled), 
> running Ubuntu server 18.04 with kernel 5.8.12.
> I'm using the xdp_rxq_info program from the kernel tree samples to drop 
> packets.
> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps. 
> Packets are uniformly distributed in different flows (different src 
> port) and I use flow direction rules on the rx NIC to send these flows 
> to different queues/cores.
> Here are my results:
> 
> 1 FLOW:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       17784270    0
> XDP-RX CPU      total   17784270
> 
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   17784270    0
> rx_queue_index    0:sum 17784270
> ---
> 
> 2 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       7016363     0
> XDP-RX CPU      1       7017291     0
> XDP-RX CPU      total   14033655
> 
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   7016366     0
> rx_queue_index    0:sum 7016366
> rx_queue_index    1:1   7017294     0
> rx_queue_index    1:sum 7017294
> ---
> 
> 4 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       2359478     0
> XDP-RX CPU      1       2358508     0
> XDP-RX CPU      2       2357042     0
> XDP-RX CPU      3       2355396     0
> XDP-RX CPU      total   9430425
> 
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   2359474     0
> rx_queue_index    0:sum 2359474
> rx_queue_index    1:1   2358504     0
> rx_queue_index    1:sum 2358504
> rx_queue_index    2:2   2357040     0
> rx_queue_index    2:sum 2357040
> rx_queue_index    3:3   2355392     0
> rx_queue_index    3:sum 2355392
> 

This is what I see with i40e:

Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      1       8,411,547   0          
XDP-RX CPU      2       2,804,016   0          
XDP-RX CPU      3       2,803,600   0          
XDP-RX CPU      4       5,608,380   0          
XDP-RX CPU      5       13,999,125  0          
XDP-RX CPU      total   33,626,671 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:3   2,803,600   0          
rx_queue_index    0:sum 2,803,600  
rx_queue_index    1:1   8,411,540   0          
rx_queue_index    1:sum 8,411,540  
rx_queue_index    2:2   2,804,015   0          
rx_queue_index    2:sum 2,804,015  
rx_queue_index    3:5   8,399,326   0          
rx_queue_index    3:sum 8,399,326  
rx_queue_index    4:4   5,608,372   0          
rx_queue_index    4:sum 5,608,372  
rx_queue_index    5:5   5,599,809   0          
rx_queue_index    5:sum 5,599,809  


> I don't understand why overall performance decreases with the number
> of cores; according to [1] I would expect it to increase until reaching
> a maximum value. Is there any parameter I should tune to overcome the
> problem?

That is strange, as my results above show that it does scale on my
testlab on same NIC i40e (Intel Corporation Ethernet Controller XL710
for 40GbE QSFP+ (rev 02)).

Can you try to use this[2] tool:
 ethtool_stats.pl --dev enp101s0f0

And notice if there are any strange counters.


[2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
 
> [1] 
> https://github.com/tohojo/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org

My best guess is that you have Ethernet flow-control enabled.
Some ethtool counter might show if that is the case. 
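
For example (a sketch, on your interface):

  # Show the current Ethernet flow-control (pause frame) settings
  ethtool -a enp101s0f0

  # If RX/TX pause is enabled, try turning it off (on both ends of the link)
  ethtool -A enp101s0f0 rx off tx off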

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Multi-core scalability problems
@ 2020-10-13 13:49 Federico Parola
  2020-10-13 16:41 ` Jesper Dangaard Brouer
  2020-10-13 16:44 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 14+ messages in thread
From: Federico Parola @ 2020-10-13 13:49 UTC (permalink / raw)
  To: xdp-newbies

Hello,
I'm testing the performance of XDP when dropping packets using multiple 
cores and I'm getting unexpected results.
My machine is equipped with a dual port Intel XL710 40 GbE and an Intel 
Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled), 
running Ubuntu server 18.04 with kernel 5.8.12.
I'm using the xdp_rxq_info program from the kernel tree samples to drop 
packets.
I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps. 
Packets are uniformly distributed in different flows (different src 
port) and I use flow direction rules on the rx NIC to send these flows 
to different queues/cores.
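
(The flow steering rules look roughly like this; the exact source ports and
queue indices below are only an example:)

  # Steer each generated UDP flow (distinct src port) to its own RX queue
  ethtool -N enp101s0f0 flow-type udp4 src-port 1000 action 0
  ethtool -N enp101s0f0 flow-type udp4 src-port 1001 action 1
  ethtool -N enp101s0f0 flow-type udp4 src-port 1002 action 2
  ethtool -N enp101s0f0 flow-type udp4 src-port 1003 action 3
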
Here are my results:

1 FLOW:
Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       17784270    0
XDP-RX CPU      total   17784270

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   17784270    0
rx_queue_index    0:sum 17784270
---

2 FLOWS:
Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       7016363     0
XDP-RX CPU      1       7017291     0
XDP-RX CPU      total   14033655

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   7016366     0
rx_queue_index    0:sum 7016366
rx_queue_index    1:1   7017294     0
rx_queue_index    1:sum 7017294
---

4 FLOWS:
Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       2359478     0
XDP-RX CPU      1       2358508     0
XDP-RX CPU      2       2357042     0
XDP-RX CPU      3       2355396     0
XDP-RX CPU      total   9430425

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   2359474     0
rx_queue_index    0:sum 2359474
rx_queue_index    1:1   2358504     0
rx_queue_index    1:sum 2358504
rx_queue_index    2:2   2357040     0
rx_queue_index    2:sum 2357040
rx_queue_index    3:3   2355392     0
rx_queue_index    3:sum 2355392

I don't understand why overall performance decreases with the number
of cores; according to [1] I would expect it to increase until reaching
a maximum value. Is there any parameter I should tune to overcome the
problem?

Thanks in advance for your help.
Federico Parola

[1] 
https://github.com/tohojo/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org


Thread overview: 14+ messages
     [not found] <VI1PR04MB3104C1D86BDC113F4AC0CF4A9E050@VI1PR04MB3104.eurprd04.prod.outlook.com>
2020-10-14  8:35 ` Multi-core scalability problems Federico Parola
2020-10-13 13:49 Federico Parola
2020-10-13 16:41 ` Jesper Dangaard Brouer
2020-10-13 16:44 ` Toke Høiland-Jørgensen
2020-10-14  6:56   ` Federico Parola
2020-10-14  9:15     ` Jesper Dangaard Brouer
2020-10-14 12:17       ` Federico Parola
2020-10-14 14:26         ` Jesper Dangaard Brouer
2020-10-15 12:04           ` Federico Parola
2020-10-15 13:22             ` Jesper Dangaard Brouer
2020-10-19 15:23               ` Federico Parola
2020-10-19 18:26                 ` Jesper Dangaard Brouer
2020-10-24 13:57                   ` Federico Parola
2020-10-26  8:14                     ` Jesper Dangaard Brouer
