From: Federico Parola <fede.parola@hotmail.it>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: xdp-newbies@vger.kernel.org
Subject: Re: Multi-core scalability problems
Date: Thu, 15 Oct 2020 14:04:51 +0200
Message-ID: <VI1PR04MB3104487E7F503BEEE13AE7999E020@VI1PR04MB3104.eurprd04.prod.outlook.com>
In-Reply-To: <20201014162636.39c2ba14@carbon>
On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
> On Wed, 14 Oct 2020 14:17:46 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
>
>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 08:56:43 +0200
>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>
>>> [...]
>>>>> Can you try to use this[2] tool:
>>>>> ethtool_stats.pl --dev enp101s0f0
>>>>>
>>>>> And notice if there are any strange counters.
>>>>>
>>>>>
>>>>> [2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>> My best guess is that you have Ethernet flow-control enabled.
>>>>> Some ethtool counter might show if that is the case.
>>>>>
>>>> Here are the results of the tool:
>>>>
>>>>
>>>> 1 FLOW:
>>>>
>>>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>>>> Ethtool(enp101s0f0) stat: 35458700 ( 35,458,700) <= port.fdir_sb_match /sec
>>>> Ethtool(enp101s0f0) stat: 2729223958 ( 2,729,223,958) <= port.rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat: 7185397 ( 7,185,397) <= port.rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat: 42644155 ( 42,644,155) <= port.rx_size_64 /sec
>>>> Ethtool(enp101s0f0) stat: 42644140 ( 42,644,140) <= port.rx_unicast /sec
>>>> Ethtool(enp101s0f0) stat: 1062159456 ( 1,062,159,456) <= rx-0.bytes /sec
>>>> Ethtool(enp101s0f0) stat: 17702658 ( 17,702,658) <= rx-0.packets /sec
>>>> Ethtool(enp101s0f0) stat: 1062155639 ( 1,062,155,639) <= rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat: 17756128 ( 17,756,128) <= rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat: 17702594 ( 17,702,594) <= rx_packets /sec
>>>> Ethtool(enp101s0f0) stat: 35458743 ( 35,458,743) <= rx_unicast /sec
>>>>
>>>> ---
>>>>
>>>>
>>>> 4 FLOWS:
>>>>
>>>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>>>> Ethtool(enp101s0f0) stat: 9351001 ( 9,351,001) <= port.fdir_sb_match /sec
>>>> Ethtool(enp101s0f0) stat: 2559136358 ( 2,559,136,358) <= port.rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat: 30635346 ( 30,635,346) <= port.rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat: 39986386 ( 39,986,386) <= port.rx_size_64 /sec
>>>> Ethtool(enp101s0f0) stat: 39986799 ( 39,986,799) <= port.rx_unicast /sec
>>>> Ethtool(enp101s0f0) stat: 140177834 ( 140,177,834) <= rx-0.bytes /sec
>>>> Ethtool(enp101s0f0) stat: 2336297 ( 2,336,297) <= rx-0.packets /sec
>>>> Ethtool(enp101s0f0) stat: 140260002 ( 140,260,002) <= rx-1.bytes /sec
>>>> Ethtool(enp101s0f0) stat: 2337667 ( 2,337,667) <= rx-1.packets /sec
>>>> Ethtool(enp101s0f0) stat: 140261431 ( 140,261,431) <= rx-2.bytes /sec
>>>> Ethtool(enp101s0f0) stat: 2337691 ( 2,337,691) <= rx-2.packets /sec
>>>> Ethtool(enp101s0f0) stat: 140175690 ( 140,175,690) <= rx-3.bytes /sec
>>>> Ethtool(enp101s0f0) stat: 2336262 ( 2,336,262) <= rx-3.packets /sec
>>>> Ethtool(enp101s0f0) stat: 560877338 ( 560,877,338) <= rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat: 3354 ( 3,354) <= rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat: 9347956 ( 9,347,956) <= rx_packets /sec
>>>> Ethtool(enp101s0f0) stat: 9351183 ( 9,351,183) <= rx_unicast /sec
>>>>
>>>>
>>>> So if I understand correctly, the field port.rx_dropped represents
>>>> packets dropped due to a lack of buffers on the NIC, while rx_dropped
>>>> represents packets dropped because the upper layers aren't able to
>>>> process them, am I right?
>>>>
>>>> It seems that the problem is in the NIC.
>>> Yes, it seems that the problem is in the NIC hardware, or in the
>>> configuration of the NIC hardware.
>>>
>>> Look at the counter "port.fdir_sb_match":
>>> - 1 flow: 35,458,700 = port.fdir_sb_match /sec
>>> - 4 flow: 9,351,001 = port.fdir_sb_match /sec
>>>
>>> I think fdir_sb translates to Flow Director Sideband filter (in the
>>> driver code this is sometimes related to "ATR", Application Targeted
>>> Routing). (Note: I've seen fdir_match before, but not the "sb" part of
>>> fdir_sb_match.) This happens inside the NIC HW/FW, which filters on
>>> flows and makes sure the same flow goes to the same RX-queue number to
>>> avoid out-of-order (OOO) packets. This looks like the limiting factor
>>> in your setup.
>>>
>>> Have you installed any filters yourself?
>>>
>>> Try to disable Flow Director:
>>>
>>> ethtool -K ethX ntuple <on|off>
>>>
>> Yes, I'm using flow filters to manually steer traffic to different
>> queues/cores. However, disabling ntuple doesn't solve the problem: the
>> port.fdir_sb_match counter disappears, but the number of packets dropped
>> in port.rx_dropped stays high.
>
> Try to disable your flow filters. There are indications that the
> hardware cannot run these filters at these speeds.
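
I toggle the feature with the command you suggested, adapted to my
interface:

   ethtool -K enp101s0f0 ntuple off   # "on" to re-enable
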
There is no difference with flow filters enabled or disabled, apart from
the presence of the port.fdir_sb_match counter. Here are the ethtool
results for 4 flows:
FLOW FILTERS DISABLED:
Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat: 2575765457 ( 2,575,765,457) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat: 30718177 ( 30,718,177) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat: 40246552 ( 40,246,552) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat: 40246558 ( 40,246,558) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat: 143008276 ( 143,008,276) <= rx-10.bytes /sec
Ethtool(enp101s0f0) stat: 2383471 ( 2,383,471) <= rx-10.packets /sec
Ethtool(enp101s0f0) stat: 142866811 ( 142,866,811) <= rx-13.bytes /sec
Ethtool(enp101s0f0) stat: 2381114 ( 2,381,114) <= rx-13.packets /sec
Ethtool(enp101s0f0) stat: 142924921 ( 142,924,921) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat: 2382082 ( 2,382,082) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat: 142918015 ( 142,918,015) <= rx-6.bytes /sec
Ethtool(enp101s0f0) stat: 2381967 ( 2,381,967) <= rx-6.packets /sec
Ethtool(enp101s0f0) stat: 571723262 ( 571,723,262) <= rx_bytes /sec
Ethtool(enp101s0f0) stat: 9528721 ( 9,528,721) <= rx_packets /sec
Ethtool(enp101s0f0) stat: 9528674 ( 9,528,674) <= rx_unicast /sec
FLOW FILTERS ENABLED:
Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat: 15810008 ( 15,810,008) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat: 2634909056 ( 2,634,909,056) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat: 31640574 ( 31,640,574) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat: 41170436 ( 41,170,436) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat: 41170327 ( 41,170,327) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat: 143016759 ( 143,016,759) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat: 2383613 ( 2,383,613) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat: 142921054 ( 142,921,054) <= rx-1.bytes /sec
Ethtool(enp101s0f0) stat: 2382018 ( 2,382,018) <= rx-1.packets /sec
Ethtool(enp101s0f0) stat: 142943103 ( 142,943,103) <= rx-2.bytes /sec
Ethtool(enp101s0f0) stat: 2382385 ( 2,382,385) <= rx-2.packets /sec
Ethtool(enp101s0f0) stat: 142907586 ( 142,907,586) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat: 2381793 ( 2,381,793) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat: 571775035 ( 571,775,035) <= rx_bytes /sec
Ethtool(enp101s0f0) stat: 9529584 ( 9,529,584) <= rx_packets /sec
Ethtool(enp101s0f0) stat: 9529673 ( 9,529,673) <= rx_unicast /sec
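
For reference, the steering filters I mentioned are plain ntuple rules,
one per RX queue; the match fields and port numbers below are only an
example, not my exact configuration:

   # steer one flow to each of the 4 queues (illustrative values)
   ethtool -U enp101s0f0 flow-type udp4 dst-port 2001 action 0
   ethtool -U enp101s0f0 flow-type udp4 dst-port 2002 action 1
   ethtool -U enp101s0f0 flow-type udp4 dst-port 2003 action 2
   ethtool -U enp101s0f0 flow-type udp4 dst-port 2004 action 3

   # list / delete installed rules
   ethtool -u enp101s0f0
   ethtool -U enp101s0f0 delete <rule-id>
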
>> The only solution I've found so far is to reduce the size of the RX
>> ring, as I mentioned in my previous post. However, I still see a
>> decrease in performance when exceeding 4 cores.
>
> Two things happen when you reduce the size of the RX ring: (1) the i40e
> driver has a reuse/recycle-pages trick that gets less efficient, but
> because you are dropping packets early you are not affected; (2) the
> total amount of L3 cache memory you need to touch is also decreased.
>
> I think you are hitting case (2). Intel CPUs have a cool feature called
> DDIO (Data-Direct I/O) or DCA (Direct Cache Access), which can deliver
> packet data directly into L3 cache memory (if the NIC is PCIe-connected
> directly to the CPU). The CPU is in charge when this feature is enabled,
> and it will try to avoid L3 thrashing by disabling it in certain cases.
> When you reduce the size of the RX rings you also need less L3 cache
> memory, so the CPU will allow the DDIO feature.
>
> You can use the 'perf stat' tool to check if this is happening, by
> monitoring L3 (and L2) cache usage.
What events should I monitor? LLC-load-misses/LLC-loads?
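I'll start with something along these lines (assuming these generic
cache events are exposed on my CPU, and with the RX queues pinned to
cores 0-3):

   # count LLC activity system-wide on the RX cores for 10 seconds
   perf stat -e LLC-loads,LLC-load-misses,cache-references,cache-misses \
             -C 0-3 -- sleep 10
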
Federico