xdp-newbies.vger.kernel.org archive mirror
* Re: Multi-core scalability problems
       [not found] <VI1PR04MB3104C1D86BDC113F4AC0CF4A9E050@VI1PR04MB3104.eurprd04.prod.outlook.com>
@ 2020-10-14  8:35 ` Federico Parola
  0 siblings, 0 replies; 14+ messages in thread
From: Federico Parola @ 2020-10-14  8:35 UTC (permalink / raw)
  To: xdp-newbies

On 14/10/20 08:56, Federico Parola wrote:
> Thanks for your help!
>
> On 13/10/20 18:44, Toke Høiland-Jørgensen wrote:
>> Federico Parola<fede.parola@hotmail.it> writes:
>>
>>> Hello,
>>> I'm testing the performance of XDP when dropping packets using multiple
>>> cores and I'm getting unexpected results.
>>> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel
>>> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled),
>>> running Ubuntu server 18.04 with kernel 5.8.12.
>>> I'm using the xdp_rxq_info program from the kernel tree samples to drop
>>> packets.
>>> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps.
>>> Packets are uniformly distributed in different flows (different src
>>> port) and I use flow direction rules on the rx NIC to send these flows
>>> to different queues/cores.
>>> Here are my results:
>>>
>>> 1 FLOW:
>>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>>> XDP stats       CPU     pps         issue-pps
>>> XDP-RX CPU      0       17784270    0
>>> XDP-RX CPU      total   17784270
>>>
>>> RXQ stats       RXQ:CPU pps         issue-pps
>>> rx_queue_index    0:0   17784270    0
>>> rx_queue_index    0:sum 17784270
>>> ---
>>>
>>> 2 FLOWS:
>>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>>> XDP stats       CPU     pps         issue-pps
>>> XDP-RX CPU      0       7016363     0
>>> XDP-RX CPU      1       7017291     0
>>> XDP-RX CPU      total   14033655
>>>
>>> RXQ stats       RXQ:CPU pps         issue-pps
>>> rx_queue_index    0:0   7016366     0
>>> rx_queue_index    0:sum 7016366
>>> rx_queue_index    1:1   7017294     0
>>> rx_queue_index    1:sum 7017294
>>> ---
>>>
>>> 4 FLOWS:
>>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>>> XDP stats       CPU     pps         issue-pps
>>> XDP-RX CPU      0       2359478     0
>>> XDP-RX CPU      1       2358508     0
>>> XDP-RX CPU      2       2357042     0
>>> XDP-RX CPU      3       2355396     0
>>> XDP-RX CPU      total   9430425
>>>
>>> RXQ stats       RXQ:CPU pps         issue-pps
>>> rx_queue_index    0:0   2359474     0
>>> rx_queue_index    0:sum 2359474
>>> rx_queue_index    1:1   2358504     0
>>> rx_queue_index    1:sum 2358504
>>> rx_queue_index    2:2   2357040     0
>>> rx_queue_index    2:sum 2357040
>>> rx_queue_index    3:3   2355392     0
>>> rx_queue_index    3:sum 2355392
>>>
>>> I don't understand why the overall performance decreases with the number
>>> of cores; according to [1] I would expect it to increase until reaching
>>> a maximum value. Is there any parameter I should tune to overcome the
>>> problem?
>> Yeah, this does look a bit odd. My immediate thought is that maybe your
>> RXQs are not pinned to the cores correctly? There is nothing in
>> xdp_rxq_info that ensures this, you have to configure the IRQ affinity
>> manually. If you don't do this, I suppose the processing could be
>> bouncing around on different CPUs leading to cache line contention when
>> updating the stats map.
>>
>> You can try to look at what the actual CPU load is on each core -
>> 'mpstat -P ALL -n 1' is my goto for this.
>>
>> -Toke
>>
> I forgot to mention, I have manually configured the IRQ affinity to 
> map every queue on a different core, and running your command confirms 
> that one core per queue/flow is used.
>
>
> On 13/10/20 18:41, Jesper Dangaard Brouer wrote:
>> This is what I see with i40e:
>>
>> Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      1       8,411,547   0
>> XDP-RX CPU      2       2,804,016   0
>> XDP-RX CPU      3       2,803,600   0
>> XDP-RX CPU      4       5,608,380   0
>> XDP-RX CPU      5       13,999,125  0
>> XDP-RX CPU      total   33,626,671
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:3   2,803,600   0
>> rx_queue_index    0:sum 2,803,600
>> rx_queue_index    1:1   8,411,540   0
>> rx_queue_index    1:sum 8,411,540
>> rx_queue_index    2:2   2,804,015   0
>> rx_queue_index    2:sum 2,804,015
>> rx_queue_index    3:5   8,399,326   0
>> rx_queue_index    3:sum 8,399,326
>> rx_queue_index    4:4   5,608,372   0
>> rx_queue_index    4:sum 5,608,372
>> rx_queue_index    5:5   5,599,809   0
>> rx_queue_index    5:sum 5,599,809
>>
>> That is strange, as my results above show that it does scale in my
>> testlab on the same i40e NIC (Intel Corporation Ethernet Controller XL710
>> for 40GbE QSFP+ (rev 02)).
>>
>> Can you try to use this[2] tool:
>>   ethtool_stats.pl --dev enp101s0f0
>>
>> And notice if there are any strange counters.
>>
>>
>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl 
>>
>> My best guess is that you have Ethernet flow-control enabled.
>> Some ethtool counter might show if that is the case.
>>
> Here are the results of the tool:
>
>
> 1 FLOW:
>
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
>
> ---
>
>
> 4 FLOWS:
>
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
>
>
> So if I understand correctly, the field port.rx_dropped represents packets 
> dropped due to a lack of buffers on the NIC, while rx_dropped represents 
> packets dropped because the upper layers aren't able to process them, am I 
> right?
>
> It seems that the problem is in the NIC.
>
>
> Federico
>
I was able to scale up to 4 cores by reducing the size of the rx ring 
from 512 to 128 with

sudo ethtool -G enp101s0f0 rx 128

(Why does reducing help? I would have expected an increase to help.)

However, the problem persists when exceeding 4 flows/cores, and a further 
reduction of the ring size doesn't help.
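
(For reference, the ring-size change above amounts to the following sequence
-- a sketch; ethtool -g shows the preset maximums and the values currently
in effect:)

 # show the preset maximums and the current ring sizes
 ethtool -g enp101s0f0
 # shrink the RX ring, then re-check that the new value took effect
 sudo ethtool -G enp101s0f0 rx 128
 ethtool -g enp101s0f0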


4 FLOWS:

Running XDP on dev:enp101s0f0 (ifindex:4) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       9841972     0
XDP-RX CPU      1       9842098     0
XDP-RX CPU      2       9842010     0
XDP-RX CPU      3       9842301     0
XDP-RX CPU      total   39368383

---

6 FLOWS:

Running XDP on dev:enp101s0f0 (ifindex:4) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       4470754     0
XDP-RX CPU      1       4470224     0
XDP-RX CPU      2       4468194     0
XDP-RX CPU      3       4470562     0
XDP-RX CPU      4       4470316     0
XDP-RX CPU      5       4467888     0
XDP-RX CPU      total   26817942


Federico


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-24 13:57                   ` Federico Parola
@ 2020-10-26  8:14                     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-26  8:14 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Sat, 24 Oct 2020 15:57:50 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> > On Mon, 19 Oct 2020 17:23:18 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:  
>  >>
>  >> [...]
>  >>
> >> Hi Jesper, sorry for the late reply, these are the cache refs/misses for
> >> 4 flows and different rx ring sizes:
> >>
> >> RX 512 (9.4 Mpps dropped):
> >> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     23771011  cache-references                                (+-  0.04% )
> >>      8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
> >>
> >> RX 128 (39.4 Mpps dropped):
> >> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     68177470  cache-references                               ( +-  0.01% )
> >>        23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
> >>
> >> Reducing the size of the rx ring brings to a huge decrease in cache
> >> misses, is this the effect of DDIO turning on?  
> > 
> > Yes, exactly.
> > 
> > It is very high that 37.296 % of all cache refs is being cache-misses.
> > The number of cache-misses 8,865,698 is close to your reported 9.4
> > Mpps. Thus, seems to correlate with the idea that this is DDIO-missing
> > as you have a miss per packet.
> > 
> > I can see that you have selected a subset of the CPUs (0,1,2,13), it
> > important that this is the active CPUs.  I usually only select a
> > single/individual CPU to make sure I can reason about the numbers.
> > I've seen before that some CPUs get DDIO effect and others not, so
> > watch out for this.
> > 
> > If you add HW-counter -e instructions -e cycles to your perf stat
> > command, you will also see the instructions per cycle calculation.  You
> > should notice that the cache-miss also cause this number to be reduced,
> > as the CPUs stalls it cannot keep the CPU pipeline full/busy.
> > 
> > What kind of CPU are you using?
> > Specifically cache-sizes (use dmidecode look for "Cache Information")
> >   
> I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

Is this a NUMA system?
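
(A quick way to check that -- a sketch, with the device name from this
thread; numa_node often prints -1 when there is no NUMA affinity info:)

 cat /sys/class/net/enp101s0f0/device/numa_node
 lscpu | grep -i numa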

The numbers you report are for all cores together.  Looking at [1] and
[2], I can see this is a 14-core CPU. According to [3] the cache is:

Level 1 cache size:
	14 x 32 KB 8-way set associative instruction caches
	14 x 32 KB 8-way set associative data caches

Level 2 cache size:
 	14 x 1 MB 16-way set associative caches

Level 3 cache size
	19.25 MB 11-way set associative non-inclusive shared cache

One thing that catches my eye is the "non-inclusive" cache, and that [4]
states "rearchitected cache hierarchy designed for server workloads".



[1] https://en.wikichip.org/wiki/intel/xeon_gold/5120
[2] https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html
[3] https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%205120.html
[4] https://en.wikichip.org/wiki/intel/xeon_gold

> > The performance drop is a little too large 39.4 Mpps -> 9.4 Mpps.
> > 
> > If I were you, I would measure the speed of the memory, via using the
> > tool lmbench-3.0 command 'lat_mem_rd'.
> > 
> >   /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> > 
> > The output is the nanosec latency of accessing increasing sizes of	
> > memory.  The jumps/increases in latency should be fairly clear and
> > shows the latency of the different cache levels.  For my CPU E5-1650 v4
> > @ 3.60GHz with 15MB L2 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
> > (I could not find a tool that tells me the cost of accessing main-memory,
> > but maybe it is the 17.569ns, as the tool measurement jump from 12MB
> > (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
> > accurate L3 measurement.)
> >   
> I ran the benchmark; I can see two well-distinct jumps (L1 and L2 cache, I 
> guess) of 1.543ns and 5.400ns, but then the latency grows gradually:

I guess you left out some numbers below for the 1.543ns measurement you
mention in the text.  There is a plateau at 5.508ns, and another
plateau at 8.629ns, which could be L3?

> 0.25000 5.400
> 0.37500 5.508
> 0.50000 5.508
> 0.75000 6.603
> 1.00000 8.247
> 1.50000 8.616
> 2.00000 8.747
> 3.00000 8.629
> 4.00000 8.629
> 6.00000 8.676
> 8.00000 8.800
> 12.00000 9.119
> 16.00000 10.840
> 24.00000 16.650
> 32.00000 19.888
> 48.00000 21.582
> 64.00000 22.519
> 96.00000 23.473
> 128.00000 24.125
> 192.00000 24.777
> 256.00000 25.124
> 384.00000 25.445
> 512.00000 25.642
> 768.00000 25.775
> 1024.00000 25.869
> 1536.00000 25.942
> I can't really tell where L3 cache and main memory start.

I guess the plateau around 25.445ns is the main memory speed. 

The latency difference is very large, but the performance drop is still
too large: 39.4 Mpps -> 9.4 Mpps.  Back-of-envelope calc: 8.629ns to
25.445ns is approx a factor 3 (25.445/8.629=2.948).  9.4 Mpps x factor
is 27.7 Mpps; 39.4 Mpps / factor is 13.36 Mpps.  Meaning it doesn't
add up to explain this difference.


> One thing I forgot to mention is that I experience the same performance 
> drop even without specifying the --readmem flag of the bpf sample 
> (no_touch mode); if I'm not wrong, without the flag the eBPF program 
> should not access the packet buffer and therefore DDIO should 
> have no effect.

I was going to ask you to swap between the --readmem flag and no_touch
mode, and then measure whether the perf-stat cache-misses stay the same.
It sounds like you already did this?

The DDIO/DCA is something the CPU chooses to do, based on proprietary
design by Intel.  Thus, it is hard to say why DDIO is acting like this.
E.g. still causing a cache-miss even when using no_touch mode.
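
(For reference, the two runs being compared would look roughly like this --
a sketch; the --dev/--action/--readmem spellings are taken from this
thread's output, so double-check them against the sample's usage text:)

 # no_touch mode: the XDP program does not read packet data
 sudo ./xdp_rxq_info --dev enp101s0f0 --action XDP_DROP
 # readmem mode: the XDP program touches the packet payload
 sudo ./xdp_rxq_info --dev enp101s0f0 --action XDP_DROP --readmem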

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-19 18:26                 ` Jesper Dangaard Brouer
@ 2020-10-24 13:57                   ` Federico Parola
  2020-10-26  8:14                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-24 13:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies



On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> On Mon, 19 Oct 2020 17:23:18 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
 >>
 >> [...]
 >>
>> Hi Jesper, sorry for the late reply, these are the cache refs/misses for
>> 4 flows and different rx ring sizes:
>>
>> RX 512 (9.4 Mpps dropped):
>> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>>     23771011  cache-references                                (+-  0.04% )
>>      8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
>>
>> RX 128 (39.4 Mpps dropped):
>> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>>     68177470  cache-references                               ( +-  0.01% )
>>        23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
>>
>> Reducing the size of the rx ring brings to a huge decrease in cache
>> misses, is this the effect of DDIO turning on?
> 
> Yes, exactly.
> 
> It is very high that 37.296 % of all cache refs is being cache-misses.
> The number of cache-misses 8,865,698 is close to your reported 9.4
> Mpps. Thus, seems to correlate with the idea that this is DDIO-missing
> as you have a miss per packet.
> 
> I can see that you have selected a subset of the CPUs (0,1,2,13), it
> important that this is the active CPUs.  I usually only select a
> single/individual CPU to make sure I can reason about the numbers.
> I've seen before that some CPUs get DDIO effect and others not, so
> watch out for this.
> 
> If you add HW-counter -e instructions -e cycles to your perf stat
> command, you will also see the instructions per cycle calculation.  You
> should notice that the cache-miss also cause this number to be reduced,
> as the CPUs stalls it cannot keep the CPU pipeline full/busy.
> 
> What kind of CPU are you using?
> Specifically cache-sizes (use dmidecode look for "Cache Information")
> 
I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

> The performance drop is a little too large 39.4 Mpps -> 9.4 Mpps.
> 
> If I were you, I would measure the speed of the memory, via using the
> tool lmbench-3.0 command 'lat_mem_rd'.
> 
>   /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> 
> The output is the nanosec latency of accessing increasing sizes of	
> memory.  The jumps/increases in latency should be fairly clear and
> shows the latency of the different cache levels.  For my CPU E5-1650 v4
> @ 3.60GHz with 15MB L2 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
> (I could not find a tool that tells me the cost of accessing main-memory,
> but maybe it is the 17.569ns, as the tool measurement jump from 12MB
> (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
> accurate L3 measurement.)
> 
I ran the benchmark; I can see two well-distinct jumps (L1 and L2 cache, I 
guess) of 1.543ns and 5.400ns, but then the latency grows gradually:
0.25000 5.400
0.37500 5.508
0.50000 5.508
0.75000 6.603
1.00000 8.247
1.50000 8.616
2.00000 8.747
3.00000 8.629
4.00000 8.629
6.00000 8.676
8.00000 8.800
12.00000 9.119
16.00000 10.840
24.00000 16.650
32.00000 19.888
48.00000 21.582
64.00000 22.519
96.00000 23.473
128.00000 24.125
192.00000 24.777
256.00000 25.124
384.00000 25.445
512.00000 25.642
768.00000 25.775
1024.00000 25.869
1536.00000 25.942
I can't really tell where L3 cache and main memory start.

One thing I forgot to mention is that I experience the same performance 
drop even without specifying the --readmem flag of the bpf sample 
(no_touch mode); if I'm not wrong, without the flag the eBPF program 
should not access the packet buffer and therefore DDIO should 
have no effect.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-19 15:23               ` Federico Parola
@ 2020-10-19 18:26                 ` Jesper Dangaard Brouer
  2020-10-24 13:57                   ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-19 18:26 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Mon, 19 Oct 2020 17:23:18 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> > On Thu, 15 Oct 2020 14:04:51 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >   
> >> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:  
> >>> On Wed, 14 Oct 2020 14:17:46 +0200
> >>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>      
> >>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:  
> >>>>> On Wed, 14 Oct 2020 08:56:43 +0200
> >>>>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>>>
> >>>>> [...]  
> >>>>>>> Can you try to use this[2] tool:
> >>>>>>>      ethtool_stats.pl --dev enp101s0f0
> >>>>>>>
> >>>>>>> And notice if there are any strange counters.
> >>>>>>>
> >>>>>>>
> >>>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl  
> > [...]
> >   
> >>>> The only solution I've found so far is to reduce the size of the rx ring
> >>>> as I mentioned in my former post. However I still see a decrease in
> >>>> performance when exceeding 4 cores.  
> >>>
> >>> What is happening when you are reducing the size of the rx ring is two
> >>> things. (1) i40e driver have reuse/recycle-pages trick that get less
> >>> efficient, but because you are dropping packets early you are not
> >>> affected. (2) the total size of L3 memory you need to touch is also
> >>> decreased.
> >>>
> >>> I think you are hitting case (2).  The Intel CPU have a cool feature
> >>> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> >>> deliver packet data into L3 cache memory (if NIC is directly PCIe
> >>> connected to CPU).  The CPU is in charge when this feature is enabled,
> >>> and it will try to avoid L3 trashing and disable it in certain cases.
> >>> When you reduce the size of the rx rings, then you are also needing
> >>> less L3 cache memory, to the CPU will allow this DDIO feature.
> >>>
> >>> You can use the 'perf stat' tool to check if this is happening, by
> >>> monitoring L3 (and L2) cache usage.  
> >>
> >> What events should I monitor? LLC-load-misses/LLC-loads?  
> > 
> > Looking at my own results from xdp-paper[1], it looks like that it
> > results in real 'cache-misses' (perf stat -e cache-misses).
> > 
> > E.g I ran:
> >   sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1
> > 
> > Notice how the 'insn per cycle' gets less efficient when we experience
> > these cache-misses.
> > 
> > Also how RX-size of queues affect XDP-redirect in [2].
> > 
> > 
> > [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> > [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
> >  
> Hi Jesper, sorry for the late reply, these are the cache refs/misses for 
> 4 flows and different rx ring sizes:
> 
> RX 512 (9.4 Mpps dropped):
> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>    23771011  cache-references                                (+-  0.04% )
>     8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
> 
> RX 128 (39.4 Mpps dropped):
> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>    68177470  cache-references                               ( +-  0.01% )
>       23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
> 
> Reducing the size of the rx ring brings a huge decrease in cache 
> misses; is this the effect of DDIO turning on?

Yes, exactly.

A cache-miss rate of 37.296 % of all cache refs is very high.
The number of cache-misses, 8,865,698, is close to your reported 9.4
Mpps. Thus, it seems to correlate with the idea that this is DDIO
missing, as you have a miss per packet.

I can see that you have selected a subset of the CPUs (0,1,2,13); it is
important that these are the active CPUs.  I usually only select a
single/individual CPU to make sure I can reason about the numbers.
I've seen before that some CPUs get the DDIO effect and others not, so
watch out for this.

If you add the HW counters -e instructions -e cycles to your perf stat
command, you will also see the instructions-per-cycle calculation.  You
should notice that the cache-misses also cause this number to be reduced,
as the CPU stalls and cannot keep its pipeline full/busy.
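
(Concretely, something along these lines -- a sketch; pick the CPU number
of one of the active RX queues:)

 sudo perf stat -C 0 -e cycles -e instructions -e cache-references -e cache-misses -r 10 sleep 1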

What kind of CPU are you using?
Specifically the cache sizes (use dmidecode and look for "Cache Information").
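
(For example -- dmidecode needs root; lscpu gives a rough summary without it:)

 sudo dmidecode --type cache | grep -A5 'Cache Information'
 lscpu | grep -i cache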

The performance drop is a little too large: 39.4 Mpps -> 9.4 Mpps.

If I were you, I would measure the speed of the memory, using the
lmbench-3.0 tool's 'lat_mem_rd' command.

 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128

The output is the nanosec latency of accessing increasing sizes of
memory.  The jumps/increases in latency should be fairly clear and
show the latency of the different cache levels.  For my CPU E5-1650 v4
@ 3.60GHz with 15MB L3 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
(I could not find a tool that tells me the cost of accessing main memory,
but maybe it is the 17.569ns, as the tool's measurement jumps from 12MB
(5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
accurate L3 measurement.)
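
(It can also help to pin the measurement to one of the RX cores so the
numbers reflect the memory that core actually sees -- a sketch, same binary
path as above, assuming lmbench is installed:)

 taskset -c 0 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128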

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-15 13:22             ` Jesper Dangaard Brouer
@ 2020-10-19 15:23               ` Federico Parola
  2020-10-19 18:26                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-19 15:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies

On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> On Thu, 15 Oct 2020 14:04:51 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
> 
>> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 14:17:46 +0200
>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>    
>>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>>>> On Wed, 14 Oct 2020 08:56:43 +0200
>>>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>>>
>>>>> [...]
>>>>>>> Can you try to use this[2] tool:
>>>>>>>      ethtool_stats.pl --dev enp101s0f0
>>>>>>>
>>>>>>> And notice if there are any strange counters.
>>>>>>>
>>>>>>>
>>>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> [...]
> 
>>>> The only solution I've found so far is to reduce the size of the rx ring
>>>> as I mentioned in my former post. However I still see a decrease in
>>>> performance when exceeding 4 cores.
>>>
>>> What is happening when you are reducing the size of the rx ring is two
>>> things. (1) i40e driver have reuse/recycle-pages trick that get less
>>> efficient, but because you are dropping packets early you are not
>>> affected. (2) the total size of L3 memory you need to touch is also
>>> decreased.
>>>
>>> I think you are hitting case (2).  The Intel CPU have a cool feature
>>> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
>>> deliver packet data into L3 cache memory (if NIC is directly PCIe
>>> connected to CPU).  The CPU is in charge when this feature is enabled,
>>> and it will try to avoid L3 trashing and disable it in certain cases.
>>> When you reduce the size of the rx rings, then you are also needing
>>> less L3 cache memory, to the CPU will allow this DDIO feature.
>>>
>>> You can use the 'perf stat' tool to check if this is happening, by
>>> monitoring L3 (and L2) cache usage.
>>
>> What events should I monitor? LLC-load-misses/LLC-loads?
> 
> Looking at my own results from xdp-paper[1], it looks like that it
> results in real 'cache-misses' (perf stat -e cache-misses).
> 
> E.g I ran:
>   sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1
> 
> Notice how the 'insn per cycle' gets less efficient when we experience
> these cache-misses.
> 
> Also how RX-size of queues affect XDP-redirect in [2].
> 
> 
> [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
>
Hi Jesper, sorry for the late reply, these are the cache refs/misses for 
4 flows and different rx ring sizes:

RX 512 (9.4 Mpps dropped):
Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
   23771011  cache-references                                (+-  0.04% )
    8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )

RX 128 (39.4 Mpps dropped):
Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
   68177470  cache-references                               ( +-  0.01% )
      23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )

Reducing the size of the rx ring brings a huge decrease in cache 
misses; is this the effect of DDIO turning on?


Federico

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-15 12:04           ` Federico Parola
@ 2020-10-15 13:22             ` Jesper Dangaard Brouer
  2020-10-19 15:23               ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-15 13:22 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Thu, 15 Oct 2020 14:04:51 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
> > On Wed, 14 Oct 2020 14:17:46 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >   
> >> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:  
> >>> On Wed, 14 Oct 2020 08:56:43 +0200
> >>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>
> >>> [...]  
> >>>>> Can you try to use this[2] tool:
> >>>>>     ethtool_stats.pl --dev enp101s0f0
> >>>>>
> >>>>> And notice if there are any strange counters.
> >>>>>
> >>>>>
> >>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
[...]

> >> The only solution I've found so far is to reduce the size of the rx ring
> >> as I mentioned in my former post. However I still see a decrease in
> >> performance when exceeding 4 cores.  
> > 
> > What is happening when you are reducing the size of the rx ring is two
> > things. (1) i40e driver have reuse/recycle-pages trick that get less
> > efficient, but because you are dropping packets early you are not
> > affected. (2) the total size of L3 memory you need to touch is also
> > decreased.
> > 
> > I think you are hitting case (2).  The Intel CPU have a cool feature
> > called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> > deliver packet data into L3 cache memory (if NIC is directly PCIe
> > connected to CPU).  The CPU is in charge when this feature is enabled,
> > and it will try to avoid L3 trashing and disable it in certain cases.
> > When you reduce the size of the rx rings, then you are also needing
> > less L3 cache memory, to the CPU will allow this DDIO feature.
> > 
> > You can use the 'perf stat' tool to check if this is happening, by
> > monitoring L3 (and L2) cache usage.  
> 
> What events should I monitor? LLC-load-misses/LLC-loads?

Looking at my own results from xdp-paper[1], it looks like it
results in real 'cache-misses' (perf stat -e cache-misses).

E.g. I ran:
 sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1

Notice how the 'insn per cycle' gets less efficient when we experience
these cache-misses.

Also note how the RX size of the queues affects XDP-redirect in [2].


[1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
[2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14 14:26         ` Jesper Dangaard Brouer
@ 2020-10-15 12:04           ` Federico Parola
  2020-10-15 13:22             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-15 12:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies



On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
> On Wed, 14 Oct 2020 14:17:46 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
> 
>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 08:56:43 +0200
>>> Federico Parola <fede.parola@hotmail.it> wrote:
>>>
>>> [...]
>>>>> Can you try to use this[2] tool:
>>>>>     ethtool_stats.pl --dev enp101s0f0
>>>>>
>>>>> And notice if there are any strange counters.
>>>>>
>>>>>
>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>> My best guess is that you have Ethernet flow-control enabled.
>>>>> Some ethtool counter might show if that is the case.
>>>>>      
>>>> Here are the results of the tool:
>>>>
>>>>
>>>> 1 FLOW:
>>>>
>>>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>>>> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
>>>> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
>>>> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
>>>> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
>>>> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
>>>> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
>>>> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
>>>>
>>>> ---
>>>>
>>>>
>>>> 4 FLOWS:
>>>>
>>>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>>>> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
>>>> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
>>>> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
>>>> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
>>>> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
>>>> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
>>>> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
>>>> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
>>>> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
>>>> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
>>>> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
>>>> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
>>>>
>>>>
>>>> So if I understand the field port.rx_dropped represents packets dropped
>>>> due to a lack of buffer on the NIC while rx_dropped represents packets
>>>> dropped because upper layers aren't able to process them, am I right?
>>>>
>>>> It seems that the problem is in the NIC.
>>> Yes, it seems that the problem is in the NIC hardware, or config of the
>>> NIC hardware.
>>>
>>> Look at the counter "port.fdir_sb_match":
>>> - 1 flow: 35,458,700 = port.fdir_sb_match /sec
>>> - 4 flow:  9,351,001 = port.fdir_sb_match /sec
>>>
>>> I think fdir_sb translates to Flow Director Sideband filter (in the
>>> driver code this is sometimes related to "ATR" (Application Targeted
>>> Routing)). (note: I've seen fdir_match before, but not the "sb"
>>> fdir_sb_match part). This is happening inside the NIC HW/FW that does
>>> filtering on flows and make sure same-flow goes to same RX-queue number
>>> to avoid OOO packets. This looks like the limiting factor in your setup.
>>>
>>> Have you installed any filters yourself?
>>>
>>> Try to disable Flow Director:
>>>
>>>    ethtool -K ethX ntuple <on|off>
>>>   
>> Yes, I'm using flow filters to manually steer traffic to different
>> queues/cores, however disabling ntuple doesn't solve the problem, the
>> port.fdir_sb_match value disappears but the number of packets dropped in
>> port.rx_dropped stays high.
> 
> Try to disable your flow filters.  There are indications that hardware
> cannot run these filters at these speeds.

There are no changes with flow filters disabled or enabled, except for 
the presence of the port.fdir_sb_match counter. Here are the results of 
ethtool for 4 flows:

FLOW FILTERS DISABLED:
Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:   2575765457 (  2,575,765,457) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:     30718177 (     30,718,177) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     40246552 (     40,246,552) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     40246558 (     40,246,558) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:    143008276 (    143,008,276) <= rx-10.bytes /sec
Ethtool(enp101s0f0) stat:      2383471 (      2,383,471) <= rx-10.packets /sec
Ethtool(enp101s0f0) stat:    142866811 (    142,866,811) <= rx-13.bytes /sec
Ethtool(enp101s0f0) stat:      2381114 (      2,381,114) <= rx-13.packets /sec
Ethtool(enp101s0f0) stat:    142924921 (    142,924,921) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat:      2382082 (      2,382,082) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat:    142918015 (    142,918,015) <= rx-6.bytes /sec
Ethtool(enp101s0f0) stat:      2381967 (      2,381,967) <= rx-6.packets /sec
Ethtool(enp101s0f0) stat:    571723262 (    571,723,262) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:      9528721 (      9,528,721) <= rx_packets /sec
Ethtool(enp101s0f0) stat:      9528674 (      9,528,674) <= rx_unicast /sec

FLOW FILTERS ENABLED:
Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:     15810008 (     15,810,008) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat:   2634909056 (  2,634,909,056) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:     31640574 (     31,640,574) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     41170436 (     41,170,436) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     41170327 (     41,170,327) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:    143016759 (    143,016,759) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat:      2383613 (      2,383,613) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat:    142921054 (    142,921,054) <= rx-1.bytes /sec
Ethtool(enp101s0f0) stat:      2382018 (      2,382,018) <= rx-1.packets /sec
Ethtool(enp101s0f0) stat:    142943103 (    142,943,103) <= rx-2.bytes /sec
Ethtool(enp101s0f0) stat:      2382385 (      2,382,385) <= rx-2.packets /sec
Ethtool(enp101s0f0) stat:    142907586 (    142,907,586) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat:      2381793 (      2,381,793) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat:    571775035 (    571,775,035) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:      9529584 (      9,529,584) <= rx_packets /sec
Ethtool(enp101s0f0) stat:      9529673 (      9,529,673) <= rx_unicast /sec

>> The only solution I've found so far is to reduce the size of the rx ring
>> as I mentioned in my former post. However I still see a decrease in
>> performance when exceeding 4 cores.
> 
> What is happening when you are reducing the size of the rx ring is two
> things. (1) i40e driver have reuse/recycle-pages trick that get less
> efficient, but because you are dropping packets early you are not
> affected. (2) the total size of L3 memory you need to touch is also
> decreased.
> 
> I think you are hitting case (2).  The Intel CPU have a cool feature
> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> deliver packet data into L3 cache memory (if NIC is directly PCIe
> connected to CPU).  The CPU is in charge when this feature is enabled,
> and it will try to avoid L3 trashing and disable it in certain cases.
> When you reduce the size of the rx rings, then you are also needing
> less L3 cache memory, to the CPU will allow this DDIO feature.
> 
> You can use the 'perf stat' tool to check if this is happening, by
> monitoring L3 (and L2) cache usage.

What events should I monitor? LLC-load-misses/LLC-loads?
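
(A perf invocation along those lines, as a sketch -- these are standard perf
event aliases, but their availability depends on the CPU/PMU:)

 sudo perf stat -C 0 -e LLC-loads,LLC-load-misses,cache-references,cache-misses -r 5 sleep 1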

Federico

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14 12:17       ` Federico Parola
@ 2020-10-14 14:26         ` Jesper Dangaard Brouer
  2020-10-15 12:04           ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-14 14:26 UTC (permalink / raw)
  To: Federico Parola; +Cc: xdp-newbies, brouer

On Wed, 14 Oct 2020 14:17:46 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
> > On Wed, 14 Oct 2020 08:56:43 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >
> > [...]  
> >>> Can you try to use this[2] tool:
> >>>    ethtool_stats.pl --dev enp101s0f0
> >>>
> >>> And notice if there are any strange counters.
> >>>
> >>>
> >>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >>> My best guess is that you have Ethernet flow-control enabled.
> >>> Some ethtool counter might show if that is the case.
> >>>     
> >> Here are the results of the tool:
> >>
> >>
> >> 1 FLOW:
> >>
> >> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> >> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
> >> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
> >> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
> >> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
> >> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
> >> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
> >> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
> >>
> >> ---
> >>
> >>
> >> 4 FLOWS:
> >>
> >> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> >> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
> >> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
> >> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
> >> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
> >> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
> >> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
> >> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
> >> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
> >> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
> >> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
> >> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
> >> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
> >>
> >>
> >> So if I understand the field port.rx_dropped represents packets dropped
> >> due to a lack of buffer on the NIC while rx_dropped represents packets
> >> dropped because upper layers aren't able to process them, am I right?
> >>
> >> It seems that the problem is in the NIC.  
> > Yes, it seems that the problem is in the NIC hardware, or config of the
> > NIC hardware.
> >
> > Look at the counter "port.fdir_sb_match":
> > - 1 flow: 35,458,700 = port.fdir_sb_match /sec
> > - 4 flow:  9,351,001 = port.fdir_sb_match /sec
> >
> > I think fdir_sb translates to Flow Director Sideband filter (in the
> > driver code this is sometimes related to "ATR" (Application Targeted
> > Routing)). (note: I've seen fdir_match before, but not the "sb"
> > fdir_sb_match part). This is happening inside the NIC HW/FW that does
> > filtering on flows and make sure same-flow goes to same RX-queue number
> > to avoid OOO packets. This looks like the limiting factor in your setup.
> >
> > Have you installed any filters yourself?
> >
> > Try to disable Flow Director:
> >
> >   ethtool -K ethX ntuple <on|off>
> >  
> Yes, I'm using flow filters to manually steer traffic to different 
> queues/cores, however disabling ntuple doesn't solve the problem, the 
> port.fdir_sb_match value disappears but the number of packets dropped in 
> port.rx_dropped stays high.

Try to disable your flow filters.  There are indications that hardware
cannot run these filters at these speeds.
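
(A sketch of how the installed rules can be listed and removed one by one;
the rule ID below is only an example, the real IDs come from the list output:)

 # list the ntuple / flow-director rules currently installed
 sudo ethtool -n enp101s0f0
 # delete a single rule by the ID shown in that list
 sudo ethtool -N enp101s0f0 delete 2045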


> The only solution I've found so far is to reduce the size of the rx ring 
> as I mentioned in my former post. However I still see a decrease in 
> performance when exceeding 4 cores.

What is happening when you are reducing the size of the rx ring is two
things. (1) The i40e driver has a reuse/recycle-pages trick that gets less
efficient, but because you are dropping packets early you are not
affected. (2) The total size of L3 memory you need to touch is also
decreased.

I think you are hitting case (2).  The Intel CPU has a cool feature
called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
deliver packet data into L3 cache memory (if the NIC is directly PCIe
connected to the CPU).  The CPU is in charge when this feature is enabled,
and it will try to avoid L3 thrashing and disable it in certain cases.
When you reduce the size of the rx rings, you also need
less L3 cache memory, so the CPU will allow this DDIO feature.

You can use the 'perf stat' tool to check if this is happening, by
monitoring L3 (and L2) cache usage.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14  9:15     ` Jesper Dangaard Brouer
@ 2020-10-14 12:17       ` Federico Parola
  2020-10-14 14:26         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-14 12:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, xdp-newbies

On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
> On Wed, 14 Oct 2020 08:56:43 +0200
> Federico Parola <fede.parola@hotmail.it> wrote:
>
> [...]
>>> Can you try to use this[2] tool:
>>>    ethtool_stats.pl --dev enp101s0f0
>>>
>>> And notice if there are any strange counters.
>>>
>>>
>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>> My best guess is that you have Ethernet flow-control enabled.
>>> Some ethtool counter might show if that is the case.
>>>   
>> Here are the results of the tool:
>>
>>
>> 1 FLOW:
>>
>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
>> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
>> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
>> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
>> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
>> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
>> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
>> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
>> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
>> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
>> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
>>
>> ---
>>
>>
>> 4 FLOWS:
>>
>> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
>> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
>> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
>> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
>> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
>> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
>> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
>> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
>> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
>> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
>> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
>> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
>> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
>> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
>> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
>> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
>> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
>> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
>>
>>
>> So if I understand the field port.rx_dropped represents packets dropped
>> due to a lack of buffer on the NIC while rx_dropped represents packets
>> dropped because upper layers aren't able to process them, am I right?
>>
>> It seems that the problem is in the NIC.
> Yes, it seems that the problem is in the NIC hardware, or config of the
> NIC hardware.
>
> Look at the counter "port.fdir_sb_match":
> - 1 flow: 35,458,700 = port.fdir_sb_match /sec
> - 4 flow:  9,351,001 = port.fdir_sb_match /sec
>
> I think fdir_sb translates to Flow Director Sideband filter (in the
> driver code this is sometimes related to "ATR" (Application Targeted
> Routing)). (note: I've seen fdir_match before, but not the "sb"
> fdir_sb_match part). This is happening inside the NIC HW/FW that does
> filtering on flows and make sure same-flow goes to same RX-queue number
> to avoid OOO packets. This looks like the limiting factor in your setup.
>
> Have you installed any filters yourself?
>
> Try to disable Flow Director:
>
>   ethtool -K ethX ntuple <on|off>
>
Yes, I'm using flow filters to manually steer traffic to different 
queues/cores; however, disabling ntuple doesn't solve the problem: the 
port.fdir_sb_match value disappears but the number of packets dropped in 
port.rx_dropped stays high.
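
(The kind of rule involved, as a sketch -- the source port and queue number
here are just examples, not the exact rules used in this setup:)

 # steer UDP packets with source port 2000 to RX queue 0
 sudo ethtool -N enp101s0f0 flow-type udp4 src-port 2000 action 0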

The only solution I've found so far is to reduce the size of the rx ring 
as I mentioned in my former post. However I still see a decrease in 
performance when exceeding 4 cores.


Federico


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-14  6:56   ` Federico Parola
@ 2020-10-14  9:15     ` Jesper Dangaard Brouer
  2020-10-14 12:17       ` Federico Parola
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-14  9:15 UTC (permalink / raw)
  To: Federico Parola; +Cc: Toke Høiland-Jørgensen, xdp-newbies, brouer


On Wed, 14 Oct 2020 08:56:43 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

[...]
> >
> > Can you try to use this[2] tool:
> >   ethtool_stats.pl --dev enp101s0f0
> >
> > And notice if there are any strange counters.
> >
> >
> > [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> > My best guess is that you have Ethernet flow-control enabled.
> > Some ethtool counter might show if that is the case.
> >  
> Here are the results of the tool:
> 
> 
> 1 FLOW:
> 
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec
> 
> ---
> 
> 
> 4 FLOWS:
> 
> Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
> Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
> Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
> Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
> Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
> Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
> Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
> Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
> Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
> Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
> Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
> Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
> Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
> Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
> Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
> Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
> Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
> Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec
> 
> 
> So if I understand the field port.rx_dropped represents packets dropped 
> due to a lack of buffer on the NIC while rx_dropped represents packets 
> dropped because upper layers aren't able to process them, am I right?
> 
> It seems that the problem is in the NIC.

Yes, it seems that the problem is in the NIC hardware, or config of the
NIC hardware.

Look at the counter "port.fdir_sb_match":
- 1 flow: 35,458,700 = port.fdir_sb_match /sec
- 4 flow:  9,351,001 = port.fdir_sb_match /sec

I think fdir_sb translates to Flow Director Sideband filter (in the
driver code this is sometimes related to "ATR" (Application Targeted
Routing)). (Note: I've seen fdir_match before, but not the "sb"
fdir_sb_match part.) This is happening inside the NIC HW/FW, which does
filtering on flows and makes sure the same flow goes to the same RX-queue
number to avoid out-of-order (OOO) packets. This looks like the limiting
factor in your setup.

Have you installed any filters yourself?

Try to disable Flow Director:

 ethtool -K ethX ntuple <on|off>
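
(With the interface from this thread that would be the following, plus a
quick check that the feature flag actually changed -- a sketch:)

 sudo ethtool -K enp101s0f0 ntuple off
 ethtool -k enp101s0f0 | grep ntuple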

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Multi-core scalability problems
  2020-10-13 16:44 ` Toke Høiland-Jørgensen
@ 2020-10-14  6:56   ` Federico Parola
  2020-10-14  9:15     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Federico Parola @ 2020-10-14  6:56 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, brouer, xdp-newbies

Thanks for your help!

On 13/10/20 18:44, Toke Høiland-Jørgensen wrote:
> Federico Parola<fede.parola@hotmail.it>  writes:
>
>> Hello,
>> I'm testing the performance of XDP when dropping packets using multiple
>> cores and I'm getting unexpected results.
>> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel
>> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled),
>> running Ubuntu server 18.04 with kernel 5.8.12.
>> I'm using the xdp_rxq_info program from the kernel tree samples to drop
>> packets.
>> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps.
>> Packets are uniformly distributed in different flows (different src
>> port) and I use flow direction rules on the rx NIC to send these flows
>> to different queues/cores.
>> Here are my results:
>>
>> 1 FLOW:
>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      0       17784270    0
>> XDP-RX CPU      total   17784270
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:0   17784270    0
>> rx_queue_index    0:sum 17784270
>> ---
>>
>> 2 FLOWS:
>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      0       7016363     0
>> XDP-RX CPU      1       7017291     0
>> XDP-RX CPU      total   14033655
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:0   7016366     0
>> rx_queue_index    0:sum 7016366
>> rx_queue_index    1:1   7017294     0
>> rx_queue_index    1:sum 7017294
>> ---
>>
>> 4 FLOWS:
>> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      0       2359478     0
>> XDP-RX CPU      1       2358508     0
>> XDP-RX CPU      2       2357042     0
>> XDP-RX CPU      3       2355396     0
>> XDP-RX CPU      total   9430425
>>
>> RXQ stats       RXQ:CPU pps         issue-pps
>> rx_queue_index    0:0   2359474     0
>> rx_queue_index    0:sum 2359474
>> rx_queue_index    1:1   2358504     0
>> rx_queue_index    1:sum 2358504
>> rx_queue_index    2:2   2357040     0
>> rx_queue_index    2:sum 2357040
>> rx_queue_index    3:3   2355392     0
>> rx_queue_index    3:sum 2355392
>>
>> I don't understand why overall performance decreases with the number
>> of cores; according to [1] I would expect it to increase until reaching
>> a maximum value. Is there any parameter I should tune to overcome the
>> problem?
> Yeah, this does look a bit odd. My immediate thought is that maybe your
> RXQs are not pinned to the cores correctly? There is nothing in
> xdp_rxq_info that ensures this, you have to configure the IRQ affinity
> manually. If you don't do this, I suppose the processing could be
> bouncing around on different CPUs leading to cache line contention when
> updating the stats map.
>
> You can try to look at what the actual CPU load is on each core -
> 'mpstat -P ALL -n 1' is my goto for this.
>
> -Toke
>
I forgot to mention that I have manually configured the IRQ affinity to
map every queue to a different core, and running your command confirms
that one core per queue/flow is used.
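
(For completeness, this is roughly how the mapping can be double-checked;
the interface name is the one from my setup:)

  # IRQs used by the NIC queues, and the CPU list each one is pinned to
  for irq in $(grep enp101s0f0 /proc/interrupts | cut -d: -f1); do
      printf 'IRQ %s -> CPU ' "$irq"
      cat /proc/irq/$irq/smp_affinity_list
  done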


On 13/10/20 18:41, Jesper Dangaard Brouer wrote:
> This is what I see with i40e:
>
> Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      1       8,411,547   0
> XDP-RX CPU      2       2,804,016   0
> XDP-RX CPU      3       2,803,600   0
> XDP-RX CPU      4       5,608,380   0
> XDP-RX CPU      5       13,999,125  0
> XDP-RX CPU      total   33,626,671
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:3   2,803,600   0
> rx_queue_index    0:sum 2,803,600
> rx_queue_index    1:1   8,411,540   0
> rx_queue_index    1:sum 8,411,540
> rx_queue_index    2:2   2,804,015   0
> rx_queue_index    2:sum 2,804,015
> rx_queue_index    3:5   8,399,326   0
> rx_queue_index    3:sum 8,399,326
> rx_queue_index    4:4   5,608,372   0
> rx_queue_index    4:sum 5,608,372
> rx_queue_index    5:5   5,599,809   0
> rx_queue_index    5:sum 5,599,809
> That is strange, as my results above show that it does scale on my
> testlab on same NIC i40e (Intel Corporation Ethernet Controller XL710
> for 40GbE QSFP+ (rev 02)).
>
> Can you try to use this[2] tool:
>   ethtool_stats.pl --dev enp101s0f0
>
> And notice if there are any strange counters.
>
>
> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> My best guess is that you have Ethernet flow-control enabled.
> Some ethtool counter might show if that is the case.
>
Here are the results of the tool:


1 FLOW:

Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:     35458700 (     35,458,700) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat:   2729223958 (  2,729,223,958) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:      7185397 (      7,185,397) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     42644155 (     42,644,155) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     42644140 (     42,644,140) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:   1062159456 (  1,062,159,456) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat:     17702658 (     17,702,658) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat:   1062155639 (  1,062,155,639) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:     17756128 (     17,756,128) <= rx_dropped /sec
Ethtool(enp101s0f0) stat:     17702594 (     17,702,594) <= rx_packets /sec
Ethtool(enp101s0f0) stat:     35458743 (     35,458,743) <= rx_unicast /sec

---


4 FLOWS:

Show adapter(s) (enp101s0f0) statistics (ONLY that changed!)
Ethtool(enp101s0f0) stat:      9351001 (      9,351,001) <= port.fdir_sb_match /sec
Ethtool(enp101s0f0) stat:   2559136358 (  2,559,136,358) <= port.rx_bytes /sec
Ethtool(enp101s0f0) stat:     30635346 (     30,635,346) <= port.rx_dropped /sec
Ethtool(enp101s0f0) stat:     39986386 (     39,986,386) <= port.rx_size_64 /sec
Ethtool(enp101s0f0) stat:     39986799 (     39,986,799) <= port.rx_unicast /sec
Ethtool(enp101s0f0) stat:    140177834 (    140,177,834) <= rx-0.bytes /sec
Ethtool(enp101s0f0) stat:      2336297 (      2,336,297) <= rx-0.packets /sec
Ethtool(enp101s0f0) stat:    140260002 (    140,260,002) <= rx-1.bytes /sec
Ethtool(enp101s0f0) stat:      2337667 (      2,337,667) <= rx-1.packets /sec
Ethtool(enp101s0f0) stat:    140261431 (    140,261,431) <= rx-2.bytes /sec
Ethtool(enp101s0f0) stat:      2337691 (      2,337,691) <= rx-2.packets /sec
Ethtool(enp101s0f0) stat:    140175690 (    140,175,690) <= rx-3.bytes /sec
Ethtool(enp101s0f0) stat:      2336262 (      2,336,262) <= rx-3.packets /sec
Ethtool(enp101s0f0) stat:    560877338 (    560,877,338) <= rx_bytes /sec
Ethtool(enp101s0f0) stat:         3354 (          3,354) <= rx_dropped /sec
Ethtool(enp101s0f0) stat:      9347956 (      9,347,956) <= rx_packets /sec
Ethtool(enp101s0f0) stat:      9351183 (      9,351,183) <= rx_unicast /sec


So if I understand correctly, the field port.rx_dropped represents packets
dropped due to a lack of buffers on the NIC, while rx_dropped represents
packets dropped because the upper layers aren't able to process them, am I
right?
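
(A rough way to keep an eye on just those two counters, nothing more than
plain ethtool:)

  # Sample the NIC-level and driver-level drop counters every second
  watch -n 1 'ethtool -S enp101s0f0 | grep rx_dropped'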

It seems that the problem is in the NIC.


Federico



* Re: Multi-core scalability problems
  2020-10-13 13:49 Federico Parola
  2020-10-13 16:41 ` Jesper Dangaard Brouer
@ 2020-10-13 16:44 ` Toke Høiland-Jørgensen
  2020-10-14  6:56   ` Federico Parola
  1 sibling, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-10-13 16:44 UTC (permalink / raw)
  To: Federico Parola, xdp-newbies

Federico Parola <fede.parola@hotmail.it> writes:

> Hello,
> I'm testing the performance of XDP when dropping packets using multiple 
> cores and I'm getting unexpected results.
> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel 
> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled), 
> running Ubuntu server 18.04 with kernel 5.8.12.
> I'm using the xdp_rxq_info program from the kernel tree samples to drop 
> packets.
> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps. 
> Packets are uniformly distributed in different flows (different src 
> port) and I use flow direction rules on the rx NIC to send these flows 
> to different queues/cores.
> Here are my results:
>
> 1 FLOW:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       17784270    0
> XDP-RX CPU      total   17784270
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   17784270    0
> rx_queue_index    0:sum 17784270
> ---
>
> 2 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       7016363     0
> XDP-RX CPU      1       7017291     0
> XDP-RX CPU      total   14033655
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   7016366     0
> rx_queue_index    0:sum 7016366
> rx_queue_index    1:1   7017294     0
> rx_queue_index    1:sum 7017294
> ---
>
> 4 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       2359478     0
> XDP-RX CPU      1       2358508     0
> XDP-RX CPU      2       2357042     0
> XDP-RX CPU      3       2355396     0
> XDP-RX CPU      total   9430425
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   2359474     0
> rx_queue_index    0:sum 2359474
> rx_queue_index    1:1   2358504     0
> rx_queue_index    1:sum 2358504
> rx_queue_index    2:2   2357040     0
> rx_queue_index    2:sum 2357040
> rx_queue_index    3:3   2355392     0
> rx_queue_index    3:sum 2355392
>
> I don't understand why overall performance decreases with the number
> of cores; according to [1] I would expect it to increase until reaching
> a maximum value. Is there any parameter I should tune to overcome the
> problem?

Yeah, this does look a bit odd. My immediate thought is that maybe your
RXQs are not pinned to the cores correctly? There is nothing in
xdp_rxq_info that ensures this, you have to configure the IRQ affinity
manually. If you don't do this, I suppose the processing could be
bouncing around on different CPUs leading to cache line contention when
updating the stats map.

You can try to look at what the actual CPU load is on each core -
'mpstat -P ALL -n 1' is my goto for this.
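
If you do need to pin the IRQs by hand, something along these lines
usually does it (just a sketch; it assumes the i40e IRQ names contain the
interface name and that irqbalance is stopped so it won't move them back):

  # Pin each RX/TX queue IRQ of the NIC to its own CPU (0, 1, 2, ...), as root
  i=0
  for irq in $(grep enp101s0f0 /proc/interrupts | cut -d: -f1); do
      echo $i > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done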

-Toke



* Re: Multi-core scalability problems
  2020-10-13 13:49 Federico Parola
@ 2020-10-13 16:41 ` Jesper Dangaard Brouer
  2020-10-13 16:44 ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-13 16:41 UTC (permalink / raw)
  To: Federico Parola; +Cc: brouer, xdp-newbies

On Tue, 13 Oct 2020 15:49:03 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> Hello,
> I'm testing the performance of XDP when dropping packets using multiple 
> cores and I'm getting unexpected results.
> My machine is equipped with a dual port Intel XL710 40 GbE and an Intel 
> Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled), 
> running Ubuntu server 18.04 with kernel 5.8.12.
> I'm using the xdp_rxq_info program from the kernel tree samples to drop 
> packets.
> I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps. 
> Packets are uniformly distributed in different flows (different src 
> port) and I use flow direction rules on the rx NIC to send these flows 
> to different queues/cores.
> Here are my results:
> 
> 1 FLOW:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       17784270    0
> XDP-RX CPU      total   17784270
> 
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   17784270    0
> rx_queue_index    0:sum 17784270
> ---
> 
> 2 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       7016363     0
> XDP-RX CPU      1       7017291     0
> XDP-RX CPU      total   14033655
> 
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   7016366     0
> rx_queue_index    0:sum 7016366
> rx_queue_index    1:1   7017294     0
> rx_queue_index    1:sum 7017294
> ---
> 
> 4 FLOWS:
> Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       2359478     0
> XDP-RX CPU      1       2358508     0
> XDP-RX CPU      2       2357042     0
> XDP-RX CPU      3       2355396     0
> XDP-RX CPU      total   9430425
> 
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   2359474     0
> rx_queue_index    0:sum 2359474
> rx_queue_index    1:1   2358504     0
> rx_queue_index    1:sum 2358504
> rx_queue_index    2:2   2357040     0
> rx_queue_index    2:sum 2357040
> rx_queue_index    3:3   2355392     0
> rx_queue_index    3:sum 2355392
> 

This is what I see with i40e:

Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      1       8,411,547   0          
XDP-RX CPU      2       2,804,016   0          
XDP-RX CPU      3       2,803,600   0          
XDP-RX CPU      4       5,608,380   0          
XDP-RX CPU      5       13,999,125  0          
XDP-RX CPU      total   33,626,671 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:3   2,803,600   0          
rx_queue_index    0:sum 2,803,600  
rx_queue_index    1:1   8,411,540   0          
rx_queue_index    1:sum 8,411,540  
rx_queue_index    2:2   2,804,015   0          
rx_queue_index    2:sum 2,804,015  
rx_queue_index    3:5   8,399,326   0          
rx_queue_index    3:sum 8,399,326  
rx_queue_index    4:4   5,608,372   0          
rx_queue_index    4:sum 5,608,372  
rx_queue_index    5:5   5,599,809   0          
rx_queue_index    5:sum 5,599,809  


> I don't understand why overall performance decreases with the number
> of cores; according to [1] I would expect it to increase until reaching
> a maximum value. Is there any parameter I should tune to overcome the
> problem?

That is strange, as my results above show that it does scale on my
testlab on same NIC i40e (Intel Corporation Ethernet Controller XL710
for 40GbE QSFP+ (rev 02)).

Can you try to use this[2] tool:
 ethtool_stats.pl --dev enp101s0f0

And notice if there are any strange counters.


[2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
 
> [1] 
> https://github.com/tohojo/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org

My best guess is that you have Ethernet flow-control enabled.
Some ethtool counter might show if that is the case. 
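
For example (a sketch, on your interface):

  # Show the current Ethernet flow-control (pause frame) settings
  ethtool -a enp101s0f0

  # If RX/TX pause is enabled, try turning it off (on both ends of the link)
  ethtool -A enp101s0f0 rx off tx off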

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Multi-core scalability problems
@ 2020-10-13 13:49 Federico Parola
  2020-10-13 16:41 ` Jesper Dangaard Brouer
  2020-10-13 16:44 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 14+ messages in thread
From: Federico Parola @ 2020-10-13 13:49 UTC (permalink / raw)
  To: xdp-newbies

Hello,
I'm testing the performance of XDP when dropping packets using multiple 
cores and I'm getting unexpected results.
My machine is equipped with a dual port Intel XL710 40 GbE and an Intel 
Xeon Gold 5120 CPU @ 2.20GHz with 14 cores (HyperThreading disabled), 
running Ubuntu server 18.04 with kernel 5.8.12.
I'm using the xdp_rxq_info program from the kernel tree samples to drop 
packets.
I generate 64 bytes UDP packets with MoonGen for a total of 42 Mpps. 
Packets are uniformly distributed in different flows (different src 
port) and I use flow direction rules on the rx NIC to send these flows 
to different queues/cores.
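
(The flow steering rules look roughly like this; the exact source ports and
queue indices below are only an example:)

  # Steer each generated UDP flow (distinct src port) to its own RX queue
  ethtool -N enp101s0f0 flow-type udp4 src-port 1000 action 0
  ethtool -N enp101s0f0 flow-type udp4 src-port 1001 action 1
  ethtool -N enp101s0f0 flow-type udp4 src-port 1002 action 2
  ethtool -N enp101s0f0 flow-type udp4 src-port 1003 action 3
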
Here are my results:

1 FLOW:
Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       17784270    0
XDP-RX CPU      total   17784270

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   17784270    0
rx_queue_index    0:sum 17784270
---

2 FLOWS:
Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       7016363     0
XDP-RX CPU      1       7017291     0
XDP-RX CPU      total   14033655

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   7016366     0
rx_queue_index    0:sum 7016366
rx_queue_index    1:1   7017294     0
rx_queue_index    1:sum 7017294
---

4 FLOWS:
Running XDP on dev:enp101s0f0 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       2359478     0
XDP-RX CPU      1       2358508     0
XDP-RX CPU      2       2357042     0
XDP-RX CPU      3       2355396     0
XDP-RX CPU      total   9430425

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   2359474     0
rx_queue_index    0:sum 2359474
rx_queue_index    1:1   2358504     0
rx_queue_index    1:sum 2358504
rx_queue_index    2:2   2357040     0
rx_queue_index    2:sum 2357040
rx_queue_index    3:3   2355392     0
rx_queue_index    3:sum 2355392

I don't understand why overall performance decreases with the number
of cores; according to [1] I would expect it to increase until reaching
a maximum value. Is there any parameter I should tune to overcome the
problem?

Thanks in advance for your help.
Federico Parola

[1] 
https://github.com/tohojo/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org


Thread overview: 14+ messages
     [not found] <VI1PR04MB3104C1D86BDC113F4AC0CF4A9E050@VI1PR04MB3104.eurprd04.prod.outlook.com>
2020-10-14  8:35 ` Multi-core scalability problems Federico Parola
2020-10-13 13:49 Federico Parola
2020-10-13 16:41 ` Jesper Dangaard Brouer
2020-10-13 16:44 ` Toke Høiland-Jørgensen
2020-10-14  6:56   ` Federico Parola
2020-10-14  9:15     ` Jesper Dangaard Brouer
2020-10-14 12:17       ` Federico Parola
2020-10-14 14:26         ` Jesper Dangaard Brouer
2020-10-15 12:04           ` Federico Parola
2020-10-15 13:22             ` Jesper Dangaard Brouer
2020-10-19 15:23               ` Federico Parola
2020-10-19 18:26                 ` Jesper Dangaard Brouer
2020-10-24 13:57                   ` Federico Parola
2020-10-26  8:14                     ` Jesper Dangaard Brouer
