From: Eric Dumazet <eric.dumazet@gmail.com>
To: "Íñigo Huguet" <ihuguet@redhat.com>,
	"Edward Cree" <ecree.xilinx@gmail.com>,
	habetsm.xilinx@gmail.com
Cc: netdev@vger.kernel.org, Dinan Gunawardena <dinang@xilinx.com>,
	Pablo Cascon <pabloc@xilinx.com>
Subject: Re: Bad performance in RX with sfc 40G
Date: Thu, 18 Nov 2021 09:19:39 -0800
Message-ID: <beef3b28-6818-df7b-eaad-8569cac5d79b@gmail.com>
In-Reply-To: <CACT4oudChHDKecLfDdA7R8jpQv2Nmz5xBS3hH_jFWeS37CnQGg@mail.gmail.com>



On 11/18/21 7:14 AM, Íñigo Huguet wrote:
> Hello,
> 
> Doing some tests a few weeks ago, I noticed very low RX performance
> with 40G Solarflare NICs. In tests with iperf3 I got more than 30Gbps
> in TX, but only around 15Gbps in RX. NICs from other vendors could
> send and receive over 30Gbps.
> 
> I ran the tests with multiple parallel streams in iperf3 (-P 8).
> 
> The models used are SFC9140 and SFC9220.
> 
> Perf showed that most of the time was being spent in
> `native_queued_spin_lock_slowpath`. Tracing the calls to it with
> bpftrace, I found that most of them came from __napi_poll > efx_poll >
> efx_fast_push_rx_descriptors > __alloc_pages > get_page_from_freelist > ...
> 
> Please can you help me investigate the issue? At first sight, it looks
> like a suboptimal memory allocation strategy, or maybe a failure in the
> page recycling strategy...
> 
> This is the bpftrace output for the two call chains that repeat the
> most, both from sfc:
> 
> @[
>     native_queued_spin_lock_slowpath+1
>     _raw_spin_lock+26
>     rmqueue_bulk+76
>     get_page_from_freelist+2295
>     __alloc_pages+214
>     efx_fast_push_rx_descriptors+640
>     efx_poll+660
>     __napi_poll+42
>     net_rx_action+547
>     __softirqentry_text_start+208
>     __irq_exit_rcu+179
>     common_interrupt+131
>     asm_common_interrupt+30
>     cpuidle_enter_state+199
>     cpuidle_enter+41
>     do_idle+462
>     cpu_startup_entry+25
>     start_kernel+2465
>     secondary_startup_64_no_verify+194
> ]: 2650
> @[
>     native_queued_spin_lock_slowpath+1
>     _raw_spin_lock+26
>     rmqueue_bulk+76
>     get_page_from_freelist+2295
>     __alloc_pages+214
>     efx_fast_push_rx_descriptors+640
>     efx_poll+660
>     __napi_poll+42
>     net_rx_action+547
>     __softirqentry_text_start+208
>     __irq_exit_rcu+179
>     common_interrupt+131
>     asm_common_interrupt+30
>     cpuidle_enter_state+199
>     cpuidle_enter+41
>     do_idle+462
>     cpu_startup_entry+25
>     secondary_startup_64_no_verify+194
> ]: 17119
> 
> --
> Íñigo Huguet
> 
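(For reference: a multi-stream run like the one described above can be
reproduced with iperf3 along these lines; the address is a placeholder
and the exact invocation used is an assumption.)

  # On the machine with the sfc NIC (the server receives by default):
  iperf3 -s

  # On the peer: -P 8 opens 8 parallel streams; the default direction
  # exercises the sfc machine's RX path. Add -R to reverse and test TX.
  iperf3 -c <sfc-host-ip> -P 8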
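(Likewise, a per-stack call-count aggregation like the one quoted above
can be collected with a bpftrace one-liner such as the following; the
exact invocation used is an assumption based on the output format.)

  bpftrace -e 'kprobe:native_queued_spin_lock_slowpath { @[kstack] = count(); }'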


You could try to:

Make the RX ring buffers bigger (ethtool -G eth0 rx 8192)
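
(The current and maximum supported ring sizes can be checked first with
ethtool's query form; the maximum is NIC/driver dependent.)

  ethtool -g eth0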

and/or

Make sure your TCP socket receive buffer is smaller than the number of frames in the RX ring buffer:

echo "4096 131072 2097152" >/proc/sys/net/ipv4/tcp_rmem

You can also try the latest net-next, as TCP recently gained a change that helps this case:

f35f821935d8df76f9c92e2431a225bdff938169 tcp: defer skb freeing after socket lock is released
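
(To check whether a given kernel tree already contains that commit,
something like this should work from a git checkout of the tree:)

  git merge-base --is-ancestor f35f821935d8 HEAD \
      && echo "defer-skb-freeing change present"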

Thread overview: 15+ messages
2021-11-18 15:14 Bad performance in RX with sfc 40G Íñigo Huguet
2021-11-18 17:19 ` Eric Dumazet [this message]
2021-12-02 14:26   ` Íñigo Huguet
2021-11-20  8:31 ` Martin Habets
2021-12-09 12:06   ` Íñigo Huguet
2021-12-23 13:18     ` Íñigo Huguet
2022-01-02  9:22       ` Martin Habets
2022-01-10  8:58         ` [PATCH net-next] sfc: The size of the RX recycle ring should be more flexible Martin Habets
2022-01-10  9:31           ` Íñigo Huguet
2022-01-12  9:05             ` Martin Habets
2022-01-31 11:08             ` Martin Habets
2022-01-10 17:22           ` Jakub Kicinski
2022-01-12  9:08             ` Martin Habets
2022-01-31 11:10           ` [PATCH V2 " Martin Habets
2022-02-02  5:10             ` patchwork-bot+netdevbpf
