From: "Clément Léger" <clement.leger@bootlin.com>
To: Vladimir Oltean <vladimir.oltean@nxp.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Rob Herring <robh+dt@kernel.org>,
	Claudiu Manoil <claudiu.manoil@nxp.com>,
	Alexandre Belloni <alexandre.belloni@bootlin.com>,
	"UNGLinuxDriver@microchip.com" <UNGLinuxDriver@microchip.com>,
	Andrew Lunn <andrew@lunn.ch>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"devicetree@vger.kernel.org" <devicetree@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Subject: Re: [PATCH v2 3/6] net: ocelot: pre-compute injection frame header content
Date: Mon, 15 Nov 2021 17:03:41 +0100
Message-ID: <20211115170341.7d48de0d@fixe.home>
In-Reply-To: <20211115143105.tmjviz7z7ckmlquk@skbuf>

On Mon, 15 Nov 2021 14:31:06 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:

> On Mon, Nov 15, 2021 at 03:06:20PM +0100, Clément Léger wrote:
> > On Mon, 15 Nov 2021 06:08:00 -0800,
> > Jakub Kicinski <kuba@kernel.org> wrote:
> >  
> > > On Mon, 15 Nov 2021 11:13:44 +0100 Clément Léger wrote:  
> > > > Test on standard packets with UDP (iperf3 -t 100 -l 1460 -u -b 0 -c *)
> > > > - With pre-computed header: UDP TX: 	33Mbit/s
> > > > - Without UDP TX: 			31Mbit/s  
> > > > -> 6.5% improvement  
> > > >
> > > > Test on small packets with UDP (iperf3 -t 100 -l 700 -u -b 0 -c *)
> > > > - With pre-computed header: UDP TX: 	15.8Mbit/s
> > > > - Without UDP TX: 			16.4Mbit/s  
> > > > -> 4.3% improvement  
> > >
> > > Something's wrong with these numbers or I'm missing context.
> > > You say improvement in both cases yet in the latter case the
> > > new number is lower?  
> >
> > You are right Jakub, I swapped the last two results,
> >
> > Test on small packets with UDP (iperf3 -t 100 -l 700 -u -b 0 -c *)
> >  - With pre-computed header: UDP TX: 	16.4Mbit/s
> >  - Without UDP TX: 			15.8Mbit/s  
> >  -> 4.3% improvement  
> 
> Even in reverse, something still seems wrong with the numbers.
> My DSPI controller can transfer at a higher data rate than that.
> Where is the rest of the time spent? Computing checksums?

While adding FDMA support, I was surprised by the low performance I
encountered, so I spent some time trying to understand where the time
was spent. First, I ran iperf3 in loopback (using lo), which yielded
the following results (of course, RX and TX run on the same CPU in
this case):

TCP (iperf3 -c localhost):
 - RX/TX: 84.0Mbit/s

UDP (iperf3 -u -b 0 -c localhost):
 - RX/TX: 65.0Mbit/s

So even in localhost mode, the CPU is already really slow and can only
sustain a very low throughput. I then checked the performance using
CPU-based injection/extraction and obtained the following results:

TCP (iperf3 -u -b 0 -c)
 - TX: 11.8Mbit/s
 - RX: 21.6Mbit/s

UDP (iperf3 -u -b 0 -c)
 - TX: 13.4Mbit/s
 - RX: not even possible, the CPU never managed to extract a single packet
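
To put these figures next to the loopback ones, the ratios can be
computed with a few lines of Python. This is only a rough sketch: the
dictionary and function names are made up for the illustration, and
loopback is not a strict upper bound since RX and TX share the same
CPU in that test.

```python
# Throughput figures (Mbit/s) copied from the results in this mail.
LOOPBACK = {"TCP": 84.0, "UDP": 65.0}
CPU_INJECTION = {
    ("TCP", "TX"): 11.8,
    ("TCP", "RX"): 21.6,
    ("UDP", "TX"): 13.4,
}


def fraction_of_loopback(proto, direction):
    """Return the CPU-based rate as a percentage of the loopback rate."""
    return 100.0 * CPU_INJECTION[(proto, direction)] / LOOPBACK[proto]


for proto, direction in CPU_INJECTION:
    pct = fraction_of_loopback(proto, direction)
    print(f"{proto} {direction}: {pct:.1f}% of loopback")
```

With the numbers above, this gives roughly 14.0% for TCP TX, 25.7% for
TCP RX and 20.6% for UDP TX.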

I then tried to find where the time was spent with ftrace (I kept only
the relevant functions that consume most of the time). The following
results were recorded when using iperf3 with CPU-based
injection/extraction.

In TCP TX, a lot of time is spent copying from user space:

41.71%  iperf3   [kernel.kallsyms]  [k] __raw_copy_to_user
 6.65%  iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 3.23%  iperf3   [kernel.kallsyms]  [k] do_ade
 2.10%  iperf3   [kernel.kallsyms]  [k] __ocelot_write_ix
 2.10%  iperf3   [kernel.kallsyms]  [k] handle_adel_int
 ...

In TCP RX, numbers are even worse for the time spent in
__raw_copy_to_user:

62.95% iperf3   [kernel.kallsyms]  [k] __raw_copy_to_user
 1.97% iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 1.15% iperf3   [kernel.kallsyms]  [k] __copy_page_start
 1.07% iperf3   [kernel.kallsyms]  [k] __skb_datagram_iter
 ...


In UDP TX, some time is spent handling locking and unaligned copies
as well as pushing packets. The unaligned copies are due to the driver
directly accessing the bytes of the packets as words, which can be
costly when the data is misaligned.

17.97%  iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
11.94%  iperf3   [kernel.kallsyms]  [k] do_ade
 9.07%  iperf3   [kernel.kallsyms]  [k] __ocelot_write_ix
 7.74%  iperf3   [kernel.kallsyms]  [k] handle_adel_int
 5.78%  iperf3   [kernel.kallsyms]  [k] copy_from_kernel_nofault
 4.71%  iperf3   [kernel.kallsyms]  [k] __compute_return_epc_for_insn
 2.51%  iperf3   [kernel.kallsyms]  [k] regmap_write
 2.31%  iperf3   [kernel.kallsyms]  [k] __compute_return_epc
 ...
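
To get a rough idea of how much of the UDP TX profile goes to
unaligned access handling, the listing above can be summed per group of
symbols. The sketch below is only illustrative: the helper names are
made up, and the choice of which symbols count as "unaligned handling"
(do_ade, handle_adel_int and the __compute_return_epc helpers) is an
assumption, not something that was measured separately.

```python
import re

# Lines copied from the UDP TX listing (CPU-based injection).
UDP_TX_PROFILE = """\
17.97%  iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
11.94%  iperf3   [kernel.kallsyms]  [k] do_ade
 9.07%  iperf3   [kernel.kallsyms]  [k] __ocelot_write_ix
 7.74%  iperf3   [kernel.kallsyms]  [k] handle_adel_int
 5.78%  iperf3   [kernel.kallsyms]  [k] copy_from_kernel_nofault
 4.71%  iperf3   [kernel.kallsyms]  [k] __compute_return_epc_for_insn
 2.51%  iperf3   [kernel.kallsyms]  [k] regmap_write
 2.31%  iperf3   [kernel.kallsyms]  [k] __compute_return_epc
"""

# ASSUMPTION: these symbols are attributed to address-error (unaligned
# access) handling; the grouping is a guess made for this illustration.
UNALIGNED_SYMBOLS = {
    "do_ade",
    "handle_adel_int",
    "__compute_return_epc_for_insn",
    "__compute_return_epc",
}

# Matches "<pct>%  <command>  [<object>]  [<type>] <symbol>".
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+\S+\s+\[\S+\]\s+\[\w\]\s+(\S+)\s*$")


def parse_profile(text):
    """Return a {symbol: percentage} dict from a listing like the one above."""
    samples = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            samples[match.group(2)] = float(match.group(1))
    return samples


def share(samples, symbols):
    """Sum the percentages of the given symbols, rounded to 2 decimals."""
    return round(sum(pct for sym, pct in samples.items() if sym in symbols), 2)


samples = parse_profile(UDP_TX_PROFILE)
print(f"unaligned handling: {share(samples, UNALIGNED_SYMBOLS)}%")
```

Under that grouping, about 26.7% of the samples listed for UDP TX fall
into unaligned access handling, more than _raw_spin_unlock_irqrestore
alone.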

In UDP RX (iperf3 with -b 5M to ensure packets are received), time is
spent in floating point emulation and various other functions.

7.26% iperf3   [kernel.kallsyms]  [k] cop1Emulate
2.84% iperf3   [kernel.kallsyms]  [k] do_select
2.08% iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
2.06% iperf3   [kernel.kallsyms]  [k] fpu_emulator_cop1Handler
2.01% iperf3   [kernel.kallsyms]  [k] tcp_poll
2.00% iperf3   [kernel.kallsyms]  [k] __raw_copy_to_user


When using the FDMA, the results are the following:

In TCP TX, copy from user is still present and checksumming takes quite
some time.

31.31% iperf3   [kernel.kallsyms]  [k] __raw_copy_to_user
10.48% iperf3   [kernel.kallsyms]  [k] __csum_partial_copy_to_user
 3.73% iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 2.08% iperf3   [kernel.kallsyms]  [k] tcp_ack
 1.68% iperf3   [kernel.kallsyms]  [k] ocelot_fdma_napi_poll
 1.63% iperf3   [kernel.kallsyms]  [k] tcp_write_xmit
 1.05% iperf3   [kernel.kallsyms]  [k] finish_task_switch

In TCP RX, the majority of time is still taken by __raw_copy_to_user.

63.95%  iperf3   [kernel.kallsyms]  [k] __raw_copy_to_user
 1.29%  iperf3   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 1.23%  iperf3   [kernel.kallsyms]  [k] tcp_recvmsg_locked
 1.23%  iperf3   [kernel.kallsyms]  [k] __skb_datagram_iter
 1.07%  iperf3   [kernel.kallsyms]  [k] vfs_read

In UDP TX, time is spent in softirq entry and in checksumming.

9.01% iperf3   [kernel.kallsyms]  [k] __softirqentry_text_start
7.07% iperf3   [kernel.kallsyms]  [k] __csum_partial_copy_to_user
2.28% iperf3   [kernel.kallsyms]  [k] __ip_append_data.isra.0
2.10% iperf3   [kernel.kallsyms]  [k] __dev_queue_xmit
2.08% iperf3   [kernel.kallsyms]  [k] siphash_3u32
2.06% iperf3   [kernel.kallsyms]  [k] udp_sendmsg

And in UDP RX, again, time is spent in floating point emulation and
checksumming.

10.33% iperf3   [kernel.kallsyms]  [k] cop1Emulate
 7.62% iperf3   [kernel.kallsyms]  [k] csum_partial
 3.32% iperf3   [kernel.kallsyms]  [k] do_select
 2.69% iperf3   [kernel.kallsyms]  [k] ieee754dp_sub
 2.68% iperf3   [kernel.kallsyms]  [k] fpu_emulator_cop1Handler
 2.56% iperf3   [kernel.kallsyms]  [k] ieee754dp_add
 2.33% iperf3   [kernel.kallsyms]  [k] ieee754dp_div

After all these measurements, the CPU appears to be the bottleneck and
simply spends a lot of time in various functions. I did not go further
using perf events since there was no real reason to dig deeper in that
direction.

-- 
Clément Léger,
Embedded Linux and Kernel engineer at Bootlin
https://bootlin.com
