Re: dsa_master_find_slave()'s time complexity and potential performance hit

From: Vladimir Oltean <olteanv@gmail.com>
To: DENG Qingfang <dqfext@gmail.com>
Cc: netdev <netdev@vger.kernel.org>, "Andrew Lunn" <andrew@lunn.ch>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"David Miller" <davem@davemloft.net>,
	"Florian Fainelli" <f.fainelli@gmail.com>,
	"Vivien Didelot" <vivien.didelot@gmail.com>,
	linux-kernel@vger.kernel.org,
	"Russell King - ARM Linux admin" <linux@armlinux.org.uk>,
	"Birger Koblitz" <git@birger-koblitz.de>,
	"Bjørn Mork" <bjorn@mork.no>,
	"Stijn Segers" <foss@volatilesystems.org>
Subject: Re: dsa_master_find_slave()'s time complexity and potential performance hit
Date: Tue, 2 Mar 2021 13:28:42 +0200
Message-ID: <20210302112842.5t54kgz3j556cm52@skbuf>
In-Reply-To: <CALW65jatBuoE=NDRqccfiMVugPh5eeYSf-9a9qWYhvvszD2Jiw@mail.gmail.com>

On Tue, Mar 02, 2021 at 01:51:42PM +0800, DENG Qingfang wrote:
> Since commit 7b9a2f4bac68 ("net: dsa: use ports list to find slave"),
> dsa_master_find_slave() has been iterating over a linked list instead
> of accessing arrays, making its time complexity O(n).
> The said function is called frequently in DSA RX path, so it may cause
> a performance hit, especially for switches that have many ports (20+)
> such as RTL8380/8390/9300 (There is a downstream DSA driver for it,
> see https://github.com/openwrt/openwrt/tree/openwrt-21.02/target/linux/realtek/files-5.4/drivers/net/dsa/rtl83xx).
> I don't have one of those switches, so I can't test if the performance
> impact is huge or not.

You actually can test that: you could create a tagger in mainline based
on the rtl83xx tagger from downstream, and then modify dsa_loop to use
DSA_TAG_PROTO_RTL83XX.

Then you can craft some packets and inject them into the port on which
dsa_loop is attached using tcpreplay.
What I do is:
- I initially send some packets using the xmit function of the tagger,
  just to have an initial template to start with. This assumes that the
  xmit format is more or less similar to the rcv format.
- I capture those xmit packets using
  tcpdump -i eth0 -Q out -w tagger-xmit.pcap
- then I open tagger-xmit.pcap in Wireshark, run Export Specified Packets
  and save the result in the K12 text file format as tagger-xmit.txt
- I edit the tagger-xmit.txt file to my liking; in this case you
  would have to create a receive packet on port 19 (the one where it's
  most expensive to do the linear lookup of the ports list)
- I import the edited tagger-xmit.txt file again in Wireshark and save it
  as a new tagger-rcv.pcap
- run tcpreplay on that pcap file in a loop

I would probably go with a very small packet size (64 bytes), and enable
IP routing between two DSA interfaces lan0 and lan1:

ip link set lan0 address de:ad:be:ef:00:00
ip link set lan1 address de:ad:be:ef:00:01
ip addr add 192.168.100.2/24 dev lan0
ip addr add 192.168.101.2/24 dev lan1
echo 1 > /proc/sys/net/ipv4/ip_forward
arp -s 192.168.100.1 00:01:02:03:04:05 dev lan0 # towards spoofed sender
arp -s 192.168.101.1 00:01:02:03:04:06 dev lan1 # towards spoofed receiver

I would make sure the test packet from tagger-rcv.pcap has:
- a source MAC address corresponding to your spoofed sender (in my
  example 00:01:02:03:04:05)
- a source IP address corresponding to your spoofed sender (in my
  example 192.168.100.1)
- a destination MAC address corresponding to the lan0 interface
  (de:ad:be:ef:00:00)
- a destination IP address corresponding to the spoofed receiver
  (192.168.101.1, i.e. a host on the lan1 subnet, not lan1 itself)

Then the network stack should route the packet received on lan0 by
replacing the destination MAC address with that of the spoofed receiver
(00:01:02:03:04:06), decrementing the IP TTL (from 64 to 63, if the
packet was crafted with a TTL of 64) and sending it through lan1
according to the routing table.
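
To make the expected rewrite concrete, here is a minimal userspace sketch
in C of the test frame and of what forwarding should do to it. It is not
kernel code. It builds only the untagged Ethernet + IPv4 part (the switch
tag is tagger-specific and left out), it assumes the spoofed receiver is
192.168.101.1, and the names build_test_frame() and forward_rewrite() are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets inside an untagged Ethernet II + IPv4 frame. */
enum {
	ETH_DST = 0, ETH_SRC = 6, ETH_TYPE = 12, ETH_HLEN = 14,
	IP_HLEN = 20,
	IP_TTL = ETH_HLEN + 8,   /* TTL byte, followed by the protocol byte */
	IP_CSUM = ETH_HLEN + 10, /* header checksum, big endian */
	FRAME_LEN = 60,          /* 64 bytes on the wire once the FCS is added */
};

/* RFC 1071 Internet checksum over an even number of bytes. */
static uint16_t ip_checksum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Hypothetical helper: fill in the fields listed for the test packet. */
static void build_test_frame(uint8_t f[FRAME_LEN])
{
	static const uint8_t dst[6] = { 0xde, 0xad, 0xbe, 0xef, 0x00, 0x00 }; /* lan0 */
	static const uint8_t src[6] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05 }; /* sender */
	static const uint8_t ip[IP_HLEN] = {
		0x45, 0x00, 0x00, 0x2e, /* IPv4, IHL 5, total length 46 */
		0x00, 0x00, 0x00, 0x00, /* id, flags, fragment offset */
		64, 17, 0x00, 0x00,     /* TTL 64, UDP, checksum set below */
		192, 168, 100, 1,       /* source: spoofed sender */
		192, 168, 101, 1,       /* destination: assumed spoofed receiver */
	};
	uint16_t csum;

	memset(f, 0, FRAME_LEN);        /* UDP header and payload stay zeroed */
	memcpy(f + ETH_DST, dst, 6);
	memcpy(f + ETH_SRC, src, 6);
	f[ETH_TYPE] = 0x08;             /* EtherType 0x0800 = IPv4 */
	f[ETH_TYPE + 1] = 0x00;
	memcpy(f + ETH_HLEN, ip, IP_HLEN);
	csum = ip_checksum(f + ETH_HLEN, IP_HLEN);
	f[IP_CSUM] = (uint8_t)(csum >> 8);
	f[IP_CSUM + 1] = (uint8_t)(csum & 0xff);
}

/*
 * Hypothetical helper: mimic the forwarding rewrite. New MAC addresses,
 * TTL minus one, and an incremental checksum update as in RFC 1624:
 * HC' = ~(~HC + ~m + m'). A real stack drops packets whose TTL would
 * reach zero; this sketch does not check for that.
 */
static void forward_rewrite(uint8_t *f, const uint8_t new_dst[6],
			    const uint8_t new_src[6])
{
	uint16_t old_word = (uint16_t)((f[IP_TTL] << 8) | f[IP_TTL + 1]);
	uint16_t old_csum = (uint16_t)((f[IP_CSUM] << 8) | f[IP_CSUM + 1]);
	uint16_t new_word;
	uint32_t sum;

	memcpy(f + ETH_DST, new_dst, 6);
	memcpy(f + ETH_SRC, new_src, 6);
	f[IP_TTL]--;
	new_word = (uint16_t)((f[IP_TTL] << 8) | f[IP_TTL + 1]);

	sum = (uint32_t)(uint16_t)~old_csum + (uint32_t)(uint16_t)~old_word +
	      (uint32_t)new_word;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	f[IP_CSUM] = (uint8_t)((~sum >> 8) & 0xff);
	f[IP_CSUM + 1] = (uint8_t)(~sum & 0xff);
}
```

Calling build_test_frame() and then forward_rewrite() with the receiver's
MAC as new_dst and lan1's MAC as new_src gives the frame you would expect
to capture on the way out of lan1. Running ip_checksum() over the 20-byte
header returns 0 both before and after the rewrite if the checksum is
valid.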

To make sure your throughput is consistent, you can do some things such
as adding static flow steering rules on the DSA master. These ensure that
packets from the same flow are affine to the same CPU, and that if you
send bidirectional traffic, it gets load balanced across multiple CPUs:

ethtool --config-nfc eth0 flow-type ether dst de:ad:be:ef:00:00 m ff:ff:ff:ff:ff:ff action 0
ethtool --config-nfc eth0 flow-type ether dst de:ad:be:ef:00:01 m ff:ff:ff:ff:ff:ff action 1

Also, you should probably turn off GRO since it's not useful with IP
forwarding and it takes a lot of time to do the re-segmentation on TX,
to recalculate the checksums and all.

ethtool -K lan0 gro off
ethtool -K lan1 gro off

You could probably adjust things a bit, like for example see if the rcv
throughput on lan19 is higher than the throughput on lan0.

That should give you a baseline. Only then would I start hacking at
dsa_master_find_slave and see what benefit it brings to replace the list
lookup with something of constant time complexity, such as an array
indexed by port number or something similar.
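
As a rough illustration of the two lookup strategies, here is a minimal
userspace sketch in C. It is not the kernel implementation: struct port is
a simplified stand-in for the real port structure, MAX_PORTS is an
arbitrary bound, and find_by_list(), find_by_array() and build_ports() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 32 /* arbitrary bound for this sketch, not a kernel constant */

/* Simplified stand-in for a switch port. */
struct port {
	int index;
	struct port *next; /* singly linked list of all ports */
};

/* O(n): walk the list until the port index matches. */
static struct port *find_by_list(struct port *head, int index)
{
	for (struct port *p = head; p; p = p->next)
		if (p->index == index)
			return p;
	return NULL;
}

/* O(1): direct lookup in a table indexed by port number. */
static struct port *find_by_array(struct port *table[MAX_PORTS], int index)
{
	if (index < 0 || index >= MAX_PORTS)
		return NULL;
	return table[index];
}

/* Populate both the list and the table over the same n ports. */
static struct port *build_ports(struct port *storage,
				struct port *table[MAX_PORTS], int n)
{
	assert(n > 0 && n <= MAX_PORTS);
	for (int i = 0; i < MAX_PORTS; i++)
		table[i] = NULL;
	for (int i = 0; i < n; i++) {
		storage[i].index = i;
		storage[i].next = (i + 1 < n) ? &storage[i + 1] : NULL;
		table[i] = &storage[i];
	}
	return &storage[0];
}
```

With 20 ports, find_by_list(head, 19) has to visit all 20 nodes, while
find_by_array(table, 19) is a single bounds check and load. Whether that
difference is visible next to the rest of the RX path is exactly what the
baseline measurement should tell you.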

I'm curious what you come up with.