From: Dave Taht <dave.taht@gmail.com>
To: "Rafał Miłecki" <zajec5@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>, Felix Fietkau <nbd@nbd.name>,
	Arnd Bergmann <arnd@arndb.de>,
	 Alexander Lobakin <alexandr.lobakin@intel.com>,
	Network Development <netdev@vger.kernel.org>,
	 linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	Russell King <linux@armlinux.org.uk>,
	 "openwrt-devel@lists.openwrt.org"
	<openwrt-devel@lists.openwrt.org>,
	Florian Fainelli <f.fainelli@gmail.com>
Subject: Re: Optimizing kernel compilation / alignments for network performance
Date: Tue, 10 May 2022 07:09:56 -0700
Message-ID: <CAA93jw5=Dh9w6x_EQtuWdAbWVUF00M+5x3idFz-XOvAzG5dMQw@mail.gmail.com>
In-Reply-To: <391ca2d1-6977-0c9b-588c-31ad9bb68c82@gmail.com>

I might have mentioned this before, but I'm really big on using the
flent tool to drive test runs. The comparison plots are to die for,
and it can also sample CPU and other statistics over time. I'm also
big on testing bidirectional functionality.

client$ flent -H server -t what_test_conditions_you_have \
    --step-size=.05 --te=upload_streams=4 -x --socket-stats tcp_nup

This gathers a lot of data about everything. The rrul test is one of my
favorites for creating a BitTorrent-like load.

flent is usually available via apt/rpm/etc. There are scripts that can
run on routers (OpenWrt has opkg install flent-tools), and you use ssh
to fire these off.

There are a few Python dependencies for flent-gui that aren't needed
for the flent server or client. Sometimes you have to compile and
install netperf on your own with
./configure --enable-demo

Please see flent.org for more details, and/or hit the flent-users list
for questions.

On Tue, May 10, 2022 at 5:03 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>
> On 6.05.2022 14:42, Andrew Lunn wrote:
> >>> I just took a quick look at the driver. It allocates and maps rx buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
> >>> This seems rather excessive, especially since most people are going to use a MTU of 1500.
> >>> My proposal would be to add support for making rx buffer size dependent on MTU, reallocating the ring on MTU changes.
> >>> This should significantly reduce the time spent on flushing caches.
> >>
> >> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
> >> configure MTU and add support for frames beyond 8192 byte size"):
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
> >>
> >> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
> >>
> >> I do all my testing with
> >> #define BGMAC_RX_MAX_FRAME_SIZE                      1536
> >
> > That helps show that cache operations are part of your bottleneck.
> >
> > Taking a quick look at the driver. On the receive side:
> >
> >                          /* Unmap buffer to make it accessible to the CPU */
> >                          dma_unmap_single(dma_dev, dma_addr,
> >                                           BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
> >
> > Here the data is mapped ready for the CPU to use.
> >
> >                          /* Get info from the header */
> >                          len = le16_to_cpu(rx->len);
> >                          flags = le16_to_cpu(rx->flags);
> >
> >                          /* Check for poison and drop or pass the packet */
> >                          if (len == 0xdead && flags == 0xbeef) {
> >                                  netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
> >                                             ring->start);
> >                                  put_page(virt_to_head_page(buf));
> >                                  bgmac->net_dev->stats.rx_errors++;
> >                                  break;
> >                          }
> >
> >                          if (len > BGMAC_RX_ALLOC_SIZE) {
> >                                  netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
> >                                             ring->start);
> >                                  put_page(virt_to_head_page(buf));
> >                                  bgmac->net_dev->stats.rx_length_errors++;
> >                                  bgmac->net_dev->stats.rx_errors++;
> >                                  break;
> >                          }
> >
> >                          /* Omit CRC. */
> >                          len -= ETH_FCS_LEN;
> >
> >                          skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
> >                          if (unlikely(!skb)) {
> >                                  netdev_err(bgmac->net_dev, "build_skb failed\n");
> >                                  put_page(virt_to_head_page(buf));
> >                                  bgmac->net_dev->stats.rx_errors++;
> >                                  break;
> >                          }
> >                          skb_put(skb, BGMAC_RX_FRAME_OFFSET +
> >                                  BGMAC_RX_BUF_OFFSET + len);
> >                          skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
> >                                   BGMAC_RX_BUF_OFFSET);
> >
> >                          skb_checksum_none_assert(skb);
> >                          skb->protocol = eth_type_trans(skb, bgmac->net_dev);
> >
> > and this is the first access of the actual data. You can make the
> > cache actually work for you, rather than against you, by adding a call to
> >
> >       prefetch(buf);
> >
> > just after the dma_unmap_single(). That will start getting the frame
> > header from DRAM into cache, so hopefully it is available by the time
> > eth_type_trans() is called and you don't have a cache miss.
>
>
> I don't think that analysis is correct.
>
> Please take a look at the following lines:
> struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
> void *buf = slot->buf;
>
> The first thing we do after the dma_unmap_single() call is read
> rx->len. That actually points to DMA data. There is nothing we could
> keep the CPU busy with while prefetching data.
>
> FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed by
> a single Mb/s. Speed was exactly the same as without the prefetch() call.



-- 
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC



