* XDP redirect throughput with multi-CPU i40e
@ 2022-07-12 19:55 Adam Smith
  2022-07-12 21:19 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 4+ messages in thread
From: Adam Smith @ 2022-07-12 19:55 UTC (permalink / raw)
  To: xdp-newbies

Hello,

I have a question regarding bpf_redirect()/bpf_redirect_map() and the
throughput we are seeing in a test. The environment is as follows:

- Debian Bullseye, running the 5.18.0-0.bpo.1-amd64 kernel from
bullseye-backports (also tested on 5.16)
- Intel Xeon X3430 @ 2.40 GHz, 4 cores, no HT
- Intel X710-DA2 using the i40e driver included with the kernel
- Both interfaces (enp1s0f0 and enp1s0f1) in a simple netfilter bridge
- RX/TX ring parameters both set to the maximum of 4096, with no other
NIC-specific parameters changed

Each interface has 4 combined IRQs, pinned via set_irq_affinity.
`irqbalance` is not installed.

Traffic is generated by a directly attached machine via iperf3 3.9
(`iperf3 -c 192.168.1.3 -t 0 --bidir`) to a directly attached server
on the other side.

The machine under test does nothing more than forward packets as a
transparent bridge.

An XDP program is installed on enp1s0f0 to redirect to enp1s0f1, and
on enp1s0f1 to redirect to enp1s0f0. I have tried programs that simply
call `bpf_redirect()`, as well as programs that share a device map and
call `bpf_redirect_map()`, with identical results.
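
For reference, the devmap variant is along the lines of this minimal
sketch (the map layout and names here are illustrative rather than our
exact program). Userspace fills key 0 with the ifindex of the egress
interface, i.e. enp1s0f1 for the copy attached to enp1s0f0 and vice
versa:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u32);
} tx_port SEC(".maps");

SEC("xdp")
int xdp_redirect_devmap(struct xdp_md *ctx)
{
        /* Key 0 holds the ifindex of the egress device for this interface */
        return bpf_redirect_map(&tx_port, 0, 0);
}

/* The plain-helper variant simply returns bpf_redirect(egress_ifindex, 0). */

char _license[] SEC("license") = "GPL";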

When the channel parameters for each interface are reduced to a single
IRQ via `ethtool -L enp1s0f0 combined 1`, and both interfaces' IRQs
are bound to the same CPU core via smp_affinity, XDP produces a higher
bitrate with lower CPU utilization than the non-XDP tests:
- Stock netfilter bridge: 9.11 Gbps in both directions at 98%
utilization of pinned core.
- XDP: Approximately 9.18 Gbps in both directions at 50% utilization
of pinned core.

However, when multiple cores are engaged (combined 4, with
set_irq_affinity), XDP processes markedly fewer packets per second
(approximately 950,000 vs. approximately 1.6 million). iperf3 also
shows a large number of retransmissions in its output regardless of
CPU engagement (approximately 6,500 with XDP over 2 minutes vs. 850
with the single-core tests).

This is a sample taken from the xdp_monitor tool in the kernel's
samples/bpf directory, showing redirection and transmission of packets
with XDP engaged:

Summary                    944,508 redir/s        0 err,drop/s    944,506 xmit/s
  kthread                        0 pkt/s          0 drop/s              0 sched
  redirect total           944,508 redir/s
    cpu:0                  470,148 redir/s
    cpu:2                   15,078 redir/s
    cpu:3                  459,282 redir/s
  redirect_err                   0 error/s
  xdp_exception                  0 hit/s
  devmap_xmit total        944,506 xmit/s         0 drop/s              0 drv_err/s
    cpu:0                  470,148 xmit/s         0 drop/s              0 drv_err/s
    cpu:2                   15,078 xmit/s         0 drop/s              0 drv_err/s
    cpu:3                  459,280 xmit/s         0 drop/s              0 drv_err/s
  xmit enp1s0f0->enp1s0f1  485,249 xmit/s         0 drop/s              0 drv_err/s
    cpu:0                  470,172 xmit/s         0 drop/s              0 drv_err/s
    cpu:2                   15,078 xmit/s         0 drop/s              0 drv_err/s
  xmit enp1s0f1->enp1s0f0  459,263 xmit/s         0 drop/s              0 drv_err/s
    cpu:3                  459,263 xmit/s         0 drop/s              0 drv_err/s

Our current hypothesis is that this is a CPU affinity issue. We
believe a different core is being used for transmission. In an effort
to prove this, how can we measure whether bpf_redirect() is causing
packets to be transmitted by a different core than the one they were
received on? We are still trying to understand how bpf_redirect()
selects which core/IRQ to transmit on, and would appreciate any
insight or follow-up material to research.
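
One idea we are considering (an untested sketch, and possibly
redundant with the per-CPU breakdown xdp_monitor already gives us) is
attaching a minimal program to the xdp_devmap_xmit tracepoint and
bumping a per-CPU counter, so we can compare which cores perform the
devmap flush against the per-queue RX interrupt counters in
/proc/interrupts:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} xmit_per_cpu SEC(".maps");

/* Counts flush events (not packets) per CPU; we only care which CPU the
 * devmap flush runs on, so the tracepoint's argument layout is not used. */
SEC("raw_tracepoint/xdp_devmap_xmit")
int count_devmap_xmit(void *ctx)
{
        __u32 key = 0;
        __u64 *cnt = bpf_map_lookup_elem(&xmit_per_cpu, &key);

        if (cnt)
                (*cnt)++;       /* per-CPU slot, no atomics needed */
        return 0;
}

char _license[] SEC("license") = "GPL";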

Any additional information on how we might be able to overcome this
would be deeply appreciated!

Best regards,
Adam Smith

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: XDP redirect throughput with multi-CPU i40e
  2022-07-12 19:55 XDP redirect throughput with multi-CPU i40e Adam Smith
@ 2022-07-12 21:19 ` Toke Høiland-Jørgensen
  2022-07-13 10:16   ` Maciej Fijalkowski
  0 siblings, 1 reply; 4+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-12 21:19 UTC (permalink / raw)
  To: Adam Smith, xdp-newbies

Adam Smith <hrotsvit@gmail.com> writes:

> [...]
>
> Our current hypothesis is that this is a CPU affinity issue. We
> believe a different core is being used for transmission. In an effort
> to prove this, how can we measure whether bpf_redirect() is causing
> packets to be transmitted by a different core than the one they were
> received on? We are still trying to understand how bpf_redirect()
> selects which core/IRQ to transmit on, and would appreciate any
> insight or follow-up material to research.

There is no mechanism in bpf_redirect() to switch CPUs (outside of
cpumap). When your program returns XDP_REDIRECT, the frame is added to
a per-device, per-CPU flush list, which is then flushed on that same
CPU. The i40e driver does allocate separate TX rings for XDP, though,
and I'm not sure exactly how it does that, so maybe those are what's
missing. You should be able to see drops in the output if that's
what's going on, and the packets should still be processed by XDP.
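
Roughly, the enqueue path looks like this (paraphrased from
kernel/bpf/devmap.c; names and details are approximate, so don't read
it as the literal source):

static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
                       struct net_device *dev_rx)
{
        /* Both the bulk queue and the flush list are per-CPU, so the
         * eventual ndo_xdp_xmit() runs on the CPU that ran the XDP
         * program. */
        struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
        struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);

        if (bq->count == DEV_MAP_BULK_SIZE)
                bq_xmit_all(bq, 0);     /* calls the driver's ndo_xdp_xmit() */

        bq->q[bq->count++] = xdpf;

        /* drained by xdp_do_flush() at the end of the driver's NAPI poll */
        if (!bq->flush_node.prev)
                list_add(&bq->flush_node, flush_list);
}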

So it sounds more like the hardware configuration is causing packet
loss before it even hits XDP. Do you see anything in the ethtool stats
that might explain where the packets are being dropped?

-Toke


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: XDP redirect throughput with multi-CPU i40e
  2022-07-12 21:19 ` Toke Høiland-Jørgensen
@ 2022-07-13 10:16   ` Maciej Fijalkowski
  2022-07-13 15:18     ` Adam Smith
  0 siblings, 1 reply; 4+ messages in thread
From: Maciej Fijalkowski @ 2022-07-13 10:16 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Adam Smith, xdp-newbies

On Tue, Jul 12, 2022 at 11:19:11PM +0200, Toke Høiland-Jørgensen wrote:
> Adam Smith <hrotsvit@gmail.com> writes:
> 
> > [...]
> 
> There is no mechanism in bpf_redirect() to switch CPUs (outside of
> cpumap). When your program returns XDP_REDIRECT, the frame is added to
> a per-device, per-CPU flush list, which is then flushed on that same
> CPU. The i40e driver does allocate separate TX rings for XDP, though,
> and I'm not sure exactly how it does that, so maybe those are what's
> missing. You should be able to see drops in the output if that's
> what's going on, and the packets should still be processed by XDP.
> 
> So it sounds more like the hardware configuration is causing packet
> loss before it even hits XDP. Do you see anything in the ethtool stats
> that might explain where the packets are being dropped?

I don't know exactly which CPUs the IRQs are bound to, but most
probably this is a driver issue, as Toke is saying.

i40e_xdp_xmit() uses smp_processor_id() as the index into the XDP
rings array, so if you limit the queue count to 4 and bind an IRQ to,
say, CPU 10, it will return -ENXIO because queue_index will be >=
vsi->num_queue_pairs.

I believe such issues were addressed in the ice driver. There, the XDP
rings array is sized to num_possible_cpus() regardless of the user's
queue count setting, so smp_processor_id() can be used safely.
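
To illustrate, the relevant part of i40e_xdp_xmit() looks roughly like
this (abridged from drivers/net/ethernet/intel/i40e/i40e_txrx.c, error
handling trimmed):

int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
                  u32 flags)
{
        struct i40e_netdev_priv *np = netdev_priv(dev);
        unsigned int queue_index = smp_processor_id();
        struct i40e_vsi *vsi = np->vsi;
        struct i40e_ring *xdp_ring;

        /* No XDP TX ring for this CPU -> frames get dropped with -ENXIO */
        if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
                return -ENXIO;

        xdp_ring = vsi->xdp_rings[queue_index];
        /* ... i40e_xmit_xdp_ring() for each frame, then kick the ring ... */

        return n;
}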

Adam, could you skip the `ethtool -L $IFACE combined 4` step and work
with your 4 flows to see if there is any difference?

Maciej


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: XDP redirect throughput with multi-CPU i40e
  2022-07-13 10:16   ` Maciej Fijalkowski
@ 2022-07-13 15:18     ` Adam Smith
  0 siblings, 0 replies; 4+ messages in thread
From: Adam Smith @ 2022-07-13 15:18 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: Toke Høiland-Jørgensen, xdp-newbies

Hi,

Maciej - in this particular situation, `combined 4` was selected
because the CPU in use only has 4 cores, and 4 is also what the driver
auto-selects at boot.

Toke - we are seeing drops from `port.rx_dropped` on both interfaces:

1 IRQ, same CPU, no XDP       -- port.rx_dropped: 0 pps / interface
1 IRQ, same CPU, XDP_REDIRECT -- port.rx_dropped: approx. 50-75 pps / interface
4 IRQ, no XDP                 -- port.rx_dropped: approx. 25-50 pps / interface
4 IRQ, XDP_REDIRECT           -- port.rx_dropped: approx. 2000 pps / interface

`rx_dropped` remains 0 in all cases.

Of note: when XDP is not used in the 4-IRQ setup, CPU load shows up on
2 cores, corresponding to the IRQs handling the 2 primary traffic
flows generated by bidirectional iperf3 (a byproduct of RSS). When XDP
is used, the load on those two cores drops significantly, but we see
increased load on a 3rd core.

Thanks!
Adam

On Wed, Jul 13, 2022 at 5:17 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Jul 12, 2022 at 11:19:11PM +0200, Toke Høiland-Jørgensen wrote:
> > Adam Smith <hrotsvit@gmail.com> writes:
> >
> > > [...]
> >
> > There is no mechanism in bpf_redirect() to switch CPUs (outside of
> > cpumap). When your program returns XDP_REDIRECT, the frame is added to
> > a per-device, per-CPU flush list, which is then flushed on that same
> > CPU. The i40e driver does allocate separate TX rings for XDP, though,
> > and I'm not sure exactly how it does that, so maybe those are what's
> > missing. You should be able to see drops in the output if that's
> > what's going on, and the packets should still be processed by XDP.
> >
> > So it sounds more like the hardware configuration is causing packet
> > loss before it even hits XDP. Do you see anything in the ethtool stats
> > that might explain where the packets are being dropped?
>
> I don't know exactly which CPUs the IRQs are bound to, but most
> probably this is a driver issue, as Toke is saying.
>
> i40e_xdp_xmit() uses smp_processor_id() as the index into the XDP
> rings array, so if you limit the queue count to 4 and bind an IRQ to,
> say, CPU 10, it will return -ENXIO because queue_index will be >=
> vsi->num_queue_pairs.
>
> I believe such issues were addressed in the ice driver. There, the XDP
> rings array is sized to num_possible_cpus() regardless of the user's
> queue count setting, so smp_processor_id() can be used safely.
>
> Adam, could you skip the `ethtool -L $IFACE combined 4` step and work
> with your 4 flows to see if there is any difference?
>
> Maciej

^ permalink raw reply	[flat|nested] 4+ messages in thread
