* Dropped packets mapping IRQs for adjusted queue counts on i40e
@ 2021-05-01  0:45 Zvi Effron
  [not found] ` <CADS2XXpjasmJKP__oHsrvv3EG8n-FjB6sqHwgQfh7QgeJ8GrrQ@mail.gmail.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Zvi Effron @ 2021-05-01  0:45 UTC (permalink / raw)
  To: Xdp

Hi-ho, friends!

I'm trying to tune network queues so that there's one queue per core, as
recommended in many, many places. I've noticed that if I adjust the number of
queues from the default AND adjust the IRQ affinities, then some percentage of
packets (varying from a small percentage to 100%, seemingly proportional to
the reduction in queue count) doesn't make it through.

For my setup, I have a single 36-core Skylake processor with two dual-port
X710 NICs. All traffic coming in on one port of each NIC is redirected out the
other port (traffic arrives on all four ports). 34 cores are isolated for
network processing. I adjust the combined queues down from the default of 36,
and then map the IRQ associated with each queue to one of the 34 isolated
cores. Everything works fine if I don't map the IRQs.

For a minimal repro case, I reduced my program to a blind redirector using a
devmap (it doesn't even adjust MACs, which is not a problem as my DUT is
directly connected to a measurement device in promiscuous mode), reproduced
below. I use bpftool to load 4 copies of the program and pin them, use bpftool
to configure the egress interface in the devmap, and then use ip link to
attach the programs to the interfaces.

I have played around with when I adjust the queue counts and IRQs (before
attaching XDP programs, after, XDP attachment in the middle, etc.) and it
doesn't seem to matter. But with any ordering, if I just don't remap the IRQs,
everything works fine, and if I remap, I lose packets.

Has anyone encountered anything like this? Does anyone know what might be
causing it?
How can I assign a single queue to a single core without using the default
number of queues and without losing packets?

Thanks!
--Zvi

#include <linux/bpf.h>

struct bpf_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;
	struct bpf_map_def* inner_map;
};

struct bpf_map_def __attribute__((section("maps"), used)) device_map = {
	.type = BPF_MAP_TYPE_DEVMAP,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u32),
	.max_entries = 1,
};

static int (*bpf_redirect_map)(void *map, int key, int flags) =
	(void *)BPF_FUNC_redirect_map;

__attribute__((section("xdp/test"), used))
int test(struct xdp_md *context)
{
	__u32 key = 0;
	bpf_redirect_map(&device_map, key, 0);
	return XDP_REDIRECT;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread
[parent not found: <CADS2XXpjasmJKP__oHsrvv3EG8n-FjB6sqHwgQfh7QgeJ8GrrQ@mail.gmail.com>]
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  [not found] ` <CADS2XXpjasmJKP__oHsrvv3EG8n-FjB6sqHwgQfh7QgeJ8GrrQ@mail.gmail.com>
@ 2021-05-01 17:56   ` Zvi Effron
  2021-05-02 20:16     ` T K Sourabh
  0 siblings, 1 reply; 18+ messages in thread
From: Zvi Effron @ 2021-05-01 17:56 UTC (permalink / raw)
  To: T K Sourabh; +Cc: Xdp

On Sat, May 1, 2021 at 8:23 AM T K Sourabh <sourabhtk37@gmail.com> wrote:
>
> When we turned off the irqbalance service, we would not see the issue,
> as it was messing up the affinities.
>
> Another thing to ensure: equal number of queues on both interfaces.
>
Thanks for the info, it does give me some areas to investigate.

We already have irqbalance turned off (uninstalled, actually), and we're using
the same number of queues on each device (we get the number of queues by
dividing the number of cores by the number of interfaces).

I am wondering if not having the matching queues for the ingress interface and
egress interface on the same cores might be a contributing factor?

--Zvi

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-01 17:56 ` Zvi Effron
@ 2021-05-02 20:16   ` T K Sourabh
  2021-05-04 23:07     ` Zvi Effron
  0 siblings, 1 reply; 18+ messages in thread
From: T K Sourabh @ 2021-05-02 20:16 UTC (permalink / raw)
  To: Zvi Effron; +Cc: Xdp

On Sat, May 1, 2021 at 11:26 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
> On Sat, May 1, 2021 at 8:23 AM T K Sourabh <sourabhtk37@gmail.com> wrote:
> >
> > When we turned off the irqbalance service, we would not see the issue,
> > as it was messing up the affinities.
> >
> > Another thing to ensure: equal number of queues on both interfaces.
> >
> Thanks for the info, it does give me some areas to investigate.
>
> We already have irqbalance turned off (uninstalled, actually), and
> we're using the same number of queues on each device (we get the
> number of queues by dividing the number of cores by the number of
> interfaces).
>
> I am wondering if not having the matching queues for the ingress
> interface and egress interface on the same cores might be a
> contributing factor?

I haven't tested it, but you could give it a try and see if the packet drop
still happens.

>
> --Zvi

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-02 20:16 ` T K Sourabh
@ 2021-05-04 23:07   ` Zvi Effron
  2021-05-05 18:55     ` Zvi Effron
  0 siblings, 1 reply; 18+ messages in thread
From: Zvi Effron @ 2021-05-04 23:07 UTC (permalink / raw)
  To: T K Sourabh; +Cc: Xdp

On Sun, May 2, 2021 at 1:16 PM T K Sourabh <sourabhtk37@gmail.com> wrote:
>
> On Sat, May 1, 2021 at 11:26 PM Zvi Effron <zeffron@riotgames.com> wrote:
> >
> > I am wondering if not having the matching queues for the ingress
> > interface and egress interface on the same cores might be a
> > contributing factor?
> I haven't tested it, but you could give it a try and see if the packet
> drop still happens.

I have now tested matching the IRQs for queue N on both the ingress and egress
interfaces so they are on the same core. (Queue 0 for ingress and egress are
on the same core, the same for queue 1 on each, etc., but queue 0 and queue 1
are on different cores.) This did not resolve the packet loss.

I'm at a loss for what could be causing the packet loss. It's clearly
something to do with the IRQ remapping, as all I need to do to remove the loss
is comment out the write to the smp_affinity files.

I'm suspecting it's something with how XDP_REDIRECT is implemented in the i40e
driver, but I don't know if this is a) cross-driver behavior, b) expected
behavior, or c) a bug.

--Zvi

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-04 23:07 ` Zvi Effron
@ 2021-05-05 18:55   ` Zvi Effron
  2021-05-05 19:53     ` Zvi Effron
                        ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Zvi Effron @ 2021-05-05 18:55 UTC (permalink / raw)
  To: T K Sourabh; +Cc: Xdp

On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote:
> I'm suspecting it's something with how XDP_REDIRECT is implemented in
> the i40e driver, but I don't know if this is a) cross-driver behavior,
> b) expected behavior, or c) a bug.

I think I've found the issue, and it appears to be specific to i40e (and maybe
other drivers, too, but not XDP itself).

When performing the XDP xmit, i40e uses smp_processor_id() to select the tx
queue (see
https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846).
I'm not 100% clear on how the CPU is selected, but since we don't use cores 0
and 1, we end up on a core whose id is higher than any available queue.

I'm going to try to modify our IRQ mappings to test this.

If I'm correct, this feels like a bug to me, since it requires a user to
understand low-level driver details to do IRQ remapping, which is a bit
higher level. But if it's intended, we'll just have to figure out how to work
around this. (Unfortunately, using split tx and rx queues is not possible
with i40e, so that easy solution is unavailable.)

--Zvi

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-05 18:55 ` Zvi Effron
@ 2021-05-05 19:53   ` Zvi Effron
  2021-05-05 19:55   ` David Ahern
  2021-05-05 20:01   ` Jesse Brandeburg
  2 siblings, 0 replies; 18+ messages in thread
From: Zvi Effron @ 2021-05-05 19:53 UTC (permalink / raw)
  To: T K Sourabh; +Cc: Xdp

On Wed, May 5, 2021 at 11:55 AM Zvi Effron <zeffron@riotgames.com> wrote:
>
> I'm going to try to modify our IRQ mappings to test this.

Confirmed. When I modify the IRQ mappings so that I don't use cores with a
higher number than the number of queues, the drops go away.

Going to move this over to the i40e driver's mailing list, as it's now
confirmed not to be an XDP issue.

Thanks!
--Zvi

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-05 18:55 ` Zvi Effron
  2021-05-05 19:53 ` Zvi Effron
@ 2021-05-05 19:55   ` David Ahern
  2021-05-05 20:01   ` Jesse Brandeburg
  2 siblings, 0 replies; 18+ messages in thread
From: David Ahern @ 2021-05-05 19:55 UTC (permalink / raw)
  To: Zvi Effron, T K Sourabh; +Cc: Xdp

On 5/5/21 12:55 PM, Zvi Effron wrote:
> On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote:
>> I'm suspecting it's something with how XDP_REDIRECT is implemented in
>> the i40e driver, but I don't know if this is a) cross-driver behavior,
>> b) expected behavior, or c) a bug.
> I think I've found the issue, and it appears to be specific to i40e
> (and maybe other drivers, too, but not XDP itself).
>
> When performing the XDP xmit, i40e uses smp_processor_id() to
> select the tx queue (see
> https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846).
> I'm not 100% clear on how the CPU is selected, but since we don't use
> cores 0 and 1, we end up on a core whose id is higher than any
> available queue.
>

That is a known problem; mlx5 has that constraint too.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-05 18:55 ` Zvi Effron
  2021-05-05 19:53 ` Zvi Effron
  2021-05-05 19:55 ` David Ahern
@ 2021-05-05 20:01   ` Jesse Brandeburg
  2021-05-05 21:21     ` [Intel-wired-lan] " Maciej Fijalkowski
  2 siblings, 1 reply; 18+ messages in thread
From: Jesse Brandeburg @ 2021-05-05 20:01 UTC (permalink / raw)
  To: Zvi Effron; +Cc: T K Sourabh, Xdp

Zvi Effron wrote:

> On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote:
> > I'm suspecting it's something with how XDP_REDIRECT is implemented in
> > the i40e driver, but I don't know if this is a) cross-driver behavior,
> > b) expected behavior, or c) a bug.
> I think I've found the issue, and it appears to be specific to i40e
> (and maybe other drivers, too, but not XDP itself).
>
> When performing the XDP xmit, i40e uses smp_processor_id() to
> select the tx queue (see
> https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846).
> I'm not 100% clear on how the CPU is selected, but since we don't use
> cores 0 and 1, we end up on a core whose id is higher than any
> available queue.
>
> I'm going to try to modify our IRQ mappings to test this.
>
> If I'm correct, this feels like a bug to me, since it requires a user
> to understand low-level driver details to do IRQ remapping, which is a
> bit higher level. But if it's intended, we'll just have to figure out
> how to work around this. (Unfortunately, using split tx and rx queues
> is not possible with i40e, so that easy solution is unavailable.)
>
> --Zvi

It seems like for the Intel drivers, igc, ixgbe, i40e, and ice all have this
problem.

Notably, igb fixes it like I would expect.

Let's talk about it over on intel-wired-lan and cc netdev.

Jesse

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-05 20:01 ` Jesse Brandeburg
@ 2021-05-05 21:21   ` Maciej Fijalkowski
  0 siblings, 0 replies; 18+ messages in thread
From: Maciej Fijalkowski @ 2021-05-05 21:21 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Zvi Effron, T K Sourabh, Xdp, intel-wired-lan, netdev,
	magnus.karlsson, kuba

On Wed, May 05, 2021 at 01:01:28PM -0700, Jesse Brandeburg wrote:
> Zvi Effron wrote:
>
> > On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote:
> > > I'm suspecting it's something with how XDP_REDIRECT is implemented in
> > > the i40e driver, but I don't know if this is a) cross-driver behavior,
> > > b) expected behavior, or c) a bug.
> > I think I've found the issue, and it appears to be specific to i40e
> > (and maybe other drivers, too, but not XDP itself).
> >
> > When performing the XDP xmit, i40e uses smp_processor_id() to
> > select the tx queue (see
> > https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846).
> > I'm not 100% clear on how the CPU is selected, but since we don't use
> > cores 0 and 1, we end up on a core whose id is higher than any
> > available queue.
> >
> > I'm going to try to modify our IRQ mappings to test this.
> >
> > If I'm correct, this feels like a bug to me, since it requires a user
> > to understand low-level driver details to do IRQ remapping, which is a
> > bit higher level. But if it's intended, we'll just have to figure out
> > how to work around this. (Unfortunately, using split tx and rx queues
> > is not possible with i40e, so that easy solution is unavailable.)
> >
> > --Zvi

Hey Zvi, sorry for the lack of assistance; there has been statutory free time
in Poland and today I'm in birthday mode, but Magnus and I managed to discuss
the issue and we feel we could have a solution for it. More below.

>
> It seems like for the Intel drivers, igc, ixgbe, i40e, and ice all have
> this problem.
>
> Notably, igb fixes it like I would expect.

igb is correct, but I think we would like to avoid introducing locking into
the XDP data path for higher-speed NICs.

We talked with Magnus and, for i40e and ice, which have lots of HW resources,
we could always create the xdp_rings array with num_online_cpus() entries and
index it with smp_processor_id(), regardless of the user's changes to the
queue count. This way smp_processor_id() provides the serialization by
itself, as we're under NAPI on a given CPU, so there's no need to introduce
locking - there is a per-CPU XDP ring.

If we instead stuck with the approach where the xdp_rings array is sized down
to the shrunk Rx queue count and indexed with a
smp_processor_id() % vsi->num_queue_pairs formula, we could have resource
contention. Say that on a 16-core system you did:

$ ethtool -L eth0 combined 2

and then mapped q0 to cpu 1 and q1 to cpu 11. Both queues would grab
xdp_rings[1], so we would have to introduce locking.

The proposed approach would just result in more Tx queues packed onto the Tx
ring container of a queue vector.

Thoughts? Any concerns? Should we have a 'fallback' mode if we run out of
queues?

Currently I have a patch for ice (which is trivial) that does the thing
described above, and I'm in the middle of testing it. I feel this also
addresses Jakub's input on the old patch that tried to resolve the issue in
that manner:
https://lore.kernel.org/netdev/20210123195219.55f6d4e1@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

>
> Let's talk about it over on intel-wired-lan and cc netdev.

Let's do this now.

>
> Jesse

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-05 21:21 ` [Intel-wired-lan] " Maciej Fijalkowski
@ 2021-05-06 10:29   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 18+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-05-06 10:29 UTC (permalink / raw)
  To: Maciej Fijalkowski, Jesse Brandeburg
  Cc: Zvi Effron, T K Sourabh, Xdp, intel-wired-lan, netdev,
	magnus.karlsson, kuba

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> igb is correct, but I think we would like to avoid introducing locking
> into the XDP data path for higher-speed NICs.
>
> We talked with Magnus and, for i40e and ice, which have lots of HW
> resources, we could always create the xdp_rings array with
> num_online_cpus() entries and index it with smp_processor_id(),
> regardless of the user's changes to the queue count.

What is "lots"? Systems with hundreds of CPUs exist (and I seem to recall an
issue with just such a system on Intel hardware(?)). Also, what if
num_online_cpus() changes?

> This way smp_processor_id() provides the serialization by itself, as
> we're under NAPI on a given CPU, so there's no need to introduce
> locking - there is a per-CPU XDP ring. If we instead stuck with the
> approach where the xdp_rings array is sized down to the shrunk Rx queue
> count and indexed with a smp_processor_id() % vsi->num_queue_pairs
> formula, we could have resource contention. Say that on a 16-core
> system you did:
>
> $ ethtool -L eth0 combined 2
>
> and then mapped q0 to cpu 1 and q1 to cpu 11. Both queues would grab
> xdp_rings[1], so we would have to introduce locking.
>
> The proposed approach would just result in more Tx queues packed onto
> the Tx ring container of a queue vector.
>
> Thoughts? Any concerns? Should we have a 'fallback' mode if we run out
> of queues?

Yes, please :)

-Toke

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e
  2021-05-06 10:29 ` [Intel-wired-lan] " Toke Høiland-Jørgensen
@ 2021-05-09 15:50   ` Maciej Fijalkowski
  0 siblings, 0 replies; 18+ messages in thread
From: Maciej Fijalkowski @ 2021-05-09 15:50 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesse Brandeburg, Zvi Effron, T K Sourabh, Xdp, intel-wired-lan,
	netdev, magnus.karlsson, kuba

On Thu, May 06, 2021 at 12:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>
> > igb is correct, but I think we would like to avoid introducing locking
> > into the XDP data path for higher-speed NICs.
> >
> > We talked with Magnus and, for i40e and ice, which have lots of HW
> > resources, we could always create the xdp_rings array with
> > num_online_cpus() entries and index it with smp_processor_id(),
> > regardless of the user's changes to the queue count.
>
> What is "lots"? Systems with hundreds of CPUs exist (and I seem to
> recall an issue with just such a system on Intel hardware(?)). Also,
> what if num_online_cpus() changes?

"Lots" is 16k for ice. For i40e, the datasheet says it's only 1536 for the
whole device, so I take back the statement that i40e has a lot of
resources :)

Also, s/num_online_cpus()/num_possible_cpus().

> > Thoughts? Any concerns? Should we have a 'fallback' mode if we run
> > out of queues?
>
> Yes, please :)

How do we have a fallback (in the drivers that need it) in a way that
wouldn't hurt the scenario where the queue-per-CPU requirement is satisfied?

>
> -Toke
>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [Intel-wired-lan] Dropped packets mapping IRQs for adjusted queue counts on i40e @ 2021-05-09 15:50 ` Maciej Fijalkowski 0 siblings, 0 replies; 18+ messages in thread From: Maciej Fijalkowski @ 2021-05-09 15:50 UTC (permalink / raw) To: intel-wired-lan On Thu, May 06, 2021 at 12:29:40PM +0200, Toke H?iland-J?rgensen wrote: > Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes: > > > On Wed, May 05, 2021 at 01:01:28PM -0700, Jesse Brandeburg wrote: > >> Zvi Effron wrote: > >> > >> > On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote: > >> > > I'm suspecting it's something with how XDP_REDIRECT is implemented in > >> > > the i40e driver, but I don't know if this is a) cross driver behavior, > >> > > b) expected behavior, or c) a bug. > >> > I think I've found the issue, and it appears to be specific to i40e > >> > (and maybe other drivers, too, but not XDP itself). > >> > > >> > When performing the XDP xmit, i40e uses the smp_processor_id() to > >> > select the tx queue (see > >> > https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846). > >> > I'm not 100% clear on how the CPU is selected (since we don't use > >> > cores 0 and 1), we end up on a core whose id is higher than any > >> > available queue. > >> > > >> > I'm going to try to modify our IRQ mappings to test this. > >> > > >> > If I'm correct, this feels like a bug to me, since it requires a user > >> > to understand low level driver details to do IRQ remapping, which is a > >> > bit higher level. But if it's intended, we'll just have to figure out > >> > how to work around this. (Unfortunately, using split tx and rx queues > >> > is not possible with i40e, so that easy solution is unavailable.) 
> >> > > >> > --Zvi > > > > Hey Zvi, sorry for the lack of assistance, there has been statutory free > > time in Poland and today i'm in the birthday mode, but we managed to > > discuss the issue with Magnus and we feel like we could have a solution > > for that, more below. > > > >> > >> > >> It seems like for Intel drivers, igc, ixgbe, i40e, ice all have > >> this problem. > >> > >> Notably, igb, fixes it like I would expect. > > > > igb is correct but I think that we would like to avoid the introduction of > > locking for higher speed NICs in XDP data path. > > > > We talked with Magnus that for i40e and ice that have lots of HW > > resources, we could always create the xdp_rings array of num_online_cpus() > > size and use smp_processor_id() for accesses, regardless of the user's > > changes to queue count. > > What is "lots"? Systems with hundreds of CPUs exist (and I seem to > recall an issue with just such a system on Intel hardware(?)). Also, > what if num_online_cpus() changes? "Lots" is 16k for ice. For i40e, the datasheet says it's only 1536 for the whole device, so I take back the statement that i40e has a lot of resources :) Also, s/num_online_cpus()/num_possible_cpus(). > > > This way the smp_processor_id() provides the serialization by itself as > > we're under napi on a given cpu, so there's no need for locking > > introduction - there is a per-cpu XDP ring provided. If we would stick to > > the approach where you adjust the size of xdp_rings down to the shrinked > > Rx queue count and use a smp_processor_id() % vsi->num_queue_pairs formula > > then we could have a resource contention. Say that you did on a 16 core > > system: > > $ ethtool -L eth0 combined 2 > > > > and then mapped the q0 to cpu1 and q1 to cpu 11. Both queues will grab the > > xdp_rings[1], so we would have to introduce the locking. > > > > Proposed approach would just result with more Tx queues packed onto Tx > > ring container of queue vector. > > > > Thoughts? Any concerns?
Should we have a 'fallback' mode if we would be > > out of queues? > > Yes, please :) How to have a fallback (in drivers that need it) in a way that wouldn't hurt the scenario where queue per cpu requirement is satisfied? > > -Toke > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e 2021-05-09 15:50 ` [Intel-wired-lan] " Maciej Fijalkowski @ 2021-05-10 11:22 ` Toke Høiland-Jørgensen -1 siblings, 0 replies; 18+ messages in thread From: Toke Høiland-Jørgensen @ 2021-05-10 11:22 UTC (permalink / raw) To: Maciej Fijalkowski Cc: Jesse Brandeburg, Zvi Effron, T K Sourabh, Xdp, intel-wired-lan, netdev, magnus.karlsson, kuba Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes: > On Thu, May 06, 2021 at 12:29:40PM +0200, Toke Høiland-Jørgensen wrote: >> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes: >> >> > On Wed, May 05, 2021 at 01:01:28PM -0700, Jesse Brandeburg wrote: >> >> Zvi Effron wrote: >> >> >> >> > On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote: >> >> > > I'm suspecting it's something with how XDP_REDIRECT is implemented in >> >> > > the i40e driver, but I don't know if this is a) cross driver behavior, >> >> > > b) expected behavior, or c) a bug. >> >> > I think I've found the issue, and it appears to be specific to i40e >> >> > (and maybe other drivers, too, but not XDP itself). >> >> > >> >> > When performing the XDP xmit, i40e uses the smp_processor_id() to >> >> > select the tx queue (see >> >> > https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846). >> >> > I'm not 100% clear on how the CPU is selected (since we don't use >> >> > cores 0 and 1), we end up on a core whose id is higher than any >> >> > available queue. >> >> > >> >> > I'm going to try to modify our IRQ mappings to test this. >> >> > >> >> > If I'm correct, this feels like a bug to me, since it requires a user >> >> > to understand low level driver details to do IRQ remapping, which is a >> >> > bit higher level. But if it's intended, we'll just have to figure out >> >> > how to work around this.
(Unfortunately, using split tx and rx queues >> >> > is not possible with i40e, so that easy solution is unavailable.) >> >> > >> >> > --Zvi >> > >> > Hey Zvi, sorry for the lack of assistance, there has been statutory free >> > time in Poland and today i'm in the birthday mode, but we managed to >> > discuss the issue with Magnus and we feel like we could have a solution >> > for that, more below. >> > >> >> >> >> >> >> It seems like for Intel drivers, igc, ixgbe, i40e, ice all have >> >> this problem. >> >> >> >> Notably, igb, fixes it like I would expect. >> > >> > igb is correct but I think that we would like to avoid the introduction of >> > locking for higher speed NICs in XDP data path. >> > >> > We talked with Magnus that for i40e and ice that have lots of HW >> > resources, we could always create the xdp_rings array of num_online_cpus() >> > size and use smp_processor_id() for accesses, regardless of the user's >> > changes to queue count. >> >> What is "lots"? Systems with hundreds of CPUs exist (and I seem to >> recall an issue with just such a system on Intel hardware(?)). Also, >> what if num_online_cpus() changes? > > "Lots" is 16k for ice. For i40e datasheet tells that it's only 1536 for > whole device, so I back off from the statement that i40e has a lot of > resources :) > > Also, s/num_online_cpus()/num_possible_cpus(). OK, even 1536 is more than I expected; I figured it would be way lower, which is why you were suggesting to use num_online_cpus() instead; but yeah, num_possible_cpus() is obviously better, then :) >> > This way the smp_processor_id() provides the serialization by itself as >> > we're under napi on a given cpu, so there's no need for locking >> > introduction - there is a per-cpu XDP ring provided. If we would stick to >> > the approach where you adjust the size of xdp_rings down to the shrinked >> > Rx queue count and use a smp_processor_id() % vsi->num_queue_pairs formula >> > then we could have a resource contention. 
Say that you did on a 16 core >> > system: >> > $ ethtool -L eth0 combined 2 >> > >> > and then mapped the q0 to cpu1 and q1 to cpu 11. Both queues will grab the >> > xdp_rings[1], so we would have to introduce the locking. >> > >> > Proposed approach would just result with more Tx queues packed onto Tx >> > ring container of queue vector. >> > >> > Thoughts? Any concerns? Should we have a 'fallback' mode if we would be >> > out of queues? >> >> Yes, please :) > > How to have a fallback (in drivers that need it) in a way that wouldn't > hurt the scenario where queue per cpu requirement is satisfied? Well, it should be possible to detect this at setup time, right? Not too familiar with the driver code, but would it be possible to statically dispatch to an entirely different code path if this happens? -Toke ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Dropped packets mapping IRQs for adjusted queue counts on i40e 2021-05-10 11:22 ` [Intel-wired-lan] " Toke Høiland-Jørgensen @ 2021-05-10 15:01 ` Jesper Dangaard Brouer -1 siblings, 0 replies; 18+ messages in thread From: Jesper Dangaard Brouer @ 2021-05-10 15:01 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: brouer, Maciej Fijalkowski, Jesse Brandeburg, Zvi Effron, T K Sourabh, Xdp, intel-wired-lan, netdev, magnus.karlsson, kuba, Tony Nguyen On Mon, 10 May 2021 13:22:54 +0200 Toke Høiland-Jørgensen <toke@redhat.com> wrote: > Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes: > > > On Thu, May 06, 2021 at 12:29:40PM +0200, Toke Høiland-Jørgensen wrote: > >> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes: > >> > >> > On Wed, May 05, 2021 at 01:01:28PM -0700, Jesse Brandeburg wrote: > >> >> Zvi Effron wrote: > >> >> > >> >> > On Tue, May 4, 2021 at 4:07 PM Zvi Effron <zeffron@riotgames.com> wrote: > >> >> > > I'm suspecting it's something with how XDP_REDIRECT is implemented in > >> >> > > the i40e driver, but I don't know if this is a) cross driver behavior, > >> >> > > b) expected behavior, or c) a bug. > >> >> > I think I've found the issue, and it appears to be specific to i40e > >> >> > (and maybe other drivers, too, but not XDP itself). > >> >> > > >> >> > When performing the XDP xmit, i40e uses the smp_processor_id() to > >> >> > select the tx queue (see > >> >> > https://elixir.bootlin.com/linux/v5.12.1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L3846). > >> >> > I'm not 100% clear on how the CPU is selected (since we don't use > >> >> > cores 0 and 1), we end up on a core whose id is higher than any > >> >> > available queue. > >> >> > > >> >> > I'm going to try to modify our IRQ mappings to test this.
> >> >> > > >> >> > If I'm correct, this feels like a bug to me, since it requires a user > >> >> > to understand low level driver details to do IRQ remapping, which is a > >> >> > bit higher level. But if it's intended, we'll just have to figure out > >> >> > how to work around this. (Unfortunately, using split tx and rx queues > >> >> > is not possible with i40e, so that easy solution is unavailable.) > >> >> > > >> >> > --Zvi > >> > > >> > Hey Zvi, sorry for the lack of assistance, there has been statutory free > >> > time in Poland and today i'm in the birthday mode, but we managed to > >> > discuss the issue with Magnus and we feel like we could have a solution > >> > for that, more below. > >> > > >> >> > >> >> > >> >> It seems like for Intel drivers, igc, ixgbe, i40e, ice all have > >> >> this problem. > >> >> > >> >> Notably, igb, fixes it like I would expect. > >> > > >> > igb is correct but I think that we would like to avoid the introduction of > >> > locking for higher speed NICs in XDP data path. > >> > > >> > We talked with Magnus that for i40e and ice that have lots of HW > >> > resources, we could always create the xdp_rings array of num_online_cpus() > >> > size and use smp_processor_id() for accesses, regardless of the user's > >> > changes to queue count. > >> > >> What is "lots"? Systems with hundreds of CPUs exist (and I seem to > >> recall an issue with just such a system on Intel hardware(?)). Also, > >> what if num_online_cpus() changes? > > > > "Lots" is 16k for ice. For i40e datasheet tells that it's only 1536 for > > whole device, so I back off from the statement that i40e has a lot of > > resources :) > > > > Also, s/num_online_cpus()/num_possible_cpus(). 
> > OK, even 1536 is more than I expected; I figured it would be way lower, > which is why you were suggesting to use num_online_cpus() instead; but > yeah, num_possible_cpus() is obviously better, then :) > > >> > This way the smp_processor_id() provides the serialization by itself as > >> > we're under napi on a given cpu, so there's no need for locking > >> > introduction - there is a per-cpu XDP ring provided. If we would stick to > >> > the approach where you adjust the size of xdp_rings down to the shrinked > >> > Rx queue count and use a smp_processor_id() % vsi->num_queue_pairs formula > >> > then we could have a resource contention. Say that you did on a 16 core > >> > system: > >> > $ ethtool -L eth0 combined 2 > >> > > >> > and then mapped the q0 to cpu1 and q1 to cpu 11. Both queues will grab the > >> > xdp_rings[1], so we would have to introduce the locking. > >> > > >> > Proposed approach would just result with more Tx queues packed onto Tx > >> > ring container of queue vector. > >> > > >> > Thoughts? Any concerns? Should we have a 'fallback' mode if we would be > >> > out of queues? > >> > >> Yes, please :) > > > > How to have a fallback (in drivers that need it) in a way that wouldn't > > hurt the scenario where queue per cpu requirement is satisfied? > > Well, it should be possible to detect this at setup time, right? Not too > familiar with the driver code, but would it be possible to statically > dispatch to an entirely different code path if this happens? The ndo_xdp_xmit call is a function pointer. Thus, if it happens at this level, then at setup time the driver can simply change the NDO to use a TX-queue-locked variant. I actually consider it a bug that i40e allows this misconfig to happen. The ixgbe driver solves the problem by rejecting XDP attach if the system has more CPUs than TXQs available. IMHO it is a better solution to add sharded/partitioned TXQ-locking when this situation happens, instead of denying XDP attach.
Since the original XDP-redirect work, the ndo_xdp_xmit call has gained bulking, so the locking will be amortized over the bulk. One question is how we inform the end-user that XDP will be using a slightly slower TXQ-locking scheme. Given we have no XDP features exposed, I suggest a simple kernel log message, which we already have for other XDP situations, e.g. when the MTU is too large or TSO is enabled. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2021-05-10 15:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-01  0:45 Dropped packets mapping IRQs for adjusted queue counts on i40e Zvi Effron
     [not found] ` <CADS2XXpjasmJKP__oHsrvv3EG8n-FjB6sqHwgQfh7QgeJ8GrrQ@mail.gmail.com>
2021-05-01 17:56 ` Zvi Effron
2021-05-02 20:16 ` T K Sourabh
2021-05-04 23:07 ` Zvi Effron
2021-05-05 18:55 ` Zvi Effron
2021-05-05 19:53 ` Zvi Effron
2021-05-05 19:55 ` David Ahern
2021-05-05 20:01 ` Jesse Brandeburg
2021-05-05 21:21 ` Maciej Fijalkowski
2021-05-05 21:21 ` [Intel-wired-lan] " Maciej Fijalkowski
2021-05-06 10:29 ` Toke Høiland-Jørgensen
2021-05-06 10:29 ` [Intel-wired-lan] " Toke Høiland-Jørgensen
2021-05-09 15:50 ` Maciej Fijalkowski
2021-05-09 15:50 ` [Intel-wired-lan] " Maciej Fijalkowski
2021-05-10 11:22 ` Toke Høiland-Jørgensen
2021-05-10 11:22 ` [Intel-wired-lan] " Toke Høiland-Jørgensen
2021-05-10 15:01 ` Jesper Dangaard Brouer
2021-05-10 15:01 ` [Intel-wired-lan] " Jesper Dangaard Brouer