xdp-newbies.vger.kernel.org archive mirror
* XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
@ 2022-08-09 14:25 Alasdair McWilliam
  2022-08-09 14:43 ` Magnus Karlsson
  0 siblings, 1 reply; 7+ messages in thread
From: Alasdair McWilliam @ 2022-08-09 14:25 UTC (permalink / raw)
  To: Xdp

Hi list. This is my first post so be gentle. :-)

I’m developing a piece of software that uses XSK in zero-copy mode so we can pick up packets fast, do some work on them, then either transmit them back onto the network or drop them. For the sake of this mail, we can say this involves pulling all traffic up into user space via XSK.

The software sits directly on top of libbpf/libxdp; it does not use higher-level abstractions.

Our current setup uses a multi-threaded user-space process. The process queries the system for the number of channels on a NIC (num_channels) and allocates enough UMEM to accommodate (num_channels * num_frames * frame_size). The UMEM is divided into a number of buckets, after which the process loads its eBPF program into the kernel and creates its worker threads.

There is one thread per channel, and each thread receives a number of UMEM buckets as well as its own AF_XDP socket to work on. Structurally, each XSK has its own UMEM FQ/CQ as well as TXQ/RXQ by virtue of the xsk_socket__create_shared() API, and RSS gives a nice distribution of packets across the NIC channels and worker threads.
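
For reference, the shape of the setup is roughly as follows. This is a minimal sketch against the xsk_umem__create() / xsk_socket__create_shared() API from libxdp's xsk.h; NUM_FRAMES, FRAME_SIZE, the struct and function names and the error handling are illustrative rather than our actual sources:

#include <stdlib.h>
#include <sys/mman.h>
#include <xdp/xsk.h>            /* <bpf/xsk.h> with older libbpf */

#define NUM_FRAMES  4096
#define FRAME_SIZE  XSK_UMEM__DEFAULT_FRAME_SIZE

struct channel_ctx {
        struct xsk_socket *xsk;
        struct xsk_ring_prod fq;   /* this channel's fill queue */
        struct xsk_ring_cons cq;   /* this channel's completion queue */
        struct xsk_ring_cons rx;
        struct xsk_ring_prod tx;
};

static struct xsk_umem *setup_channels(const char *ifname, int num_channels,
                                       struct channel_ctx *ch)
{
        size_t size = (size_t)num_channels * NUM_FRAMES * FRAME_SIZE;
        struct xsk_umem *umem;
        void *buf;

        /* One UMEM region, large enough for every channel's bucket. */
        buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                exit(1);

        /* Register the UMEM; the FQ/CQ passed here become channel 0's rings. */
        if (xsk_umem__create(&umem, buf, size, &ch[0].fq, &ch[0].cq, NULL))
                exit(1);

        /* One socket per NIC channel, bound to its own queue id. Channel 0
         * reuses the rings above; channels 1+ get their own FQ/CQ via the
         * shared-UMEM path. */
        for (int i = 0; i < num_channels; i++)
                if (xsk_socket__create_shared(&ch[i].xsk, ifname, i, umem,
                                              &ch[i].rx, &ch[i].tx,
                                              &ch[i].fq, &ch[i].cq, NULL))
                        exit(1);

        return umem;
}

Each worker thread then owns one channel_ctx together with its bucket of UMEM frame addresses (offset by channel index), and replenishes its own FQ before entering the RX/TX loop.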

We’ve had a lot of success scaling across multi-core servers with Intel E800 cards, reaching 20-30 Mpps in synthetic tests. Over the last few months we’ve also inserted the software into a production network for test runs with customer workloads, where it forwards gigabits of legitimate traffic across an array of different workloads with no impact on the user experience of customer traffic flows. To date we’ve been quite confident in the mechanics of the packet-forwarding pipeline implemented with XSK.

But we’ve hit a snag. Everything worked great up to Linux 5.15, but from 5.16 onwards it’s quite broken. To summarise the behaviour on 5.16 and later, the main issues are:

* Channel 0 receives traffic but channel 1+ may not. (In this case channel 0 tends to receive the right amount of traffic; e.g. with 4 channels and RSS, channel 0 sees a quarter of the total ingress.)

* Channels can stall. Superficially it looks like they only process frames up to the number of descriptors initially pushed onto the FQ, and then stop.

* eBPF programs running for frames via channel 0 work as expected. That is, if one is parsing layer 3 and 4 headers to identify certain traffic types, the headers are where you would expect them to be in memory. However, this isn’t true for frames via channel 1+; headers don’t seem to be at the right position relative to the data pointer in the eBPF program. It could be that there’s actually nothing in the descriptor, but the software experiences this as parser errors because we can’t decode the IP frames properly.

We’ve been debugging this for some time and concluded the best approach was to take our software out of the equation and use xdpsock from the kernel tree. In doing so, we realised that while xdpsock does test shared UMEM, it is still single-threaded and maintains a single FQ/CQ despite opening eight XSK sockets.

To move forward and validate multiple FQ/CQ via the xsk_socket__create_shared() API, we’ve tweaked the xdpsock app to scale the UMEM allocation by num_channels and split it into num_channels regions (by offset), open multiple XSK sockets bound to num_channels NIC channels, insert the XSK FDs into an XSK map indexed by channel number, and change xdpsock_kern to look up the RX channel for the redirect instead of the round-robin approach in the original sample. On the whole, surprisingly, we *think* we can reproduce the issues.
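
For clarity, the kernel side of that change looks roughly like the sketch below: an XSKMAP keyed on the RX queue index, in the spirit of the description above. The map size, section name and program name are illustrative, not the actual xdpsock_multi sources:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
        __uint(max_entries, 64);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
        /* Key the redirect on the receive queue; the flags argument is the
         * fallback action when no socket is registered for this queue. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_DROP);
}

char _license[] SEC("license") = "GPL";

User space then calls bpf_map_update_elem(map_fd, &channel, &xsk_fd, 0) for each socket, with xsk_fd taken from xsk_socket__fd(), so the RX queue a frame arrives on determines which XSK - and therefore which FQ/CQ - services it.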

We need to be a bit more scientific about our testing, but I wanted to know if anyone else has seen odd behaviour with XSK using shared UMEM and multiple fill/completion queues on kernel 5.16 and above?

We were under the impression that multi-FQ/CQ is a supported configuration - it worked perfectly in 5.15. Is this something that is actually going away, and we need to re-think our approach?

In all test cases we’ve been on x86_64 (Xeon E5s or Xeon Platinum), with E810 or Mellanox ConnectX-4 cards, tested on a range of different kernels up to 5.19-rc4. In all cases we’re using aligned memory mode and the L2fwd behaviour of xdpsock.

In tracing back kernel commits we have found where the problems start to occur. ice breaks from commit 57f7f8b6bc0bc80d94443f94fe5f21f266499a2b ("ice: Use xdp_buf instead of rx_buf for xsk zero-copy") [1], and testing suggests mlx5 is broken from commit 94033cd8e73b8632bab7c8b7bb54caa4f5616db7 ("xsk: Optimize for aligned case") [2]. I appreciate mlx5 doesn’t support XSK ZC + RSS, but there are ways we can test multiple queues with some flow steering, and we see the same behaviour.

We’ve just published our modified xdpsock code in our open-source repository [3], because we noticed the xdpsock sample was removed from the tree a while ago. Our modifications are enabled/disabled at compile time because we wanted to be explicit about where we’ve changed the xdpsock logic. The repo is available for peer review in case there are issues in how we’ve approached testing.

Any and all feedback welcomed/appreciated - we’re a bit stumped!

Thanks
Alasdair

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=57f7f8b6bc0bc80d94443f94fe5f21f266499a2b

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=94033cd8e73b8632bab7c8b7bb54caa4f5616db7

[3] https://github.com/OpenSource-THG/xdpsock-sample



* Re: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
  2022-08-09 14:25 XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken? Alasdair McWilliam
@ 2022-08-09 14:43 ` Magnus Karlsson
  2022-08-09 14:50   ` Alasdair McWilliam
  0 siblings, 1 reply; 7+ messages in thread
From: Magnus Karlsson @ 2022-08-09 14:43 UTC (permalink / raw)
  To: Alasdair McWilliam; +Cc: Xdp

On Tue, Aug 9, 2022 at 4:27 PM Alasdair McWilliam
<alasdair.mcwilliam@outlook.com> wrote:
>
> [...]
>
> We were under the impression that multi-FQ/CQ is a supported configuration - it worked perfectly in 5.15. Is this something that is actually going away, and we need to re-think our approach?

It is not supposed to go away ever, so this is most likely a bug.
Thank you for reporting it and posting a program I can use to
reproduce it. I will get back when I have reproduced it, or failed to.
But let us hope it is the former.

BTW, there is one more person/company that has reported a similar
issue as you are stating, so it is likely real. But in that case, we
were not able to reproduce it on our end.

/Magnus


* Re: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
  2022-08-09 14:43 ` Magnus Karlsson
@ 2022-08-09 14:50   ` Alasdair McWilliam
  2022-08-09 14:58     ` Magnus Karlsson
  0 siblings, 1 reply; 7+ messages in thread
From: Alasdair McWilliam @ 2022-08-09 14:50 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Xdp

Hi Magnus,

We’ve actually reported it via our hardware vendor team as well. You might find my name’s on the other end of that bug report!

The repo went live this morning so I just posted it here for further visibility/review really. If we can help in any way, hit me up!

Many thanks
Alasdair


> On 9 Aug 2022, at 15:43, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> 
> [...]
> 
> It is not supposed to go away ever, so this is most likely a bug.
> Thank you for reporting it and posting a program I can use to
> reproduce it. I will get back when I have reproduced it, or failed to.
> But let us hope it is the former.
> 
> BTW, there is one more person/company that has reported a similar
> issue as you are stating, so it is likely real. But in that case, we
> were not able to reproduce it on our end.
> 
> /Magnus

* Re: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
  2022-08-09 14:50   ` Alasdair McWilliam
@ 2022-08-09 14:58     ` Magnus Karlsson
  2022-08-09 15:12       ` Alasdair McWilliam
  0 siblings, 1 reply; 7+ messages in thread
From: Magnus Karlsson @ 2022-08-09 14:58 UTC (permalink / raw)
  To: Alasdair McWilliam; +Cc: Xdp

On Tue, Aug 9, 2022 at 4:50 PM Alasdair McWilliam
<alasdair.mcwilliam@outlook.com> wrote:
>
> Hi Magnus,
>
> We’ve actually reported it via our hardware vendor team as well. You might find my name’s on the other end of that bug report!
>
> The repo went live this morning so I just posted it here for further visibility/review really. If we can help in any way, hit me up!

I can reach it thanks. Could you please send me the command line you
use to trigger the problem so I can try out exactly that on my system?
And I guess it is the "multi FCQ" build that breaks.


* Re: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
  2022-08-09 14:58     ` Magnus Karlsson
@ 2022-08-09 15:12       ` Alasdair McWilliam
  2022-08-10 13:16         ` Magnus Karlsson
  0 siblings, 1 reply; 7+ messages in thread
From: Alasdair McWilliam @ 2022-08-09 15:12 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Xdp



> On 9 Aug 2022, at 15:58, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> 
> 
> I can reach it thanks. Could you please send me the command line you
> use to trigger the problem so I can try out exactly that on my system?
> And I guess it is the "multi FCQ" build that breaks.

Certainly:

./xdpsock_multi --l2fwd --interface ice0 --queue 0 --channels 4 --poll --busy-poll --zero-copy

Hardware info:

# ethtool -i ice0
driver: ice
version: 5.18.10-1.el8.elrepo.x86_64
firmware-version: 2.50 0x800077a8 1.2960.0
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

# lspci -s 03:00.0 | grep -i E810
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

# ethtool -g ice0
Ring parameters for ice0:
Pre-set maximums:
RX:		8160
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8160
Current hardware settings:
RX:		4096
RX Mini:	n/a
RX Jumbo:	n/a
TX:		4096

# ethtool -l ice0
Channel parameters for ice0:
Pre-set maximums:
RX:		16
TX:		16
Other:		1
Combined:	16
Current hardware settings:
RX:		0
TX:		0
Other:		1
Combined:	4



* Re: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
  2022-08-09 15:12       ` Alasdair McWilliam
@ 2022-08-10 13:16         ` Magnus Karlsson
  2022-08-10 14:06           ` Alasdair McWilliam
  0 siblings, 1 reply; 7+ messages in thread
From: Magnus Karlsson @ 2022-08-10 13:16 UTC (permalink / raw)
  To: Alasdair McWilliam, Fijalkowski, Maciej; +Cc: Xdp

On Tue, Aug 9, 2022 at 5:12 PM Alasdair McWilliam
<alasdair.mcwilliam@outlook.com> wrote:
>
>
>
> > On 9 Aug 2022, at 15:58, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> >
> >
> > I can reach it thanks. Could you please send me the command line you
> > use to trigger the problem so I can try out exactly that on my system?
> > And I guess it is the "multi FCQ" build that breaks.
>
> Certainly:
>
> ./xdpsock_multi --l2fwd --interface ice0 --queue 0 --channels 4 --poll --busy-poll --zero-copy
>
> Hardware info:
>
> [...]

OK, I believe I have found the problem and have produced a fix for it.
As usual, it is something simple and stupid, sigh. I will post a patch
here tomorrow. Would you be able to test it and see if it fixes the
problem with corrupted packets in your app? If it does, then I will
post the patch on the netdev mailing list. Just note that even if this
fixes this, there is still your RSS problem. Maciej is looking into
that. You were also reporting stalls, which we should examine if they
still occur after fixing the corrupted packets and the RSS problem.

/Magnus


* Re: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
  2022-08-10 13:16         ` Magnus Karlsson
@ 2022-08-10 14:06           ` Alasdair McWilliam
  0 siblings, 0 replies; 7+ messages in thread
From: Alasdair McWilliam @ 2022-08-10 14:06 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: Fijalkowski, Maciej, Xdp


> 
> OK, I believe I have found the problem and have produced a fix for it.
> As usual, it is something simple and stupid, sigh. I will post a patch
> here tomorrow. Would you be able to test it and see if it fixes the
> problem with corrupted packets in your app? If it does, then I will
> post the patch on the netdev mailing list. Just note that even if this
> fixes this, there is still your RSS problem. Maciej is looking into
> that. You were also reporting stalls, which we should examine if they
> still occur after fixing the corrupted packets and the RSS problem.
> 
> /Magnus

That’s awesome - thank you! Happy to test - just fire it over.

We’ve observed “differing” behaviour, and we did wonder if there were two overlapping issues. Unfortunately we could not identify a specific order of operations that triggers behaviour A (e.g. queue 1+ not working) versus behaviour B (e.g. queue stalls). Have you observed alternating behaviour with xdpsock_multi?

Kind regards
Alasdair




Thread overview: 7 messages
2022-08-09 14:25 XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken? Alasdair McWilliam
2022-08-09 14:43 ` Magnus Karlsson
2022-08-09 14:50   ` Alasdair McWilliam
2022-08-09 14:58     ` Magnus Karlsson
2022-08-09 15:12       ` Alasdair McWilliam
2022-08-10 13:16         ` Magnus Karlsson
2022-08-10 14:06           ` Alasdair McWilliam
