From: Alasdair McWilliam <alasdair.mcwilliam@outlook.com>
To: Xdp <xdp-newbies@vger.kernel.org>
Subject: XSK + ZC, shared UMEM and multiple Fill/Completion queues - broken?
Date: Tue, 9 Aug 2022 14:25:47 +0000
Message-ID: <6205E10C-292E-4995-9D10-409649354226@outlook.com>

Hi list. This is my first post so be gentle. :-)

I’m developing a piece of software that uses XSK in zero-copy mode so we can pick packets up quickly, do some work on them, and then either transmit them back to the network or drop them. For the purposes of this mail, assume this involves pulling all traffic up into user space via XSK.

The software sits directly on top of libbpf/libxdp; it does not use higher-level abstractions.

Our current setup uses a multi-threaded user-space process. The process queries the system for the number of channels on a NIC (num_channels) and allocates enough UMEM to accommodate (num_channels * num_frames * frame_size). The UMEM is divided into a number of buckets before the process loads its eBPF program into the kernel and creates its worker threads.
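For illustration, here is a minimal sketch of that allocation step. The constants and helper names (NUM_FRAMES, alloc_umem_area, channel_region_offset) are placeholders for this mail, not the names in our code:

#include <sys/mman.h>
#include <linux/types.h>
#include <bpf/xsk.h>    /* XSK_UMEM__DEFAULT_FRAME_SIZE */

#define NUM_FRAMES  4096                           /* placeholder per-channel frame count */
#define FRAME_SIZE  XSK_UMEM__DEFAULT_FRAME_SIZE   /* 4096-byte frames, aligned mode */

/* Allocate one contiguous UMEM area large enough for all channels. */
static void *alloc_umem_area(unsigned int num_channels, size_t *len_out)
{
    size_t len = (size_t)num_channels * NUM_FRAMES * FRAME_SIZE;
    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (area == MAP_FAILED)
        return NULL;
    *len_out = len;
    return area;
}

/* Each worker thread later works within its own region ("bucket") of the area. */
static inline __u64 channel_region_offset(unsigned int channel)
{
    return (__u64)channel * NUM_FRAMES * FRAME_SIZE;
}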

There is one thread per channel, and each thread receives a number of UMEM buckets as well as its own AF_XDP socket to work on. Structurally, each XSK has its own UMEM FQ/CQ as well as TXQ/RXQ by virtue of the xsk_socket__create_shared() API, and RSS gives a nice distribution of packets across the NIC channels and worker threads.
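To make that structure concrete, here is a hedged sketch of the socket setup (error handling trimmed, bucket bookkeeping omitted; MAX_CHANNELS, ifname, umem_area, umem_len and num_channels are illustrative, while xsk_umem__create() and xsk_socket__create_shared() are the real libbpf/libxdp calls):

#include <linux/if_xdp.h>   /* XDP_ZEROCOPY */
#include <bpf/xsk.h>        /* use <xdp/xsk.h> instead when building against libxdp */

#define MAX_CHANNELS 64     /* placeholder upper bound */

static struct xsk_ring_prod fq[MAX_CHANNELS];    /* one fill ring per socket */
static struct xsk_ring_cons cq[MAX_CHANNELS];    /* one completion ring per socket */
static struct xsk_ring_cons rxq[MAX_CHANNELS];
static struct xsk_ring_prod txq[MAX_CHANNELS];
static struct xsk_socket   *xsks[MAX_CHANNELS];

static int setup_sockets(const char *ifname, void *umem_area, __u64 umem_len,
                         unsigned int num_channels)
{
    struct xsk_umem *umem;
    int ret;

    /* Register the UMEM once; the FQ/CQ passed here become socket 0's rings. */
    ret = xsk_umem__create(&umem, umem_area, umem_len, &fq[0], &cq[0], NULL);
    if (ret)
        return ret;

    for (unsigned int ch = 0; ch < num_channels; ch++) {
        struct xsk_socket_config cfg = {
            .rx_size    = XSK_RING_CONS__DEFAULT_NUM_DESCS,
            .tx_size    = XSK_RING_PROD__DEFAULT_NUM_DESCS,
            .bind_flags = XDP_ZEROCOPY,
        };

        /* Each socket binds to its own NIC queue (ch) and brings its own
         * fill/completion rings, even though the UMEM area is shared. */
        ret = xsk_socket__create_shared(&xsks[ch], ifname, ch, umem,
                                        &rxq[ch], &txq[ch],
                                        &fq[ch], &cq[ch], &cfg);
        if (ret)
            return ret;
    }
    return 0;
}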

We’ve had a lot of success scaling across multi-core servers with Intel E800 cards, with synthetic tests reaching 20-30 Mpps. Over the last few months we’ve also run the software in a production network against customer workloads, forwarding gigabits of legitimate traffic across an array of different workloads with no impact on customer traffic flows. To date, we’ve been quite confident in the mechanics of the packet forwarding pipeline implemented with XSK.

But we’ve hit a snag. Everything works great up to Linux 5.15; from 5.16 onwards it’s quite broken. To summarise the behaviour on 5.16 and later, the main issues are:

* Channel 0 receives traffic but channels 1+ may not. (In this case, channel 0 tends to receive the right amount of traffic, e.g. with 4 channels and RSS, channel 0 sees roughly 1/4 of the total ingress.)

* Channels can stall. Superficially it looks like each one only processes frames up to the number of descriptors initially pushed onto its FQ and then stops (see the receive-loop sketch after this list).

* eBPF programs running for frames via channel 0 work as expected. That is, if one is parsing layer 3 and 4 headers to identify certain traffic types, headers are where you would expect them to be in memory. However, this isn’t true for frames via channel 1+; headers don’t appear at the right position relative to the data pointer in the eBPF program. It could be that there’s actually nothing in the descriptor, but the software experiences it as parser errors, because we can’t decode the IP frames properly.
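Regarding the stall in the second bullet, this is roughly the per-thread receive path we expect to keep each socket’s own FQ replenished. It is a simplified sketch (batch size, need_wakeup handling and TX are elided, aligned mode assumed; rx_poll_once is a hypothetical helper, while the ring accessors are the libbpf xsk.h API):

#include <bpf/xsk.h>
#include <linux/if_xdp.h>   /* struct xdp_desc */

/* Sketch of one worker thread's RX path: consume RX descriptors, act on
 * the frames, then hand the buffers straight back to this socket's own
 * fill ring so the NIC queue never runs dry. */
static void rx_poll_once(struct xsk_ring_cons *rx, struct xsk_ring_prod *fill)
{
    __u32 idx_rx = 0, idx_fq = 0;
    unsigned int rcvd, i;

    rcvd = xsk_ring_cons__peek(rx, 64, &idx_rx);
    if (!rcvd)
        return;

    /* Reserve the same number of fill slots; if this refill were skipped,
     * the channel would stall once the descriptors initially pushed onto
     * the FQ had all been consumed - which is what the symptom above
     * looks like. */
    while (xsk_ring_prod__reserve(fill, rcvd, &idx_fq) != rcvd)
        ; /* real code would poll()/recvmsg() here to kick the kernel */

    for (i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx++);

        /* ... parse / forward / drop the frame at desc->addr here ... */

        *xsk_ring_prod__fill_addr(fill, idx_fq++) = desc->addr;
    }

    xsk_ring_prod__submit(fill, rcvd);
    xsk_ring_cons__release(rx, rcvd);
}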

We’ve been debugging this for some time and concluded the best way forward was to take our software out of the equation and use xdpsock from the kernel tree. In doing so, we realised that while xdpsock does test shared UMEM, it is still single-threaded and maintains a single FQ/CQ despite opening 8 XSK sockets.

To move forward and validate multiple FQ/CQ via the xsk_socket__create_shared() API, we’ve tweaked the xdpsock app to: scale the UMEM allocation by num_channels and split it into num_channels regions (by offset); open multiple XSK sockets bound to num_channels NIC channels; insert the XSK FDs into an XSK map indexed by channel number; and change xdpsock_kern to look up the RX channel for the redirect, instead of the round-robin approach in the original sample. On the whole, surprisingly, we *think* we can reproduce the issues.
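For reference, the kernel-side change boils down to something like the following (a sketch of the idea, not the exact code in the repo; xsks_map and xdp_sock_prog are placeholder names):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* XSKMAP indexed by NIC channel/queue number; user space inserts each
 * socket's FD at the index of the queue it is bound to. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);          /* placeholder upper bound */
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    /* Redirect to the socket bound to the same queue the frame arrived
     * on, rather than round-robining across sockets. Falls back to
     * XDP_PASS if there is no socket at that index. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";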

We need to be a bit more scientific about our testing, but I wanted to ask: has anyone else seen odd behaviour with XSK using shared UMEM and multiple fill/completion queues on kernel 5.16 and above?

We were under the impression that multiple FQ/CQ is a supported configuration - it worked perfectly on 5.15. Is this support actually going away, meaning we need to re-think our approach?

In all test cases we’ve been on x86_64 (Xeon E5s or Xeon Platinum), with Intel E810 or Mellanox ConnectX-4 cards, tested on a range of kernels up to 5.19-rc4. In all cases we’re using aligned memory mode and the l2fwd behaviour of xdpsock.

In tracing back kernel commits we have actually found where the problems start. ice breaks from commit 57f7f8b6bc0bc80d94443f94fe5f21f266499a2b ("ice: Use xdp_buf instead of rx_buf for xsk zero-copy") [1], and testing suggests mlx5 breaks from commit 94033cd8e73b8632bab7c8b7bb54caa4f5616db7 ("xsk: Optimize for aligned case") [2]. I appreciate mlx5 doesn’t support XSK ZC + RSS, but there are ways we can test multiple queues with some flow steering, and we see the same behaviour.

We’ve just published our modified xdpsock code in our open source repository [3], because we noticed the xdpsock sample was removed from the kernel tree a while ago. Our modifications are enabled/disabled at compile time, so it’s explicit where we’ve changed the logic in xdpsock. The repo is available for peer review in case there are issues in how we’ve approached testing.

Any and all feedback welcomed/appreciated - we’re a bit stumped!

Thanks
Alasdair

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=57f7f8b6bc0bc80d94443f94fe5f21f266499a2b

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=94033cd8e73b8632bab7c8b7bb54caa4f5616db7

[3] https://github.com/OpenSource-THG/xdpsock-sample


