From: Björn Töpel
Subject: Re: [RFC feedback] AF_XDP and non-Intel hardware
Date: Tue, 22 May 2018 14:09:46 +0200
To: Mykyta Iziumtsev
Cc: Netdev, Björn Töpel, "Karlsson, Magnus", "Zhang, Qi Z", Francois Ozog, Ilias Apalodimas, Brian Brooks, Jesper Dangaard Brouer, Andy Gospodarek, michael.chan@broadcom.com, Luke Gorrie

2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev:
> On 21 May 2018 at 20:55, Björn Töpel wrote:
>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev:
>>> Hi Björn and Magnus,
>>>
>>> (This thread is a follow-up to a private dialogue. The intention is to
>>> let the community know that AF_XDP can be enhanced further to make it
>>> compatible with a wider range of NIC vendors.)
>>>
>>
>> Mykyta, thanks for doing the write-up and sending it to the netdev
>> list! The timing could not be better -- we need to settle on a uapi
>> that works for all vendors prior to enabling it in the kernel.
>>
>>> There are two NIC variations which don't fit well with the current
>>> AF_XDP proposal.
>>>
>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
>>> expects the NIC to put incoming frames into slots defined by the UMEM
>>> area. Here the slot size is set in XDP_UMEM_REG via
>>> xdp_umem_reg.frame_size, and the slots available to the NIC are
>>> communicated to the kernel via the UMEM fill queue. While Intel NICs
>>> support only one slot size, NXP and Cavium support multiple slot sizes
>>> to optimize memory usage (e.g. { 128, 512, 2048 }, please refer to [1]
>>> for the rationale). On frame reception the NIC pulls a slot from the
>>> best-fit pool based on the frame size.
>>>
>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
>>> NICs. As shown above, AF_XDP expects the NIC to put incoming frames at
>>> predefined addresses. This is not the case here, as the NIC is in
>>> charge of placing frames in memory based on its own algorithm. For
>>> example, Chelsio T5/T6 expects to get whole pages from the driver and
>>> puts incoming frames on the page in a 'tape recorder' fashion.
>>> Assuming 'base' is the page address and frame_len[N] is an array of
>>> frame lengths, the frame placement in memory will look like this:
>>>
>>> base + 0
>>> base + frame_len[0]
>>> base + frame_len[0] + frame_len[1]
>>> base + frame_len[0] + frame_len[1] + frame_len[2]
>>> ...
>>>
>>> To better support these two NIC variations I suggest abandoning the
>>> 'frame size' structuring in UMEM and sticking with 'pages' instead.
>>> The NIC kernel driver is then responsible for splitting the provided
>>> pages into the slots expected by the underlying HW (or not splitting
>>> at all in the case of Chelsio/Netcope).
>>>
>>
>> Let's call the first variation "multi-pool" and the second
>> "type-writer" for simplicity. The type-writer model is very
>> interesting, and makes a lot of sense when the PCIe bus is a
>> constraint.
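To make the type-writer placement concrete, here is a minimal C sketch
(all names are made up for illustration and not tied to any driver) of
how the per-frame offsets fall out of the lengths the HW reports: the
first frame sits at the page base and each subsequent frame starts
right where the previous one ended.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: given a page handed to a 'type-writer' NIC and
 * the per-frame lengths it reported, compute each frame's offset
 * within the UMEM area. */
static void typewriter_frame_offsets(uint64_t page_off,
                                     const uint16_t *frame_len,
                                     size_t n_frames,
                                     uint64_t *offsets)
{
        uint64_t cur = page_off;        /* first frame at base + 0 */
        size_t i;

        for (i = 0; i < n_frames; i++) {
                offsets[i] = cur;
                cur += frame_len[i];    /* next frame follows directly */
        }
}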
>>> On XDP_UMEM_REG the application needs to specify page_size. Then the
>>> application can pass empty pages to the kernel driver using the UMEM
>>> 'fill' queue by specifying the page offset within the UMEM area. The
>>> xdp_desc format needs to be changed as well: the frame location will
>>> be defined by an offset from the beginning of the UMEM area instead of
>>> a frame index. As the payload headroom can vary with AF_XDP, we'll
>>> need to specify it in xdp_desc as well. Besides that, it could be
>>> worth considering changing the payload length to u16, as 64k+ frames
>>> aren't very common in networking. The resulting xdp_desc would look
>>> like this:
>>>
>>> struct xdp_desc {
>>>         __u64 offset;
>>>         __u16 headroom;
>>>         __u16 len;
>>>         __u8 flags;
>>>         __u8 padding[3];
>>> };
>>>
>>
>> Let's expand a bit here: instead of passing indices of fixed-size
>> frames to the fill ring, we now pass a memory region. For simplicity,
>> let's say a page. The fill ring descriptor requires offset and len.
>> The offset is a global offset from a UMEM perspective, and len is the
>> size of the region.
>>
>
> I would rather stick with a region equal to a page (regular or huge
> page, defined by the application). The page size can be extracted from
> the vm_area_struct in XDP_UMEM_REG (preferred) or configured by the
> application.
>

OK, thinking more about it, I prefer this as well. This means that we
only need to grow the UMEM fring/cring descriptors to u64, and not care
about the length. As you state below, this makes the validation simple.

We might consider exposing a "page size hint" that the user can set.
For the 4G hugepage scenario, it might make sense to have a chunk
*less* than 4G to avoid HW Rx memory running low when the end of a
chunk is approaching. As for THP, I need to think about the proper
behavior here.

>> On the Rx ring the descriptor, as you wrote, must be changed as well,
>> per your suggestion above. Note that headroom is still needed, since
>> XDP can change the size of a packet, so the fixed headroom stated at
>> UMEM registration is not sufficient.
>>
>> This model is obviously more flexible, but then also a bit more
>> complex. E.g. for a fixed-frame NIC (like ixgbe), what should the
>> behaviour be? Should the fill ring entry be used for only *one* frame,
>> or chopped up into multiple receive frames? Should it be configurable?
>> Driver specific?
>
> I think driver-specific makes most sense here. In the case of a
> fixed-frame NIC the driver shall chop the ring entry into multiple
> receive frames.
>

Let's start there, keeping the configuration space small.

>>
>> Also, validating the entries in the fill queue requires more work
>> (compared to the current index variant). Currently, we only skip
>> invalid indices. What should we do when, say, you pass a memory window
>> that is too narrow (say 128B) but correct from a UMEM perspective?
>> Going this path, we need to add pretty hard constraints so we don't
>> end up with too complex code -- because then it'll be too slow.
>
> If we stick with pages -- the only possible erroneous input will be
> 'page out of UMEM boundaries'. The validation will essentially be:
>
>         if ((offset > umem->size) || (offset & (umem->page_size - 1)))
>                 fail;
>
> The question is what shall be done if validation fails? Would SEGFAULT
> be reasonable? This is more or less equivalent to dereferencing an
> invalid pointer.
>

The current scheme is simply dropping, i.e. the kernel skips the
invalid fill ring entry. SIGSEGV is an interesting idea!
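To spell that check out, here is a minimal, self-contained sketch of
what page-granular fill-ring validation could look like. The struct and
function names are hypothetical and only meant to illustrate the idea;
this is not the actual kernel code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of a registered UMEM: total length and the page
 * size declared (or derived) at XDP_UMEM_REG time. page_size is
 * assumed to be a power of two. */
struct umem_props {
        uint64_t size;       /* total UMEM length in bytes */
        uint64_t page_size;  /* granularity of fill ring entries */
};

/* A fill ring entry is just a u64 offset into the UMEM area. It is
 * valid if it lies inside the UMEM and is page aligned. */
static bool fill_entry_valid(const struct umem_props *umem,
                             uint64_t offset)
{
        if (offset >= umem->size)            /* out of UMEM boundaries */
                return false;
        if (offset & (umem->page_size - 1))  /* not page aligned */
                return false;
        return true;
}

With the current scheme an entry failing this check would simply be
skipped by the kernel; signalling the process instead is the open
question above.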
>>
>>> In the current proposal you have a notion of 'frame ownership': 'owned
>>> by the kernel' or 'owned by the application'. The ownership is
>>> transferred by means of enqueueing the frame index in the UMEM 'fill'
>>> queue (from application to kernel) or in the UMEM 'tx completion'
>>> queue (from kernel to application). If you decide to adopt the 'page'
>>> approach, this notion needs to be changed a bit. This is because, in
>>> the case of packet forwarding, one and the same page can be used for
>>> RX (parts of it enqueued in HW 'free lists') and TX (forwarding of
>>> previously RXed packets).
>>>
>>> I propose to define 'ownership' as the right to manipulate the
>>> partitioning of the page into frames. Whenever the application passes
>>> a page to the kernel via the UMEM 'fill' queue -- the ownership is
>>> transferred to the kernel. The application can't allocate packets on
>>> this page until the kernel is done with it, but it can change the
>>> payload of RXed packets before forwarding them. The kernel can pass
>>> ownership back by means of an 'end-of-page' flag in xdp_desc.flags.
>>>
>>
>> I like the end-of-page mechanism.
>>
>>> The pages are definitely supposed to be recycled sooner or later. Even
>>> if it's not part of the kernel API and the housekeeping implementation
>>> resides completely in the application, I would still like to propose a
>>> possible (hopefully cost-efficient) solution to that. The recycling
>>> could be achieved by keeping a refcount on pages and recycling a page
>>> only when it's owned by the application and its refcount reaches 0.
>>>
>>> Whenever the application transfers page ownership to the kernel, the
>>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
>>> corresponding page needs to be identified (xdp_desc.offset >>
>>> PAGE_SHIFT) and its refcount incremented. When the packet gets freed,
>>> the refcount shall be decremented. If a packet is forwarded in a TX
>>> xdp_desc -- the refcount gets decremented only on TX completion
>>> (again, tx_completion.desc >> PAGE_SHIFT). For packets originating
>>> from the application itself, the payload buffers need to be allocated
>>> from an empty page owned by the application, and the refcount needs to
>>> be incremented as well.
>>>
>>
>> Strictly speaking, we're not enforcing correctness in the current
>> solution. If the userspace application passes index 1 multiple times
>> to the fill ring, and at the same time sends index 1, things will
>> break. So, with the existing solution the userspace application
>> *still* needs to track the frames. With this new model, the
>> tracking/refcounting will be more complex. That might be a concern.
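For reference, a minimal user-space sketch of the per-page refcount
bookkeeping described above (all names are hypothetical; this only
illustrates the tracking, it is not a proposed API):

#include <stdint.h>
#include <stdbool.h>

#define EXAMPLE_PAGE_SHIFT 12   /* assuming 4K pages for the example */

/* Per-page state kept by the application: who may currently
 * re-partition the page, and how many in-flight frames still
 * reference it. */
struct page_state {
        bool kernel_owned;      /* page posted to the fill queue */
        uint32_t refcount;      /* frames still referencing the page */
};

static inline uint64_t page_of(uint64_t offset)
{
        return offset >> EXAMPLE_PAGE_SHIFT;
}

/* RX: a frame arrived on this page. */
static void on_rx(struct page_state *pages, uint64_t offset)
{
        pages[page_of(offset)].refcount++;
}

/* The kernel set the end-of-page flag: ownership returns to the
 * application. (A complete implementation would also check for an
 * already idle page here.) */
static void on_end_of_page(struct page_state *pages, uint64_t offset)
{
        pages[page_of(offset)].kernel_owned = false;
}

/* A frame was freed, or a TX completion was seen for a forwarded
 * frame. The page can be recycled once it is application-owned and
 * idle. */
static void on_frame_done(struct page_state *pages, uint64_t offset)
{
        struct page_state *ps = &pages[page_of(offset)];

        if (--ps->refcount == 0 && !ps->kernel_owned) {
                /* Recycle: e.g. re-post to the fill queue, which hands
                 * ownership back to the kernel with refcount 0. */
        }
}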
>>
>> For the multi-pool NICs I think we can still just have one UMEM, and
>> let the driver decide in which pool to place a given chunk of memory.
>> Thoughts?
>
> Definitely agree with that. This is HW specific and exposing it to the
> application would only harm portability.
>

Good stuff, we're on the same page then.

>>
>> Now, how do we go forward? I think this is very important, and I will
>> hack a copy-mode version for this new API. I'm a bit worried that the
>> complexity/configuration space will grow and impact performance, but
>> let's see.
>>
>> To prove that the API works for the NICs you mentioned, we need an
>> actual zero-copy implementation for them. Do you think Linaro could
>> work on a zero-copy variant for any of the NICs above?
>>
>
> Linaro will definitely contribute a zero-copy implementation for some
> ARM-based NICs with the 'multi-pool' variation.

Very nice!

> Implementation of the 'type-writer' variation is up to Chelsio/Netcope;
> we only try to come up with an API which (most probably) will fit them
> as well.

Let's hope we get an implementation from these vendors as well! :-)


Björn

>>
>> Again thanks for bringing this up!
>> Björn
>>
>>
>>
>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
>>>
>>> With best regards,
>>> Mykyta