From: "Björn Töpel" <bjorn.topel@gmail.com>
To: Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>
Cc: Netdev <netdev@vger.kernel.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Zhang, Qi Z" <qi.z.zhang@intel.com>,
	"Francois Ozog" <francois.ozog@linaro.org>,
	"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
	"Brian Brooks" <brian.brooks@linaro.org>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	"Andy Gospodarek" <andy@greyhouse.net>,
	michael.chan@broadcom.com, "Luke Gorrie" <luke@snabb.co>
Subject: Re: [RFC feedback] AF_XDP and non-Intel hardware
Date: Mon, 4 Jun 2018 17:43:41 +0200	[thread overview]
Message-ID: <CAJ+HfNh3kb_pruXWk8ZY7U1f=Wt_oXz5djW1yujknb88rA=0ww@mail.gmail.com> (raw)
In-Reply-To: <CAPvmOuEeag+R9sa_yGRy2ZVpmcofoa7Ey6y1To3qi=QV4BV6ww@mail.gmail.com>

On Mon, 4 Jun 2018 at 11:44, Mykyta Iziumtsev
<mykyta.iziumtsev@linaro.org> wrote:
>
> Hi Björn and Magnus,
>
> Here is a short update on the proposed changes to AF_XDP.
>
> During the implementation phase I figured out that 'end-of-page' in the RX
> descriptor's flags won't work, unfortunately. This is because the
> kernel driver can't know whether the frame will be the last one on the
> page when the RX descriptor is being filled in. Imagine that the driver is
> processing a frame on a page, and there are still 1000 bytes of unused
> memory beyond this frame. If the next frame (which hasn't yet arrived) is
> smaller than that -- the NIC can place it on the same page, otherwise the
> NIC will skip the remaining 1000 bytes and take the next page. Of course,
> the driver can update the RX descriptor retrospectively, or use 'empty' RX
> descriptors just to indicate page completion, but that is too complex or
> inefficient and as such doesn't look good.
>

I think both "update RX descriptor retrospectively" and "zero-length
end-of-chunk descriptor" are OK (as long as the zero-length descs are much
less common than the non-zero ones). Out of curiosity, do Chelsio (or
any other type-writer based) NICs have a public data sheet? It would be
interesting to see how this is solved in practice.
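
For reference, a minimal userspace sketch of how such a zero-length
end-of-chunk descriptor could be consumed -- all helper names and struct
fields below are made up for illustration and not part of any uapi:

static void rx_poll(struct xdp_queue *rxq, struct umem *umem)
{
        struct xdp_desc desc;

        while (rxq_dequeue(rxq, &desc)) {
                if (desc.len == 0) {
                        /* Zero-length desc: kernel is done with the whole
                         * page/chunk, hand it back to the application pool. */
                        umem_page_done(umem,
                                       desc.offset & ~(__u64)(umem->page_size - 1));
                        continue;
                }
                /* Assuming the payload starts right after the headroom. */
                handle_frame(umem->area + desc.offset + desc.headroom,
                             desc.len);
        }
}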

> What I could propose instead is to make the 'filling' and 'page
> completion' symmetric. That is, UMEM will have one queue to pass pages
> from the application to the kernel driver (fring in your terminology) and
> one to indicate completed pages back to the application.
>
> That 'page completion' queue can be added as a third queue to UMEM, or
> alternatively TX completions can be decoupled from UMEM to make it a
> simple 'memory area' object with two directional queues which work
> with RX pages only. TX completions can be handled in a simpler manner
> as part of the TX queues: just make sure the consumer index is advanced
> only when the frame has finally been sent out. The TX completion
> processing can then be done by the application observing the consumer
> index advance. That would not only be more in line with best NIC ring
> design practices, but would also improve cache locality and eliminate a
> critical section if TX queues are assigned to different CPUs.
>
...and this is also a valid option. Adding an additional umem ring for
end-of-chunk completions might make sense, but let the zero-copy
implementations drive that requirement.
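
As a side note, the consumer-index approach you describe could look
roughly like this from the application side -- a sketch with made-up ring
accessors, assuming the kernel only advances the consumer index once a
frame has actually left the wire:

/* Reclaim TX buffers by checking how far the kernel has advanced
 * the consumer index since we last looked. */
static void tx_reclaim(struct xdp_queue *txq, struct umem *umem)
{
        __u32 cons = ring_load_consumer(txq); /* updated by the kernel */

        while (txq->cached_cons != cons) {
                struct xdp_desc *desc =
                        &txq->ring[txq->cached_cons & txq->mask];

                umem_frame_free(umem, desc->offset);
                txq->cached_cons++;
        }
}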

As you probably noted, Magnus and I just sent out a patch set with an
updated descriptor layout. A next step would be adding support for
type-writer styled NICs.

Thanks for the information, and hopefully we'll see an additional
AF_XDP zero-copy implementation on the list (nudge, nudge). ;-)

Björn

> With best regards,
> Mykyta
>
> On 22 May 2018 at 14:09, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > 2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >> On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
> >>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >>>> Hi Björn and Magnus,
> >>>>
> >>>> (This thread is a follow-up to a private dialogue. The intention is to
> >>>> let the community know that AF_XDP can be enhanced further to make it
> >>>> compatible with a wider range of NIC vendors).
> >>>>
> >>>
> >>> Mykyta, thanks for doing the write-up and sending it to the netdev
> >>> list! The timing could not be better -- we need to settle on an uapi
> >>> that works for all vendors prior to enabling it in the kernel.
> >>>
> >>>> There are two NIC variations which don't fit well with the current AF_XDP proposal.
> >>>>
> >>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
> >>>> expects the NIC to put incoming frames into slots defined by the UMEM area.
> >>>> Here the slot size is set in XDP_UMEM_REG xdp_umem.reg.frame_size and the
> >>>> slots available to the NIC are communicated to the kernel via the UMEM fill
> >>>> queue. While Intel NICs support only one slot size, NXP and Cavium
> >>>> support multiple slot sizes to optimize memory usage (e.g. { 128, 512,
> >>>> 2048 }, please refer to [1] for rationale). On frame reception the NIC
> >>>> pulls a slot from the best-fit pool based on the frame size.
> >>>>
> >>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
> >>>> NICs. As shown above, AF_XDP expects the NIC to put incoming frames at
> >>>> predefined addresses. This is not the case here, as the NIC is in
> >>>> charge of placing frames in memory based on its own algorithm. For
> >>>> example, Chelsio T5/T6 expects to get whole pages from the driver
> >>>> and puts incoming frames on the page in a 'tape recorder' fashion.
> >>>> Assuming 'base' is the page address and frame_len[N] is an array of
> >>>> frame lengths, the frame placement in memory will look like this:
> >>>>   base + 0
> >>>>   base + frame_len[0]
> >>>>   base + frame_len[0] + frame_len[1]
> >>>>   base + frame_len[0] + frame_len[1] + frame_len[2]
> >>>>   ...
> >>>>
> >>>> To better support these two NIC variations I suggest abandoning the
> >>>> 'frame size' structuring in UMEM and sticking with 'pages' instead. The
> >>>> NIC kernel driver is then responsible for splitting the provided pages
> >>>> into slots expected by the underlying HW (or not splitting at all in the
> >>>> case of Chelsio/Netcope).
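> >>>>
> >>>> (Just to illustrate the idea, a rough, made-up sketch of what a
> >>>> fixed-frame driver could do with one page taken from the fill queue --
> >>>> none of these helpers exist anywhere:
> >>>>
> >>>> static void fill_hw_ring(struct hw_rx_ring *hwr, __u64 page_off,
> >>>>                          __u32 page_size, __u32 frame_size)
> >>>> {
> >>>>         __u32 off;
> >>>>
> >>>>         /* Chop the page into fixed-size HW RX buffers. */
> >>>>         for (off = 0; off + frame_size <= page_size; off += frame_size)
> >>>>                 hw_post_rx_buffer(hwr, page_off + off, frame_size);
> >>>> }
> >>>>
> >>>> A type-writer NIC would instead just post the whole page.)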
> >>>>
> >>>
> >>> Let's call the first variation "multi-pool" and the second
> >>> "type-writer" for simplicity. The type-writer model is very
> >>> interesting, and makes a lot of sense when the PCIe bus is a
> >>> constraint.
> >>>
> >>>> On XDP_UMEM_REG the application needs to specify page_size. Then the
> >>>> application can pass empty pages to the kernel driver via the UMEM
> >>>> 'fill' queue by specifying the page offset within the UMEM area. The
> >>>> xdp_desc format needs to be changed as well: the frame location will be
> >>>> defined by an offset from the beginning of the UMEM area instead of a
> >>>> frame index. As the payload headroom can vary with AF_XDP we'll need to
> >>>> specify it in xdp_desc as well. Besides that, it could be worth
> >>>> considering changing the payload length to u16, as 64k+ frames aren't
> >>>> very common in networking. The resulting xdp_desc would look like this:
> >>>>
> >>>> struct xdp_desc {
> >>>>         __u64 offset;
> >>>>         __u16 headroom;
> >>>>         __u16 len;
> >>>>         __u8 flags;
> >>>>         __u8 padding[3];
> >>>> };
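> >>>>
> >>>> (Illustration of the intended addressing, assuming the payload is
> >>>> placed right after the headroom:
> >>>>
> >>>>         __u8 *data = umem_area + desc.offset + desc.headroom;
> >>>>
> >>>> where umem_area is a byte pointer to the base of the registered UMEM.)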
> >>>>
> >>>
> >>> Let's expand a bit here; instead of passing indices to fixed-size
> >>> frames to the fill ring, we now pass a memory region. For simplicity,
> >>> let's say a page. The fill ring descriptor requires offset and
> >>> len. The offset is a global offset from a UMEM perspective, and len
> >>> is the size of the region.
> >>>
> >>
> >> I would rather stick with a region equal to a page (regular or huge page,
> >> defined by the application). The page size can be extracted from the
> >> vm_area_struct in XDP_UMEM_REG (preferred) or configured by the
> >> application.
> >>
> >
> > Ok, thinking more about it, I prefer this as well. This means that we
> > only need to grow the UMEM fring/cring descriptors to u64, and not
> > care about length. As you state below, this makes the validation
> > simple.
> >
> > We might consider exposing a "page size hint" that the user can set.
> > For a 4G hugepage scenario, it might make sense to have a chunk *less*
> > than 4G to avoid HW Rx memory running low when the end of a chunk is
> > approaching.
> >
> > As for THP, I need to think about proper behavior here.
> >
> >>> On the Rx ring the descriptor must, as you wrote, be changed as well
> >>> to your suggestion above. Note that headroom is still needed, since
> >>> XDP can change the size of a packet, so the fixed headroom stated in
> >>> the UMEM registration is not sufficient.
> >>>
> >>> This model is obviously more flexible, but then also a bit more
> >>> complex. E.g. for a fixed-frame NIC (like ixgbe), what should the
> >>> behaviour be? Should the fill ring entry be used only for *one* frame,
> >>> or chopped up into multiple receive frames? Should it be configurable?
> >>> Driver specific?
> >>
> >> I think driver-specific makes the most sense here. In the case of a
> >> fixed-frame NIC the driver shall chop the ring entry into multiple
> >> receive frames.
> >>
> >
> > Let's start there, keeping the configuration space small.
> >
> >>>
> >>> Also, validating the entries in the fill queue requires more work
> >>> (compared to the current index variant). Currently, we only skip
> >>> invalid indices. What should we do when, say, you pass a memory window
> >>> that is too narrow (say 128B) but correct from a UMEM
> >>> perspective? Going this path, we need to add pretty hard constraints
> >>> so we don't end up with too complex code -- because then it'll be too
> >>> slow.
> >>
> >> If we stick with pages -- the only possible erroneous input will be
> >> 'page out of UMEM boundaries'. The validation will essentially be:
> >>
> >> if ((offset >= umem->size) || (offset & (umem->page_size - 1)))
> >>     fail
> >>
> >> The question is what shall be done if validation fails? Would
> >> SEGFAULT be reasonable? This is more or less equivalent to
> >> dereferencing an invalid pointer.
> >>
> >
> > The current scheme is simply that the kernel skips (drops) the invalid
> > fill ring entry. SIGSEGV is an interesting idea!
> >
> >>>
> >>>> In the current proposal you have a notion of 'frame ownership': 'owned
> >>>> by the kernel' or 'owned by the application'. The ownership is
> >>>> transferred by means of enqueueing the frame index in the UMEM 'fill'
> >>>> queue (from application to kernel) or in the UMEM 'tx completion' queue
> >>>> (from kernel to application). If you decide to adopt the 'page' approach
> >>>> this notion needs to be changed a bit. This is because, in the case of
> >>>> packet forwarding, one and the same page can be used for RX (parts of it
> >>>> enqueued in HW 'free lists') and TX (forwarding of previously RXed
> >>>> packets).
> >>>>
> >>>> I propose to define 'ownership' as the right to manipulate the
> >>>> partitioning of the page into frames. Whenever the application passes a
> >>>> page to the kernel via the UMEM 'fill' queue -- the ownership is
> >>>> transferred to the kernel. The application can't allocate packets on
> >>>> this page until the kernel is done with it, but it can change the
> >>>> payload of RXed packets before forwarding them. The kernel can pass
> >>>> ownership back by means of 'end-of-page' in xdp_desc.flags.
> >>>>
> >>>
> >>> I like the end-of-page mechanism.
> >>>
> >>>> The pages are definitely supposed to be recycled sooner or later. Even
> >>>> if it's not part of the kernel API and the housekeeping implementation
> >>>> resides completely in the application, I would still like to propose a
> >>>> possible (hopefully cost-efficient) solution. The recycling could be
> >>>> achieved by keeping a refcount on pages and recycling a page only when
> >>>> it's owned by the application and the refcount reaches 0.
> >>>>
> >>>> Whenever the application transfers page ownership to the kernel the
> >>>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
> >>>> corresponding page needs to be identified (xdp_desc.offset >>
> >>>> PAGE_SHIFT) and its refcount incremented. When the packet gets freed,
> >>>> the refcount shall be decremented. If a packet is forwarded in a TX
> >>>> xdp_desc -- the refcount gets decremented only on TX completion (again,
> >>>> tx_completion.desc >> PAGE_SHIFT). For packets originating from the
> >>>> application itself, the payload buffers need to be allocated from an
> >>>> empty page owned by the application and the refcount needs to be
> >>>> incremented as well.
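> >>>>
> >>>> (A minimal sketch of that bookkeeping, with made-up names, just to
> >>>> show the intent:
> >>>>
> >>>> /* One counter per UMEM page; a page is recyclable once it is owned
> >>>>  * by the application again and its counter is back to 0. */
> >>>> static __u32 page_refs[NUM_PAGES];
> >>>>
> >>>> static void page_get(__u64 offset)
> >>>> {
> >>>>         page_refs[offset >> PAGE_SHIFT]++;
> >>>> }
> >>>>
> >>>> static void page_put(__u64 offset)
> >>>> {
> >>>>         if (--page_refs[offset >> PAGE_SHIFT] == 0 &&
> >>>>             page_owned_by_app(offset))  /* hypothetical helper */
> >>>>                 recycle_page(offset);   /* back to the fill queue */
> >>>> }
> >>>>
> >>>> where NUM_PAGES, page_owned_by_app() and recycle_page() are all
> >>>> hypothetical.)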
> >>>>
> >>>
> >>> Strictly speaking, we're not enforcing correctness in the current
> >>> solution. If the userspace application passes index 1 multiple times
> >>> to the fill ring, and at the same time sends index 1, things will
> >>> break. So, with the existing solution the userspace application
> >>> *still* needs to track the frames. With this new model, the
> >>> tracking/refcounting will be more complex. That might be a concern.
> >>>
> >>> For the multi-pool NICs I think we can still just have one UMEM, and
> >>> let the driver decide in which pool to place a certain chunk
> >>> of memory. Thoughts?
> >>
> >> Definitely agree with that. This is HW specific and exposing it to the
> >> application would only harm portability.
> >>
> >
> > Good stuff, we're on the same page then.
> >
> >>>
> >>> Now, how do we go forward? I think this is very important, and I will
> >>> hack a copy-mode version for this new API. I'm a bit worried that the
> >>> complexity/configuration space will grow and impact performance, but
> >>> let's see.
> >>>
> >>> To prove that the API works for the NICs you mentioned, we need an
> >>> actual zero-copy implementation for them. Do you think Linaro could
> >>> work on a zero-copy variant for any of the NICs above?
> >>>
> >>
> >> Linaro will definitely contribute a zero-copy implementation for some
> >> ARM-based NICs with the 'multi-pool' variation.
> >
> > Very nice!
> >
> >> Implementation of the
> >> 'type-writer' variation is up to Chelsio/Netcope; we only try to come
> >> up with an API which (most probably) will fit them as well.
> >>
> >
> > Let's hope we get an implementation from these vendors as well! :-)
> >
> >
> > Björn
> >
> >>>
> >>> Again thanks for bringing this up!
> >>> Björn
> >>>
> >>>
> >>>
> >>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
> >>>>
> >>>> With best regards,
> >>>> Mykyta

Thread overview: 7+ messages
2018-05-21 12:34 [RFC feedback] AF_XDP and non-Intel hardware Mykyta Iziumtsev
2018-05-21 18:55 ` Björn Töpel
2018-05-22  7:45   ` Mykyta Iziumtsev
2018-05-22 12:09     ` Björn Töpel
2018-06-04  9:44       ` Mykyta Iziumtsev
2018-06-04 15:43         ` Björn Töpel [this message]
2018-05-22  9:15   ` Luke Gorrie
