From: Björn Töpel
Subject: Re: [RFC feedback] AF_XDP and non-Intel hardware
Date: Tue, 22 May 2018 14:09:46 +0200
To: Mykyta Iziumtsev
Cc: Netdev, Björn Töpel, "Karlsson, Magnus", "Zhang, Qi Z", Francois Ozog, Ilias Apalodimas, Brian Brooks, Jesper Dangaard Brouer, Andy Gospodarek, michael.chan@broadcom.com, Luke Gorrie

2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev:
> On 21 May 2018 at 20:55, Björn Töpel wrote:
>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev:
>>> Hi Björn and Magnus,
>>>
>>> (This thread is a follow-up to a private dialogue. The intention is to
>>> let the community know that AF_XDP can be enhanced further to make it
>>> compatible with a wider range of NIC vendors.)
>>>
>>
>> Mykyta, thanks for doing the write-up and sending it to the netdev
>> list! The timing could not be better -- we need to settle on a uapi
>> that works for all vendors prior to enabling it in the kernel.
>>
>>> There are two NIC variations which don't fit well with the current
>>> AF_XDP proposal.
>>>
>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
>>> expects the NIC to put incoming frames into slots defined by the UMEM
>>> area. Here the slot size is set in XDP_UMEM_REG via
>>> xdp_umem_reg.frame_size, and the slots available to the NIC are
>>> communicated to the kernel via the UMEM fill queue. While Intel NICs
>>> support only one slot size, NXP and Cavium support multiple slot sizes
>>> to optimize memory usage (e.g. { 128, 512, 2048 }, please refer to [1]
>>> for the rationale). On frame reception the NIC pulls a slot from the
>>> best-fit pool based on the frame size.
>>>
>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
>>> NICs. As shown above, AF_XDP expects the NIC to put incoming frames at
>>> predefined addresses. This is not the case here, as the NIC is in
>>> charge of placing frames in memory based on its own algorithm. For
>>> example, Chelsio T5/T6 expects to get whole pages from the driver and
>>> puts incoming frames on the page in a 'tape recorder' fashion.
>>> Assuming 'base' is the page address and frame_len[N] is an array of
>>> frame lengths, the frame placement in memory will look like this:
>>>
>>> base + 0
>>> base + frame_len[0]
>>> base + frame_len[0] + frame_len[1]
>>> base + frame_len[0] + frame_len[1] + frame_len[2]
>>> ...
>>>
>>> To better support these two NIC variations I suggest abandoning the
>>> 'frame size' structuring in UMEM and sticking with 'pages' instead.
>>> The NIC kernel driver is then responsible for splitting the provided
>>> pages into the slots expected by the underlying HW (or not splitting
>>> at all in the case of Chelsio/Netcope).
>>>
>>
>> Let's call the first variation "multi-pool" and the second
>> "type-writer" for simplicity. The type-writer model is very
>> interesting, and makes a lot of sense when the PCIe bus is a
>> constraint.
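To make the type-writer placement concrete, here is a minimal C sketch
(all names are made up for illustration and not tied to any driver) of
how the per-frame offsets fall out of the lengths the HW reports: the
first frame sits at the page base and each subsequent frame starts
right where the previous one ended.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: given a page handed to a 'type-writer' NIC and
 * the per-frame lengths it reported, compute each frame's offset
 * within the UMEM area. */
static void typewriter_frame_offsets(uint64_t page_off,
                                     const uint16_t *frame_len,
                                     size_t n_frames,
                                     uint64_t *offsets)
{
        uint64_t cur = page_off;        /* first frame at base + 0 */
        size_t i;

        for (i = 0; i < n_frames; i++) {
                offsets[i] = cur;
                cur += frame_len[i];    /* next frame follows directly */
        }
}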
>>> On XDP_UMEM_REG the application needs to specify page_size. Then the
>>> application can pass empty pages to the kernel driver using the UMEM
>>> 'fill' queue by specifying the page offset within the UMEM area. The
>>> xdp_desc format needs to be changed as well: the frame location will
>>> be defined by an offset from the beginning of the UMEM area instead of
>>> a frame index. As the payload headroom can vary with AF_XDP, we'll
>>> need to specify it in xdp_desc as well. Besides that, it could be
>>> worth considering changing the payload length to u16, as 64k+ frames
>>> aren't very common in networking. The resulting xdp_desc would look
>>> like this:
>>>
>>> struct xdp_desc {
>>>         __u64 offset;
>>>         __u16 headroom;
>>>         __u16 len;
>>>         __u8 flags;
>>>         __u8 padding[3];
>>> };
>>>
>>
>> Let's expand a bit here: instead of passing indices of fixed-size
>> frames to the fill ring, we now pass a memory region. For simplicity,
>> let's say a page. The fill ring descriptor requires offset and len.
>> The offset is a global offset from a UMEM perspective, and len is the
>> size of the region.
>>
>
> I would rather stick with a region equal to a page (regular or huge
> page, defined by the application). The page size can be extracted from
> the vm_area_struct in XDP_UMEM_REG (preferred) or configured by the
> application.
>

OK, thinking more about it, I prefer this as well. This means that we
only need to grow the UMEM fring/cring descriptors to u64, and not care
about the length. As you state below, this makes the validation simple.

We might consider exposing a "page size hint" that the user can set.
For the 4G hugepage scenario, it might make sense to have a chunk
*less* than 4G to avoid HW Rx memory running low when the end of a
chunk is approaching. As for THP, I need to think about the proper
behavior here.

>> On the Rx ring the descriptor, as you wrote, must be changed as well,
>> per your suggestion above. Note that headroom is still needed, since
>> XDP can change the size of a packet, so the fixed headroom stated at
>> UMEM registration is not sufficient.
>>
>> This model is obviously more flexible, but then also a bit more
>> complex. E.g. for a fixed-frame NIC (like ixgbe), what should the
>> behaviour be? Should the fill ring entry be used for only *one* frame,
>> or chopped up into multiple receive frames? Should it be configurable?
>> Driver specific?
>
> I think driver-specific makes most sense here. In the case of a
> fixed-frame NIC the driver shall chop the ring entry into multiple
> receive frames.
>

Let's start there, keeping the configuration space small.

>>
>> Also, validating the entries in the fill queue requires more work
>> (compared to the current index variant). Currently, we only skip
>> invalid indices. What should we do when, say, you pass a memory window
>> that is too narrow (say 128B) but correct from a UMEM perspective?
>> Going this path, we need to add pretty hard constraints so we don't
>> end up with too complex code -- because then it'll be too slow.
>
> If we stick with pages -- the only possible erroneous input will be
> 'page out of UMEM boundaries'. The validation will essentially be:
>
>         if ((offset > umem->size) || (offset & (umem->page_size - 1)))
>                 fail;
>
> The question is what shall be done if validation fails? Would SEGFAULT
> be reasonable? This is more or less equivalent to dereferencing an
> invalid pointer.
>

The current scheme is simply dropping, i.e. the kernel skips the
invalid fill ring entry. SIGSEGV is an interesting idea!
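To spell that check out, here is a minimal, self-contained sketch of
what page-granular fill-ring validation could look like. The struct and
function names are hypothetical and only meant to illustrate the idea;
this is not the actual kernel code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of a registered UMEM: total length and the page
 * size declared (or derived) at XDP_UMEM_REG time. page_size is
 * assumed to be a power of two. */
struct umem_props {
        uint64_t size;       /* total UMEM length in bytes */
        uint64_t page_size;  /* granularity of fill ring entries */
};

/* A fill ring entry is just a u64 offset into the UMEM area. It is
 * valid if it lies inside the UMEM and is page aligned. */
static bool fill_entry_valid(const struct umem_props *umem,
                             uint64_t offset)
{
        if (offset >= umem->size)            /* out of UMEM boundaries */
                return false;
        if (offset & (umem->page_size - 1))  /* not page aligned */
                return false;
        return true;
}

With the current scheme an entry failing this check would simply be
skipped by the kernel; signalling the process instead is the open
question above.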
>>
>>> In the current proposal you have a notion of 'frame ownership': 'owned
>>> by the kernel' or 'owned by the application'. The ownership is
>>> transferred by means of enqueueing the frame index in the UMEM 'fill'
>>> queue (from application to kernel) or in the UMEM 'tx completion'
>>> queue (from kernel to application). If you decide to adopt the 'page'
>>> approach, this notion needs to be changed a bit. This is because, in
>>> the case of packet forwarding, one and the same page can be used for
>>> RX (parts of it enqueued in HW 'free lists') and TX (forwarding of
>>> previously RXed packets).
>>>
>>> I propose to define 'ownership' as the right to manipulate the
>>> partitioning of the page into frames. Whenever the application passes
>>> a page to the kernel via the UMEM 'fill' queue -- the ownership is
>>> transferred to the kernel. The application can't allocate packets on
>>> this page until the kernel is done with it, but it can change the
>>> payload of RXed packets before forwarding them. The kernel can pass
>>> ownership back by means of an 'end-of-page' flag in xdp_desc.flags.
>>>
>>
>> I like the end-of-page mechanism.
>>
>>> The pages are definitely supposed to be recycled sooner or later. Even
>>> if it's not part of the kernel API and the housekeeping implementation
>>> resides completely in the application, I would still like to propose a
>>> possible (hopefully cost-efficient) solution to that. The recycling
>>> could be achieved by keeping a refcount on pages and recycling a page
>>> only when it's owned by the application and its refcount reaches 0.
>>>
>>> Whenever the application transfers page ownership to the kernel, the
>>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
>>> corresponding page needs to be identified (xdp_desc.offset >>
>>> PAGE_SHIFT) and its refcount incremented. When the packet gets freed,
>>> the refcount shall be decremented. If a packet is forwarded in a TX
>>> xdp_desc -- the refcount gets decremented only on TX completion
>>> (again, tx_completion.desc >> PAGE_SHIFT). For packets originating
>>> from the application itself, the payload buffers need to be allocated
>>> from an empty page owned by the application, and the refcount needs to
>>> be incremented as well.
>>>
>>
>> Strictly speaking, we're not enforcing correctness in the current
>> solution. If the userspace application passes index 1 multiple times
>> to the fill ring, and at the same time sends index 1, things will
>> break. So, with the existing solution the userspace application
>> *still* needs to track the frames. With this new model, the
>> tracking/refcounting will be more complex. That might be a concern.
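For reference, a minimal user-space sketch of the per-page refcount
bookkeeping described above (all names are hypothetical; this only
illustrates the tracking, it is not a proposed API):

#include <stdint.h>
#include <stdbool.h>

#define EXAMPLE_PAGE_SHIFT 12   /* assuming 4K pages for the example */

/* Per-page state kept by the application: who may currently
 * re-partition the page, and how many in-flight frames still
 * reference it. */
struct page_state {
        bool kernel_owned;      /* page posted to the fill queue */
        uint32_t refcount;      /* frames still referencing the page */
};

static inline uint64_t page_of(uint64_t offset)
{
        return offset >> EXAMPLE_PAGE_SHIFT;
}

/* RX: a frame arrived on this page. */
static void on_rx(struct page_state *pages, uint64_t offset)
{
        pages[page_of(offset)].refcount++;
}

/* The kernel set the end-of-page flag: ownership returns to the
 * application. (A complete implementation would also check for an
 * already idle page here.) */
static void on_end_of_page(struct page_state *pages, uint64_t offset)
{
        pages[page_of(offset)].kernel_owned = false;
}

/* A frame was freed, or a TX completion was seen for a forwarded
 * frame. The page can be recycled once it is application-owned and
 * idle. */
static void on_frame_done(struct page_state *pages, uint64_t offset)
{
        struct page_state *ps = &pages[page_of(offset)];

        if (--ps->refcount == 0 && !ps->kernel_owned) {
                /* Recycle: e.g. re-post to the fill queue, which hands
                 * ownership back to the kernel with refcount 0. */
        }
}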
>>
>> For the multi-pool NICs I think we can still just have one UMEM, and
>> let the driver decide in which pool to place a given chunk of memory.
>> Thoughts?
>
> Definitely agree with that. This is HW specific and exposing it to the
> application would only harm portability.
>

Good stuff, we're on the same page then.

>>
>> Now, how do we go forward? I think this is very important, and I will
>> hack a copy-mode version for this new API. I'm a bit worried that the
>> complexity/configuration space will grow and impact performance, but
>> let's see.
>>
>> To prove that the API works for the NICs you mentioned, we need an
>> actual zero-copy implementation for them. Do you think Linaro could
>> work on a zero-copy variant for any of the NICs above?
>>
>
> Linaro will definitely contribute a zero-copy implementation for some
> ARM-based NICs with the 'multi-pool' variation.

Very nice!

> Implementation of the 'type-writer' variation is up to Chelsio/Netcope;
> we only try to come up with an API which (most probably) will fit them
> as well.

Let's hope we get an implementation from these vendors as well! :-)


Björn

>>
>> Again thanks for bringing this up!
>> Björn
>>
>>
>>
>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
>>>
>>> With best regards,
>>> Mykyta