From: Jakub Kicinski
Subject: Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
Date: Fri, 31 Aug 2018 09:55:21 +0200
Message-ID: <20180831095521.08b3515f@cakuba.netronome.com>
References: <20180828124435.30578-1-bjorn.topel@gmail.com> <20180828124435.30578-9-bjorn.topel@gmail.com> <20180829211428.6cce4a1b@cakuba.netronome.com> <65684cea-76bc-2f2d-600f-402f8504301d@intel.com>
In-Reply-To: <65684cea-76bc-2f2d-600f-402f8504301d@intel.com>
To: Björn Töpel
Cc: Björn Töpel, magnus.karlsson@intel.com, magnus.karlsson@gmail.com,
 alexander.h.duyck@intel.com, alexander.duyck@gmail.com, ast@kernel.org,
 brouer@redhat.com, daniel@iogearbox.net, netdev@vger.kernel.org,
 jesse.brandeburg@intel.com, anjali.singhai@intel.com,
 peter.waskiewicz.jr@intel.com, michael.lundkvist@ericsson.com,
 willemdebruijn.kernel@gmail.com, john.fastabend@gmail.com,
 neerav.parikh@intel.com, mykyta.iziumtsev@linaro.org,
 francois.ozog@linaro.org, ilias.apalodimas@linaro.org,
 brian.brooks@linaro.org, u9012063@gmail.com, pavel@fastnetmon.com,
 qi.z.zhang@intel.com
List-ID: netdev@vger.kernel.org

On Thu, 30 Aug 2018 14:06:24 +0200, Björn Töpel wrote:
> On 2018-08-29 21:14, Jakub Kicinski wrote:
> > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
> >> From: Björn Töpel
> >>
> >> This patch adds zero-copy Rx support for AF_XDP sockets.
> >> Instead of allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx
> >> frames are allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled
> >> for a certain queue.
> >>
> >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
> >>
> >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> >> will allocate a new buffer and copy the zero-copy frame prior to
> >> passing it to the kernel stack.
> >>
> >> Signed-off-by: Björn Töpel
> >
> > Mm.. I'm surprised you don't run into buffer reuse issues that I had
> > when playing with AF_XDP. What happens in i40e if someone downs the
> > interface? Will UMEMs get destroyed? Will the RX buffers get freed?
> >
>
> The UMEM will linger in the driver until the sockets are dead.
>
> > I'll shortly send an RFC with my quick and dirty RX buffer reuse
> > queue, FWIW.
> >
>
> Some background for folks that don't know the details: A zero-copy
> capable driver picks buffers off the fill ring and places them on the
> hardware Rx ring, to be completed at a later point when DMA is
> complete. Analogously on the Tx side: the driver picks buffers off the
> Tx ring and places them on the Tx hardware ring.
>
> In the typical flow, the Rx buffer will be placed onto an Rx ring
> (completed to the user), and the Tx buffer will be placed on the
> completion ring to notify the user that the transfer is done.
>
> However, if the driver needs to tear down the hardware rings for some
> reason (interface goes down, reconfiguration and such), what should be
> done with the Rx and Tx buffers that have been given to the driver?
>
> So, to frame the problem: What should a driver do when this happens,
> so that buffers aren't leaked?
>
> Note that when the UMEM is going down, there's no need to complete
> anything, since the sockets are dying/dead already.
>
> This is, as you state, a missing piece in the implementation and needs
> to be fixed.
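To make the hand-off above concrete, here is a toy user-space model of the
zero-copy Rx flow (fill ring -> HW Rx ring -> Rx ring). All names and the
two-counter ring layout are illustrative only -- this is not the real xsk
or i40e kernel API, just a sketch of who owns which buffer when:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the AF_XDP zero-copy Rx buffer flow. The addr values
 * stand in for UMEM offsets, as carried by the real rings. */

#define RING_SIZE 8		/* power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
	uint64_t addr[RING_SIZE];
	uint32_t prod;
	uint32_t cons;
};

static int ring_full(const struct ring *r)
{
	return r->prod - r->cons == RING_SIZE;
}

static int ring_empty(const struct ring *r)
{
	return r->prod == r->cons;
}

/* User space produces free buffers on the fill ring. */
static int fill_ring_produce(struct ring *fq, uint64_t addr)
{
	if (ring_full(fq))
		return -1;
	fq->addr[fq->prod++ & RING_MASK] = addr;
	return 0;
}

/* The driver consumes a fill ring entry and gives it to hardware. */
static int driver_take_from_fill(struct ring *fq, uint64_t *addr)
{
	if (ring_empty(fq))
		return -1;
	*addr = fq->addr[fq->cons++ & RING_MASK];
	return 0;
}

/* On DMA completion, the driver completes the buffer onto the Rx ring.
 * Note the failure case: if the Rx ring is full there is nowhere to
 * put the buffer -- exactly the missing piece discussed above. */
static int driver_complete_rx(struct ring *rx, uint64_t addr)
{
	if (ring_full(rx))
		return -1;
	rx->addr[rx->prod++ & RING_MASK] = addr;
	return 0;
}
```

In the teardown scenario the driver is holding buffers it took via
`driver_take_from_fill()` but can no longer complete, which is what
options 1-3 below try to address.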
>
> Now on to possible solutions:
>
> 1. Complete the buffers back to the user. For Tx, this is probably the
>    best way -- just place the buffers onto the completion ring.
>
>    For Rx, we can give buffers back to user space by setting the
>    length in the Rx descriptor to zero and putting them on the Rx
>    ring. However, one complication here is that we do not have any
>    back-pressure mechanism for the Rx side like we have on Tx. If the
>    Rx ring(s) is (are) full, the kernel will have to leak them or
>    implement a retry mechanism (ugly and should be avoided).
>
>    Another option to solve this without needing any retry or leaking
>    for Rx is to implement the same back-pressure mechanism that we
>    have on the Tx path in the Rx path. In the Tx path, the driver will
>    only get a Tx packet to send if there is space for it in the
>    completion ring. On Rx, this would mean that the driver only gets a
>    buffer from the fill ring if there is space to put it on the Rx
>    ring. The drawback is that this would likely impact performance
>    negatively, since the Rx ring would have to be touched one more
>    time (in the Tx path, it increased performance since it made it
>    possible to implement the Tx path without any buffering), but it
>    would guarantee that all buffers can always be returned to user
>    space, making this solution a viable option.
>
> 2. Store the buffers internally in the driver, and make sure that they
>    are inserted into the "normal flow" again. For Rx that would be
>    putting the buffers back into the allocation scheme that the driver
>    is using. For Tx, placing the buffers back onto the Tx HW ring
>    (plus all the logic for making sure that all corner cases work).
>
> 3. Mark the socket(s) as being in an error state, and require the user
>    to redo the setup. This is a bit harsh...

That's a good summary, one more (bad) option:

4. Disallow any operations on the device which could lead to RX buffer
   discards if any UMEMs are attached.
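The Rx back-pressure variant of option 1 can be sketched like this: the
driver refuses to take a fill ring buffer unless the Rx ring is guaranteed
to have room for the eventual completion, so a completion can never fail
and no buffer is ever stranded. Again a toy model with made-up names, not
the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

#define BP_RING_SIZE 4		/* power of two */
#define BP_RING_MASK (BP_RING_SIZE - 1)

struct bp_ring {
	uint64_t addr[BP_RING_SIZE];
	uint32_t prod;
	uint32_t cons;
};

/* Only consume a fill ring entry if the Rx ring has a free slot for
 * the completion. This is the extra Rx ring touch mentioned above:
 * the Rx ring state is inspected once more per buffer taken. */
static int take_buffer_backpressured(struct bp_ring *fq, struct bp_ring *rx,
				     uint64_t *addr)
{
	if (rx->prod - rx->cons == BP_RING_SIZE)
		return -1;	/* Rx ring full: refuse new work */
	if (fq->prod == fq->cons)
		return -1;	/* fill ring empty */
	*addr = fq->addr[fq->cons++ & BP_RING_MASK];
	return 0;
}
```

With this invariant, teardown can always hand back every in-flight Rx
buffer, at the cost described in the thread.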
> For i40e I think #2 for Rx (buffers reside in kernel, return to
> allocator) and #1 for Tx (complete to userland).
>
> Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
> of extending the umem with the "reuse queue". This decision is really
> up to the driver. Some driver might prefer another scheme, or simply
> prefer storing the buffers in another manner.

The only performance cost is the extra pointer in xdp_umem. Drivers
have to opt in by using the *_rq() helpers. We can move the extra
pointer to driver structs, and have them pass it to the helpers, so
that would be zero extra cost, and we can reuse the implementation.

> Looking forward, as both you and Jesper have alluded to, we need a
> proper allocator for zero-copy. Then it would be a matter of injecting
> the Rx buffers back into the allocator.
>
> I'll send out a patch, so that we don't leak the buffers, but I'd like
> to hear your thoughts on what the behavior should be.
>
> And what should the behavior be when the netdev is removed from the
> kernel?

What would AF_PACKET do, return ENXIO?

> And thanks for looking into the code!
>
> Björn
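For reference, the driver-side reuse queue discussed above (option 2 for
Rx) could look roughly like the following. The `xsk_rq_*` names are
hypothetical and not the helpers from the RFC: buffers the hardware still
held at teardown are parked on a small internal queue, and that queue is
drained before new fill ring entries are consumed:

```c
#include <assert.h>
#include <stdint.h>

#define RQ_CAP 16

/* Per-ring reuse queue: a simple LIFO of UMEM offsets the driver
 * could not complete to user space when the HW ring went down. */
struct xsk_reuse_q {
	uint64_t addr[RQ_CAP];
	unsigned int len;
};

/* Park a buffer reclaimed from the hardware ring during teardown. */
static int xsk_rq_push(struct xsk_reuse_q *rq, uint64_t addr)
{
	if (rq->len == RQ_CAP)
		return -1;
	rq->addr[rq->len++] = addr;
	return 0;
}

/* On ring re-allocation, hand out parked buffers first, before
 * touching the fill ring, so nothing is leaked across reconfigs. */
static int xsk_rq_pop(struct xsk_reuse_q *rq, uint64_t *addr)
{
	if (!rq->len)
		return -1;
	*addr = rq->addr[--rq->len];
	return 0;
}
```

Whether this queue lives in the umem (as in the RFC) or in driver-private
structs, as suggested above, only changes who owns the pointer.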