From: Jakub Kicinski
Subject: Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
Date: Fri, 31 Aug 2018 09:55:21 +0200
Message-ID: <20180831095521.08b3515f@cakuba.netronome.com>
References: <20180828124435.30578-1-bjorn.topel@gmail.com> <20180828124435.30578-9-bjorn.topel@gmail.com> <20180829211428.6cce4a1b@cakuba.netronome.com> <65684cea-76bc-2f2d-600f-402f8504301d@intel.com>
In-Reply-To: <65684cea-76bc-2f2d-600f-402f8504301d@intel.com>
To: Björn Töpel
Cc: Björn Töpel, magnus.karlsson@intel.com, magnus.karlsson@gmail.com,
 alexander.h.duyck@intel.com, alexander.duyck@gmail.com, ast@kernel.org,
 brouer@redhat.com, daniel@iogearbox.net, netdev@vger.kernel.org,
 jesse.brandeburg@intel.com, anjali.singhai@intel.com,
 peter.waskiewicz.jr@intel.com, michael.lundkvist@ericsson.com,
 willemdebruijn.kernel@gmail.com, john.fastabend@gmail.com,
 neerav.parikh@intel.com, mykyta.iziumtsev@linaro.org,
 francois.ozog@linaro.org, ilias.apalodimas@linaro.org,
 brian.brooks@linaro.org, u9012063@gmail.com, pavel@fastnetmon.com,
 qi.z.zhang@intel.com
List-ID: netdev@vger.kernel.org

On Thu, 30 Aug 2018 14:06:24 +0200, Björn Töpel wrote:
> On 2018-08-29 21:14, Jakub Kicinski wrote:
> > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
> >> From: Björn Töpel
> >>
> >> This patch adds zero-copy Rx support for AF_XDP sockets.
> >> Instead of allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx
> >> frames are allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled
> >> for a certain queue.
> >>
> >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
> >>
> >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> >> will allocate a new buffer and copy the zero-copy frame prior to
> >> passing it to the kernel stack.
> >>
> >> Signed-off-by: Björn Töpel
> >
> > Mm.. I'm surprised you don't run into buffer reuse issues that I had
> > when playing with AF_XDP. What happens in i40e if someone downs the
> > interface? Will UMEMs get destroyed? Will the RX buffers get freed?
> >
>
> The UMEM will linger in the driver until the sockets are dead.
>
> > I'll shortly send an RFC with my quick and dirty RX buffer reuse
> > queue, FWIW.
> >
>
> Some background for folks that don't know the details: A zero-copy
> capable driver picks buffers off the fill ring and places them on the
> hardware Rx ring, to be completed at a later point when DMA is
> complete. Analogously on the Tx side: the driver picks buffers off the
> Tx ring and places them on the Tx hardware ring.
>
> In the typical flow, the Rx buffer will be placed onto an Rx ring
> (completed to the user), and the Tx buffer will be placed on the
> completion ring to notify the user that the transfer is done.
>
> However, if the driver needs to tear down the hardware rings for some
> reason (interface goes down, reconfiguration and such), what should be
> done with the Rx and Tx buffers that have been given to the driver?
>
> So, to frame the problem: What should a driver do when this happens,
> so that buffers aren't leaked?
>
> Note that when the UMEM is going down, there's no need to complete
> anything, since the sockets are dying/dead already.
>
> This is, as you state, a missing piece in the implementation and needs
> to be fixed.
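To make the hand-off above concrete, here is a toy user-space model of the
zero-copy Rx flow (fill ring -> HW Rx ring -> Rx ring). All names and the
two-counter ring layout are illustrative only -- this is not the real xsk
or i40e kernel API, just a sketch of who owns which buffer when:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the AF_XDP zero-copy Rx buffer flow. The addr values
 * stand in for UMEM offsets, as carried by the real rings. */

#define RING_SIZE 8		/* power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
	uint64_t addr[RING_SIZE];
	uint32_t prod;
	uint32_t cons;
};

static int ring_full(const struct ring *r)
{
	return r->prod - r->cons == RING_SIZE;
}

static int ring_empty(const struct ring *r)
{
	return r->prod == r->cons;
}

/* User space produces free buffers on the fill ring. */
static int fill_ring_produce(struct ring *fq, uint64_t addr)
{
	if (ring_full(fq))
		return -1;
	fq->addr[fq->prod++ & RING_MASK] = addr;
	return 0;
}

/* The driver consumes a fill ring entry and gives it to hardware. */
static int driver_take_from_fill(struct ring *fq, uint64_t *addr)
{
	if (ring_empty(fq))
		return -1;
	*addr = fq->addr[fq->cons++ & RING_MASK];
	return 0;
}

/* On DMA completion, the driver completes the buffer onto the Rx ring.
 * Note the failure case: if the Rx ring is full there is nowhere to
 * put the buffer -- exactly the missing piece discussed above. */
static int driver_complete_rx(struct ring *rx, uint64_t addr)
{
	if (ring_full(rx))
		return -1;
	rx->addr[rx->prod++ & RING_MASK] = addr;
	return 0;
}
```

In the teardown scenario the driver is holding buffers it took via
`driver_take_from_fill()` but can no longer complete, which is what
options 1-3 below try to address.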
>
> Now on to possible solutions:
>
> 1. Complete the buffers back to the user. For Tx, this is probably the
>    best way -- just place the buffers onto the completion ring.
>
>    For Rx, we can give buffers back to user space by setting the
>    length in the Rx descriptor to zero and putting them on the Rx
>    ring. However, one complication here is that we do not have any
>    back-pressure mechanism for the Rx side like we have on Tx. If the
>    Rx ring(s) is (are) full, the kernel will have to leak them or
>    implement a retry mechanism (ugly and should be avoided).
>
>    Another option to solve this without needing any retry or leaking
>    for Rx is to implement the same back-pressure mechanism that we
>    have on the Tx path in the Rx path. In the Tx path, the driver will
>    only get a Tx packet to send if there is space for it in the
>    completion ring. On Rx, this would mean that the driver only gets a
>    buffer from the fill ring if there is space to put it on the Rx
>    ring. The drawback is that this would likely impact performance
>    negatively, since the Rx ring would have to be touched one more
>    time (in the Tx path, it increased performance since it made it
>    possible to implement the Tx path without any buffering), but it
>    would guarantee that all buffers can always be returned to user
>    space, making this solution a viable option.
>
> 2. Store the buffers internally in the driver, and make sure that they
>    are inserted into the "normal flow" again. For Rx that would be
>    putting the buffers back into the allocation scheme that the driver
>    is using. For Tx, placing the buffers back onto the Tx HW ring
>    (plus all the logic for making sure that all corner cases work).
>
> 3. Mark the socket(s) as being in an error state, and require the user
>    to redo the setup. This is a bit harsh...

That's a good summary, one more (bad) option:

4. Disallow any operations on the device which could lead to RX buffer
   discards if any UMEMs are attached.
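The Rx back-pressure variant of option 1 can be sketched like this: the
driver refuses to take a fill ring buffer unless the Rx ring is guaranteed
to have room for the eventual completion, so a completion can never fail
and no buffer is ever stranded. Again a toy model with made-up names, not
the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

#define BP_RING_SIZE 4		/* power of two */
#define BP_RING_MASK (BP_RING_SIZE - 1)

struct bp_ring {
	uint64_t addr[BP_RING_SIZE];
	uint32_t prod;
	uint32_t cons;
};

/* Only consume a fill ring entry if the Rx ring has a free slot for
 * the completion. This is the extra Rx ring touch mentioned above:
 * the Rx ring state is inspected once more per buffer taken. */
static int take_buffer_backpressured(struct bp_ring *fq, struct bp_ring *rx,
				     uint64_t *addr)
{
	if (rx->prod - rx->cons == BP_RING_SIZE)
		return -1;	/* Rx ring full: refuse new work */
	if (fq->prod == fq->cons)
		return -1;	/* fill ring empty */
	*addr = fq->addr[fq->cons++ & BP_RING_MASK];
	return 0;
}
```

With this invariant, teardown can always hand back every in-flight Rx
buffer, at the cost described in the thread.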
> For i40e I think #2 for Rx (buffers reside in kernel, return to
> allocator) and #1 for Tx (complete to userland).
>
> Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
> of extending the umem with the "reuse queue". This decision is really
> up to the driver. Some driver might prefer another scheme, or simply
> prefer storing the buffers in another manner.

The only performance cost is the extra pointer in xdp_umem. Drivers
have to opt in by using the *_rq() helpers. We can move the extra
pointer to driver structs, and have them pass it to the helpers, so
that would be zero extra cost, and we can reuse the implementation.

> Looking forward, as both you and Jesper have alluded to, we need a
> proper allocator for zero-copy. Then it would be a matter of injecting
> the Rx buffers back into the allocator.
>
> I'll send out a patch, so that we don't leak the buffers, but I'd like
> to hear your thoughts on what the behavior should be.
>
> And what should the behavior be when the netdev is removed from the
> kernel?

What would AF_PACKET do, return ENXIO?

> And thanks for looking into the code!
>
> Björn
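For reference, the driver-side reuse queue discussed above (option 2 for
Rx) could look roughly like the following. The `xsk_rq_*` names are
hypothetical and not the helpers from the RFC: buffers the hardware still
held at teardown are parked on a small internal queue, and that queue is
drained before new fill ring entries are consumed:

```c
#include <assert.h>
#include <stdint.h>

#define RQ_CAP 16

/* Per-ring reuse queue: a simple LIFO of UMEM offsets the driver
 * could not complete to user space when the HW ring went down. */
struct xsk_reuse_q {
	uint64_t addr[RQ_CAP];
	unsigned int len;
};

/* Park a buffer reclaimed from the hardware ring during teardown. */
static int xsk_rq_push(struct xsk_reuse_q *rq, uint64_t addr)
{
	if (rq->len == RQ_CAP)
		return -1;
	rq->addr[rq->len++] = addr;
	return 0;
}

/* On ring re-allocation, hand out parked buffers first, before
 * touching the fill ring, so nothing is leaked across reconfigs. */
static int xsk_rq_pop(struct xsk_reuse_q *rq, uint64_t *addr)
{
	if (!rq->len)
		return -1;
	*addr = rq->addr[--rq->len];
	return 0;
}
```

Whether this queue lives in the umem (as in the RFC) or in driver-private
structs, as suggested above, only changes who owns the pointer.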