From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= Subject: Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support Date: Thu, 30 Aug 2018 14:06:24 +0200 Message-ID: <65684cea-76bc-2f2d-600f-402f8504301d@intel.com> References: <20180828124435.30578-1-bjorn.topel@gmail.com> <20180828124435.30578-9-bjorn.topel@gmail.com> <20180829211428.6cce4a1b@cakuba.netronome.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: magnus.karlsson@intel.com, magnus.karlsson@gmail.com, alexander.h.duyck@intel.com, alexander.duyck@gmail.com, ast@kernel.org, brouer@redhat.com, daniel@iogearbox.net, netdev@vger.kernel.org, jesse.brandeburg@intel.com, anjali.singhai@intel.com, peter.waskiewicz.jr@intel.com, michael.lundkvist@ericsson.com, willemdebruijn.kernel@gmail.com, john.fastabend@gmail.com, neerav.parikh@intel.com, mykyta.iziumtsev@linaro.org, francois.ozog@linaro.org, ilias.apalodimas@linaro.org, brian.brooks@linaro.org, u9012063@gmail.com, pavel@fastnetmon.com, qi.z.zhang@intel.com To: Jakub Kicinski , =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= Return-path: Received: from mga17.intel.com ([192.55.52.151]:44319 "EHLO mga17.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728321AbeH3QIS (ORCPT ); Thu, 30 Aug 2018 12:08:18 -0400 In-Reply-To: <20180829211428.6cce4a1b@cakuba.netronome.com> Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 2018-08-29 21:14, Jakub Kicinski wrote: > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote: >> From: Björn Töpel >> >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain >> queue. >> >> All AF_XDP specific functions are added to a new file, i40e_xsk.c. >> >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS >> will allocate a new buffer and copy the zero-copy frame prior passing >> it to the kernel stack. >> >> Signed-off-by: Björn Töpel > > Mm.. I'm surprised you don't run into buffer reuse issues that I had > when playing with AF_XDP. What happens in i40e if someone downs the > interface? Will UMEMs get destroyed? Will the RX buffers get freed? > The UMEM will linger in the driver until the sockets are dead. > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue, > FWIW. > Some background for folks that don't know the details: A zero-copy capable driver picks buffers off the fill ring and places them on the hardware Rx ring to be completed at a later point when DMA is complete. Analogous for the Tx side; The driver picks buffers off the Tx ring and places them on the Tx hardware ring. In the typical flow, the Rx buffer will be placed onto an Rx ring (completed to the user), and the Tx buffer will be placed on the completion ring to notify the user that the transfer is done. However, if the driver needs to tear down the hardware rings for some reason (interface goes down, reconfiguration and such), what should be done with the Rx and Tx buffers that has been given to the driver? So, to frame the problem: What should a driver do when this happens, so that buffers aren't leaked? Note that when the UMEM is going down, there's no need to complete anything, since the sockets are dying/dead already. This is, as you state, a missing piece in the implementation and needs to be fixed. Now on to possible solutions: 1. Complete the buffers back to the user. For Tx, this is probably the best way -- just place the buffers onto the completion ring. For Rx, we can give buffers back to user space by setting the length in the Rx descriptor to zero And putting them on the Rx ring. However, one complication here is that we do not have any back-pressure mechanism for the Rx side like we have on Tx. If the Rx ring(s) is (are) full the kernel will have to leak them or implement a retry mechanism (ugly and should be avoided). Another option to solve this without needing any retry or leaking for Rx is to implement the same back-pressure mechanism that we have on the Tx path in the Rx path. In the Tx path, the driver will only get a Tx packet to send if there is space for it in the completion ring. On Rx, this would be that the driver would only get a buffer from the fill ring if there is space for it to put it on the Rx ring. The drawback of this is that it would likely impact performance negatively since the Rx ring would have to be touch one more time (in the Tx path, it increased performance since it made it possible to implement the Tx path without any buffering), but it would guarantee that all buffers can always be returned to user space making solution this a viable option. 2. Store the buffers internally in the driver, and make sure that they are inserted into the "normal flow" again. For Rx that would be putting the buffers back into the allocation scheme that the driver is using. For Tx, placing the buffers back onto the Tx HW ring (plus all the logic for making sure that all corner cases work). 3. Mark the socket(s) as in error state, en require the user to redo the setup. This is bit harsh... For i40e I think #2 for Rx (buffers reside in kernel, return to allocator) and #1 for Tx (complete to userland). Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan of extending the umem with the "reuse queue". This decision is really up the driver. Some driver might prefer another scheme, or simply prefer storing the buffers in another manner. Looking forward, as both you and Jesper has alluded to, we need a proper allocator for zero-copy. Then it would be a matter of injecting the Rx buffers back to the allocator. I'll send out a patch, so that we don't leak the buffers, but I'd like to hear your thoughts on what the behavior should be. And what should the behavior be when the netdev is removed from the kernel? And thanks for looking into the code! Björn