From: Björn Töpel
Date: Tue, 5 Mar 2019 19:26:14 +0100
Subject: Re: AF_XDP design flaws
To: Maxim Mikityanskiy
Cc: Jonathan Lemon, John Fastabend, netdev@vger.kernel.org, Björn Töpel, Magnus Karlsson, "David S. Miller", Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

On Thu, 28 Feb 2019 at 11:50, Maxim Mikityanskiy wrote:
> [...]

Back in the saddle! Sorry for the delay!
Ok, let me try to summarize. First, let's go through the current
AF_XDP semantics so that we're all on the same page, and then pull
Max' suggestions in.

Ingress
-------

The simplified flow is:

1. Userland puts buffers on the fill ring
2. The fill ring is dequeued by the kernel
3. The kernel places the received buffer on the socket Rx ring

If step 2 doesn't get a buffer, no feedback (other than a driver-level
counter) is provided to userland. What retry policy the driver should
use is up to the driver implementation. The i40e driver busy-polls,
which, as Max points out, means it will spend a lot of time in napi
without a proper back-off mechanism.

If the Rx ring is full, so that step 3 fails, the packet is dropped
and no feedback (other than a counter) is provided to userland.

Egress
------

1. Userland puts buffer(s) on the Tx ring
2. Userland calls sendto
3. The Tx ring is dequeued by the kernel
4. The kernel enqueues the buffer on the completion ring

Again, little or no feedback is provided to userland. If the
completion ring is full, no packets are sent. Further, if the napi is
running, the Tx ring will potentially be drained *without* calling
sendto, so it's really up to the userland application to determine
when to call sendto. Also, if the napi is running and the driver
cannot drain the Tx ring (completion ring full or HW full), i40e will
busy-poll to get the packets out. Again, as Max points out, this will
make the kernel spend a lot of time in napi context.

The kernel "kick" on egress via sendto is something we'd like to make
optional, such that the egress side works just like the Rx side: four
rings per socket that the user fills (fill ring/Tx ring) and drains
(Rx ring/completion ring) without any syscalls at all. Again, this is
doable with kernel-side napi threads. The API is throughput oriented,
hence the current design.

Now, onto Max' concerns, from my perspective:

1. The kernel spins too much in napi mode

Yes, the i40e driver does spin for throughput and latency reasons. I
agree that we should add a back-off mechanism. I would prefer *not*
adding this to the AF_XDP uapi, but having it as a driver knob.
Another idea would be to move to a napi thread similar to what Paolo
Abeni suggested in [1], and let the scheduler deal with the fairness
issue.

2. No/little error feedback to userland

Max would like a mode where feedback such as "fill ring has run dry",
"completion queue is full" or "HW queue full" is returned to userland
via the poll() syscall. In this mode, Max suggests that sendto()
should return an error if not all packets in the Tx ring can be sent.
Further, the kernel should be kicked when items have been placed on
the fill ring.

Again, all good and valid points! I think we can address this with the
upcoming busy-poll support. In busy-poll mode (which will be a new
AF_XDP bind option), the napi will be executed in the poll() context.
Ingress would be:

1. Userland puts buffers on the fill ring
2. Call poll(), and from the poll context:
   a. The fill ring is dequeued by the kernel
   b. The kernel places the received buffer on the socket Rx ring

If a. fails, poll() will return POLLERR, and userland can act on it.
Ditto for egress: poll() will return POLLERR if the completion ring
has fewer free entries than the Tx ring.

So, we're addressing your concerns with the busy-poll mode, and
leaving the throughput/non-busy-poll API as it is today. What do you
think about that, Max? Would that be a path forward for Mellanox --
i.e. implementing the busy-poll mode alongside the current API?
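To make the flows above a bit more concrete, here's a rough userland
sketch of how the proposed busy-poll mode could look from the
application's side. The xsk_fd variable and the xsk_*/process_frames()
helpers are hypothetical placeholders for the mmap'ed producer/consumer
ring accesses -- they are not an existing API; only poll(), sendto()
and the POLL*/MSG_* flags are real:

    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>

    #define BATCH 64

    /* Hypothetical wrappers around the mmap'ed AF_XDP rings. */
    extern int xsk_fd;                            /* AF_XDP socket fd */
    extern unsigned int xsk_free_frames(uint64_t *addrs, unsigned int n);
    extern unsigned int xsk_fill_enqueue(const uint64_t *addrs,
                                         unsigned int n);
    extern unsigned int xsk_rx_dequeue(uint64_t *addrs, uint32_t *lens,
                                       unsigned int n);
    extern void process_frames(const uint64_t *addrs, const uint32_t *lens,
                               unsigned int n);

    static void rx_loop_busy_poll(void)
    {
            uint64_t addrs[BATCH];
            uint32_t lens[BATCH];
            struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

            for (;;) {
                    /* 1. Userland puts buffers on the fill ring. */
                    unsigned int n = xsk_free_frames(addrs, BATCH);
                    xsk_fill_enqueue(addrs, n);

                    /*
                     * 2. In busy-poll mode the napi runs in the poll()
                     *    context: the fill ring is dequeued and received
                     *    frames are placed on the Rx ring from here.
                     */
                    if (poll(&pfd, 1, -1) < 0) {
                            perror("poll");
                            return;
                    }

                    /*
                     * Explicit error feedback instead of a silent
                     * driver counter, e.g. "the fill ring ran dry".
                     */
                    if (pfd.revents & POLLERR)
                            fprintf(stderr, "ring error reported by poll()\n");

                    /* 3. Drain the Rx ring. */
                    n = xsk_rx_dequeue(addrs, lens, BATCH);
                    process_frames(addrs, lens, n);
            }
    }

    /*
     * Egress: today the sendto() "kick" is mandatory; under Max'
     * suggestion its return value (and POLLERR) would carry the
     * "completion ring full"/"HW full" feedback.
     */
    static int tx_kick(void)
    {
            /* Frames are assumed to already sit on the Tx ring. */
            if (sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0) < 0) {
                    perror("sendto");
                    return -1;
            }
            return 0;
    }

In the current throughput mode the same Rx loop would simply skip the
poll() call and rely on the driver-side napi to move buffers between
the rings.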
3. Introduce an API to schedule a napi on a certain core

I think this is outside the AF_XDP scope (given my points above). This
is mainly kernel internals, and I have no strong opinions/thoughts
here. As long as you guys are hacking AF_XDP, I'm happy. :-P

Finally, yes, we need to work on the documentation! Patches are
welcome! ;-)

Max, thanks for the input and for looking into this! Very much
appreciated!

Cheers,
Björn

[1] https://lwn.net/Articles/686985/