From: "Samudrala, Sridhar"
To: Jonathan Lemon, Alexei Starovoitov
Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann, Network Development, bpf@vger.kernel.org, Jakub Kicinski
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
Date: Mon, 13 May 2019 16:30:58 -0700

On 5/13/2019 1:42 PM, Jonathan Lemon wrote:
> Tossing in my .02 cents:
>
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays. The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code and is
> up to the user to set this up.

AF_XDP with busy poll should allow a single thread to poll a given RX
queue and use a single core.
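For illustration, the user-space side of that could be as simple as the
loop sketched below (a sketch only, using libbpf's xsk.h helpers; the
socket/UMEM setup, FILL-ring handling, error checks, and the actual
busy-poll knobs proposed in this series are all omitted, and the names
here are just placeholders):

/* Sketch: one user thread, one AF_XDP socket, one core.  Assumes the
 * socket and its RX ring were already created with libbpf's xsk.h API.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <poll.h>
#include <bpf/xsk.h>

static void rx_loop(struct xsk_socket *xsk, struct xsk_ring_cons *rx, int cpu)
{
        struct pollfd pfd = { .fd = xsk_socket__fd(xsk), .events = POLLIN };
        cpu_set_t set;
        __u32 idx;

        /* One thread, one core: pin ourselves to the core owning the queue. */
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (;;) {
                /* With busy poll enabled on the socket, this is where the
                 * thread would spin on the device queue instead of sleeping. */
                poll(&pfd, 1, -1);

                size_t rcvd = xsk_ring_cons__peek(rx, 64, &idx);
                for (size_t i = 0; i < rcvd; i++) {
                        const struct xdp_desc *desc =
                                xsk_ring_cons__rx_desc(rx, idx + i);
                        /* process desc->addr / desc->len here */
                        (void)desc;
                }
                if (rcvd)
                        xsk_ring_cons__release(rx, rcvd);
        }
}

Everything - the syscall, the ring processing and, with busy poll, the
driver's RX processing - then stays on the one pinned core.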
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
>> I'm not saying that we shouldn't do busy-poll. I'm saying it's
>> complementary, but in all cases a single core per af_xdp rx queue
>> with user thread pinning is preferred.
>
> So I think we're on the same page here.
>
>> Stack rx queues and af_xdp rx queues should look almost the same from
>> the napi point of view. Stack -> normal napi in softirq. af_xdp -> new
>> kthread to work with both poll and busy-poll. The only difference
>> between poll and busy-poll will be the running context: new kthread
>> vs. user task.
> ...
>> A burst of 64 packets on stack queues or some other work in softirqd
>> will spike the latency for af_xdp queues if softirq is shared.
>
> True, but would it be shared? This goes back to the current model,
> which, as used by Intel, is:
>
>    (channel == RX, TX, softirq)
>
> MLX, on the other hand, wants:
>
>    (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
>
> Which would indeed lead to sharing. The more I look at the above, the
> stronger I start to dislike it. Perhaps this should be disallowed?
>
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
>
>> Hence the proposal for new napi_kthreads:
>> - user creates an af_xdp socket and binds it to _CPU_ X, then
>> - driver allocates a single af_xdp rx queue (queue ID doesn't need to
>>   be exposed)
>> - spawns a kthread pinned to cpu X
>> - configures the irq for that af_xdp queue to fire on cpu X
>> - user space, with the help of libbpf, pins its processing thread to
>>   that cpu X
>> - repeat the above for as many af_xdp sockets as there are cpus
>>   (it's also ok to pick the same cpu X for a different af_xdp socket;
>>   then the new kthread is shared)
>> - user space configures the hw to RSS to this set of af_xdp sockets.
>>   Since the ethtool api is a mess I propose to use an af_xdp api to do
>>   this rss config.
>
> From a high-level point of view, this sounds quite sensible, but it does
> need some details ironed out. The model above essentially enforces:
>
>    (af_xdp = RX.af_xdp + bound_cpu)
>    (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
>
> (temporarily ignoring TX for right now)
>
> I foresee two issues with the above approach:
>   1. hardware limitations in the number of queues/rings
>   2. RSS/steering rules
>
>> - user creates an af_xdp socket and binds it to _CPU_ X, then
>> - driver allocates a single af_xdp rx queue (queue ID doesn't need to
>>   be exposed)
>
> Here, the driver may not be able to create an arbitrary RQ, but may need
> to tear down/reuse an existing one used by the stack. This may not be an
> issue for modern hardware.
>
>> - user space configures the hw to RSS to this set of af_xdp sockets.
>>   Since the ethtool api is a mess I propose to use an af_xdp api to do
>>   this rss config.
>
> Currently, RSS only steers default traffic. On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
>
> This classification has to be done by the NIC, since it comes before RSS
> steering - which currently means sending flow match rules to the NIC,
> which is less than ideal. I agree that the ethtool interface is not
> optimal, but it does make clear to the user what's going on.

'tc' provides another interface to split NIC queues into groups of queues,
each with its own RSS. For example:

   tc qdisc add dev <ethX> root mqprio num_tc 3 map 0 1 2 queues 2@0 32@2 8@34 hw 1 mode channel

will split the NIC queues into 3 groups of 2, 32 and 8 queues. By default,
all packets go only to the first queue group with 2 queues. Filters can be
added to redirect packets to the other queue groups:

   tc filter add dev <ethX> protocol ip ingress prio 1 flower dst_ip 192.168.0.2 ip_proto tcp dst_port 1234 skip_sw hw_tc 1
   tc filter add dev <ethX> protocol ip ingress prio 1 flower dst_ip 192.168.0.3 ip_proto tcp dst_port 1234 skip_sw hw_tc 2

Here hw_tc indicates the queue group.

It should be possible to run AF_XDP on the third queue group by creating
8 af_xdp sockets and binding them to queues 34-41, as sketched below.

Does this look like a reasonable model to use a subset of NIC queues for
af_xdp applications?
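For illustration, creating those sockets with libbpf's xsk.h API could
look roughly like the sketch below (untested; "eth0", the UMEM sizes and
the one-UMEM-per-socket layout are placeholder choices, and FILL-ring
population, cleanup and most error handling are omitted):

/* Sketch: bind one AF_XDP socket per queue of the 8@34 group (34-41). */
#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>

#define QUEUE_BASE   34
#define NUM_SOCKETS  8
#define FRAME_SIZE   XSK_UMEM__DEFAULT_FRAME_SIZE
#define NUM_FRAMES   4096

struct xsk_ctx {
        struct xsk_umem *umem;
        struct xsk_ring_prod fq;
        struct xsk_ring_cons cq;
        struct xsk_socket *xsk;
        struct xsk_ring_cons rx;
        struct xsk_ring_prod tx;
};

static int create_group_sockets(struct xsk_ctx ctx[NUM_SOCKETS])
{
        for (int i = 0; i < NUM_SOCKETS; i++) {
                void *area;

                /* One UMEM per socket keeps each queue's buffers local
                 * to the core that processes it. */
                if (posix_memalign(&area, getpagesize(),
                                   NUM_FRAMES * FRAME_SIZE))
                        return -1;
                if (xsk_umem__create(&ctx[i].umem, area,
                                     NUM_FRAMES * FRAME_SIZE,
                                     &ctx[i].fq, &ctx[i].cq, NULL))
                        return -1;

                /* Bind to queue 34 + i, i.e. one queue of the hw_tc 2
                 * group configured above. */
                if (xsk_socket__create(&ctx[i].xsk, "eth0", QUEUE_BASE + i,
                                       ctx[i].umem, &ctx[i].rx, &ctx[i].tx,
                                       NULL))
                        return -1;
        }
        return 0;
}

Each socket would then be serviced by its own thread pinned to the core
handling that queue, along the lines of the receive loop sketched earlier.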
> Perhaps an af_xdp library that does some bookkeeping:
>   - open af_xdp socket
>   - define af_xdp_set as (classification, steering rules, other?)
>   - bind socket to (cpu, af_xdp_set)
>   - kernel:
>     - pins calling thread to cpu
>     - creates kthread if one doesn't exist, binds to irq and cpu
>     - has driver create RQ.af_xdp, possibly replacing RQ.stack
>     - applies (af_xdp_set) to NIC
>
> Seems workable, but a little complicated? The complexity could be moved
> into a separate library.
>
>> imo that would be the simplest and most performant way of using af_xdp.
>> All configuration apis are under libbpf (or libxdp if we choose to
>> fork it).
>> The end result is one af_xdp rx queue - one napi - one kthread - one
>> user thread, all pinned to the same cpu with the irq on that cpu.
>> Both poll and busy-poll approaches will not bounce data between cpus.
>> No 'shadow' queues to speak of, and this should solve the issues that
>> folks were bringing up in different threads.
>
> Sounds like a sensible model from my POV.
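FWIW, the two pieces of that end state that user space can already do
today - pinning the processing thread and steering the queue's irq to the
same CPU - look roughly like the sketch below (finding the irq number for
a given RX queue is driver-specific, e.g. via /proc/interrupts, and is not
shown; the irq affinity write needs root):

/* Sketch: pin the current thread and a queue's irq to the same CPU. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int pin_thread_and_irq(int cpu, int irq)
{
        cpu_set_t set;
        char path[64];
        FILE *f;

        /* user processing thread -> cpu */
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
                return -1;

        /* queue's irq -> same cpu */
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", cpu);
        fclose(f);
        return 0;
}

The per-queue kthread and the af_xdp-level RSS config would of course be
the new kernel/libbpf pieces proposed above.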