From: Björn Töpel
Date: Wed, 27 Feb 2019 09:08:43 +0100
Subject: Re: AF_XDP design flaws
To: John Fastabend
Cc: Maxim Mikityanskiy, netdev@vger.kernel.org, Björn Töpel,
 Magnus Karlsson, "David S. Miller", Tariq Toukan, Saeed Mahameed,
 Eran Ben Elisha
List-ID: netdev@vger.kernel.org

On 2019-02-26 17:41, John Fastabend wrote:
> On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
>> Hi everyone,
>>

Hi! It's exciting to see more vendors looking into AF_XDP. ARM was
involved early on, and now Mellanox. :-) It's really important that
the AF_XDP model is a good fit for all drivers/vendors! Thanks for
looking into this.

>> I would like to discuss some design flaws of AF_XDP socket (XSK) implementation
>> in kernel.
>> At the moment I don't see a way to work around them without changing
>> the API, so I would like to make sure that I'm not missing anything and to
>> suggest and discuss some possible improvements that can be made.
>>
>> The issues I describe below are caused by the fact that the driver depends on
>> the application doing some things, and if the application is
>> slow/buggy/malicious, the driver is forced to busy poll because of the lack of
>> a notification mechanism from the application side. I will refer to the i40e
>> driver implementation a lot, as it is the first implementation of AF_XDP, but
>> the issues are general and affect any driver. I already considered trying to
>> fix it on the driver level, but it doesn't seem possible, so it looks like the
>> behavior and implementation of AF_XDP in the kernel have to be changed.
>>
>> RX side busy polling
>> ====================
>>
>> On the RX side, the driver expects the application to put some descriptors in
>> the Fill Ring. There is no way for the application to notify the driver that
>> there are more Fill Ring descriptors to take, so the driver is forced to busy
>> poll the Fill Ring if it gets empty. E.g., the i40e driver does it in NAPI poll:
>>
>> int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>> {
>>         ...
>>         failure = failure ||
>>                   !i40e_alloc_rx_buffers_fast_zc(rx_ring,
>>                                                  cleaned_count);
>>         ...
>>         return failure ? budget : (int)total_rx_packets;
>> }
>>
>> Basically, it means that if there are no descriptors in the Fill Ring, NAPI
>> will never stop, draining the CPU.
>>
>> Possible cases when it happens
>> ------------------------------
>>
>> 1. The application is slow: it received some frames in the RX Ring, and it is
>> still handling the data, so it has no free frames to put to the Fill Ring.
>>
>> 2. The application is malicious: it opens an XSK and puts no frames to the Fill
>> Ring.
>> It can be used as a local DoS attack.
>>
>> 3. The application is buggy and stops filling the Fill Ring for whatever
>> reason (deadlock, waiting for another blocking operation, other bugs).
>>
>> Although loading an XDP program requires root access, the DoS attack can be
>> targeted to setups that already use XDP, i.e. an XDP program is already loaded.
>> Even under root, userspace applications should not be able to disrupt system
>> stability by just calling normal APIs without an intention to destroy the
>> system, and here it happens in case 1.

> I believe this is by design. If packets per second is your top priority
> (at the expense of power, cpu, etc.) this is the behavior you might
> want. To resolve your points if you don't trust the application it
> should be isolated to a queue pair and cores so the impact is known and
> managed.
>

Correct, and I agree with John here. AF_XDP is a raw interface, and
needs to be managed.

> That said having a flag to back-off seems like a good idea. But should
> be doable as a feature without breaking current API.
>
>> Possible way to solve the issue
>> -------------------------------
>>
>> When the driver can't take new Fill Ring frames, it shouldn't busy poll.
>> Instead, it signals the failure to the application (e.g., with POLLERR), and
>> after that it's up to the application to restart polling (e.g., by calling
>> sendto()) after refilling the Fill Ring. The issue with this approach is that
>> it changes the API, so we either have to deal with it or to introduce some API
>> version field.

> See above. I like the idea here though.
>

The uapi doesn't mandate *how* the driver should behave. The rationale
for the aggressive i40e fill ring polling (when the ring is empty) is
to minimize the latency. The "fill ring is empty" behavior might be
different on different drivers.
I see a number of different alternatives to the current i40e way of
busy-polling, all ranging from watchdog timers to actual HW
implementations where writing to the tail pointer of the fill ring
kicks a doorbell.

That being said, I agree with previous speakers; being able to set a
policy/driver behavior from userland is a good idea. This can be added
without breaking the existing model. Max, any input on what this
interface should look like from your perspective?

>> TX side getting stuck
>> =====================
>>
>> On the TX side, there is the Completion Ring that the application has to
>> clean. If it doesn't, the i40e driver stops taking descriptors from the TX
>> Ring. If the application finally completes something, the driver can go on
>> transmitting. However, it would require busy polling the Completion Ring (just
>> like with the Fill Ring on the RX side). i40e doesn't do it; instead, it relies
>> on the application to kick the TX by calling sendto(). The issue is that poll()
>> doesn't return POLLOUT in this case, because the TX Ring is full, so the
>> application will never call sendto(), and the ring is stuck forever (or at
>> least until something triggers NAPI).
>>
>> Possible way to solve the issue
>> -------------------------------
>>
>> When the driver can't reserve a descriptor in the Completion Ring, it should
>> signal the failure to the application (e.g., with POLLERR). The application
>> shouldn't call sendto() every time it sees that the number of not completed
>> frames is greater than zero (like the xdpsock sample does). Instead, the
>> application should kick the TX only when it wants to flush the ring, and, in
>> addition, after resolving the cause for POLLERR, i.e. after handling Completion
>> Ring entries. The API will also have to change with this approach.
>>

> +1 seems to me this can be done without breaking existing API though.
>

Keep in mind that the xdpsock application is a sample/toy, and should
not be seen as the API documentation. I don't see that the API needs
to be changed. AF_XDP requires that the fill ring has entries in order
to receive data. Analogously, the completion ring needs to be managed
by the application to send frames. Are you suggesting that using
poll() et al should be mandatory?

Again, take me through your send scenario, step by step. To me, the
userland application has all the knobs to determine whether the Tx
ring can be flushed or not.

>> Triggering NAPI on a different CPU core
>> =======================================
>>
>> .ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU affinity,
>> i40e triggers an interrupt to schedule NAPI, instead of calling napi_schedule
>> directly. Scheduling NAPI on the correct CPU is what every driver would do, I
>> guess, but currently it has to be implemented differently in every driver, and
>> it relies on hardware features (the ability to trigger an IRQ).

> Ideally the application would be pinned to a core and the traffic
> steered to that core using ethtool/tc. Yes it requires a bit of mgmt on
> the platform but I think this is needed for best pps numbers.
>
>> I suggest introducing a kernel API that would allow triggering NAPI on a given
>> CPU. A brief look shows that something like smp_call_function_single_async can
>> be used. Advantages:

> Assuming you want to avoid pinning cores/traffic for some reason. Could
> this be done with some combination of cpumap + af_xdp? I haven't thought
> too much about it though. Maybe Bjorn has some ideas.
>

For the Intel drivers, the NAPI is associated with a certain
interrupt. This does not change with AF_XDP, so ndo_xsk_async_xmit
simply kicks the NAPI to run on the CPU defined by the IRQ affinity.
I'm all open for additional kernel APIs, and I'm happy to hack the
i40e internals as well. Again, it must be really simple to add
XDP/AF_XDP-ZC support for other vendors.

Ideally, what I would like is a generic way of associating netdev
queues (think ethtool channels) with NAPI contexts. E.g. "queues 1,2,3
on NAPI foo, and I,II,III on NAPI bar, and the NAPIs should run on the
baz cgroup". This is a much bigger task and outside the AF_XDP scope.
It ties in to the whole "threaded NAPIs and netdev queues as a first
class kernel object" discussion. ;-)

Magnus is working on proper busy-poll() (i.e. the poll() syscall
executes the NAPI context). With that solution the user application
can select which core it wants to execute poll() on -- and the NAPI
will be executed from the poll().

>> 1. It lifts the hardware requirement to be able to raise an interrupt on
>> demand.
>>
>> 2. It would allow to move common code to the kernel (.ndo_xsk_async_xmit).
>>
>> 3. It is also useful in the situation where CPU affinity changes while being
>> in NAPI poll. Currently, i40e and mlx5e try to stop NAPI polling by returning
>> a value less than budget if CPU affinity changes. However, there are cases
>> (e.g., NAPIF_STATE_MISSED) when NAPI will be rescheduled on a wrong CPU. It's
>> a race between the interrupt, which will move NAPI to the correct CPU, and
>> __napi_schedule from a wrong CPU. Having an API to schedule NAPI on a given
>> CPU will benefit both mlx5e and i40e, because when this situation happens, it
>> kills the performance.

> How would we know what core to trigger NAPI on?
>
>> I would be happy to hear your thoughts about these issues.

> At least the first two observations seem incrementally solvable to me
> and would be nice improvements. I assume the last case can be resolved
> by pinning cores/traffic but for the general case perhaps it is useful.
>

Good summary, John.
Really good input Max, and addressable -- but API changes, as far as I
see it, are not required.

(I'm out-of-office this week, skiing, so I'll be away from the lists
until Monday!)

Cheers,
Björn

>> Thanks,
>> Max
>>