Subject: Re: AF_XDP design flaws
To:     Maxim Mikityanskiy <maximmi@mellanox.com>,
        "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
        =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= <bjorn.topel@intel.com>,
        Magnus Karlsson <magnus.karlsson@intel.com>,
        "David S. Miller" <davem@davemloft.net>
Cc:     Tariq Toukan <tariqt@mellanox.com>,
        Saeed Mahameed <saeedm@mellanox.com>,
        Eran Ben Elisha <eranbe@mellanox.com>
References: <AM6PR05MB5879DF6B2BD7DC426869875ED17B0@AM6PR05MB5879.eurprd05.prod.outlook.com>
From:   John Fastabend <john.fastabend@gmail.com>
Message-ID: <b48e282f-8405-8974-8b71-df15f4bda8ab@gmail.com>
Date:   Tue, 26 Feb 2019 08:41:19 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.2.1
MIME-Version: 1.0
In-Reply-To: <AM6PR05MB5879DF6B2BD7DC426869875ED17B0@AM6PR05MB5879.eurprd05.prod.outlook.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
> Hi everyone,
> 
> I would like to discuss some design flaws of AF_XDP socket (XSK) implementation
> in kernel. At the moment I don't see a way to work around them without changing
> the API, so I would like to make sure that I'm not missing anything and to
> suggest and discuss some possible improvements that can be made.
> 
> The issues I describe below are caused by the fact that the driver depends on
> the application doing some things, and if the application is
> slow/buggy/malicious, the driver is forced to busy poll because of the lack of a
> notification mechanism from the application side. I will refer to the i40e
> driver implementation a lot, as it is the first implementation of AF_XDP, but
> the issues are general and affect any driver. I already considered trying to fix
> it on driver level, but it doesn't seem possible, so it looks like the behavior
> and implementation of AF_XDP in the kernel has to be changed.
> 
> RX side busy polling
> ====================
> 
> On the RX side, the driver expects the application to put some descriptors in
> the Fill Ring. There is no way for the application to notify the driver that
> there are more Fill Ring descriptors to take, so the driver is forced to busy
> poll the Fill Ring if it gets empty. E.g., the i40e driver does it in NAPI poll:
> 
> int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> {
> ...
>                         failure = failure ||
>                                   !i40e_alloc_rx_buffers_fast_zc(rx_ring,
>                                                                  cleaned_count);
> ...
>         return failure ? budget : (int)total_rx_packets;
> }
> 
> Basically, it means that if there are no descriptors in the Fill Ring, NAPI will
> never stop, draining CPU.
> 
> Possible cases when it happens
> ------------------------------
> 
> 1. The application is slow: it received some frames in the RX Ring and is
> still handling the data, so it has no free frames to put into the Fill Ring.
> 
> 2. The application is malicious, it opens an XSK and puts no frames to the Fill
> Ring. It can be used as a local DoS attack.
> 
> 3. The application is buggy and stops filling the Fill Ring for whatever reason
> (deadlock, waiting for another blocking operation, other bugs).
> 
> Although loading an XDP program requires root access, the DoS attack can be
> targeted to setups that already use XDP, i.e. an XDP program is already loaded.
> Even under root, userspace applications should not be able to disrupt system
> stability by just calling normal APIs without an intention to destroy the
> system, and here it happens in case 1.

I believe this is by design. If packets per second is your top priority
(at the expense of power, CPU, etc.), this is the behavior you might
want. To address your points: if you don't trust the application, it
should be isolated to a queue pair and cores so the impact is known and
managed.

That said, having a flag to back off seems like a good idea. It should
be doable as a feature without breaking the current API.
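
To illustrate what I mean by a back-off flag, here is a toy user-space
model of the return-value logic in the i40e_clean_rx_irq_zc() snippet
quoted above. It is only a sketch, not kernel code: struct toy_rx_ring,
toy_rx_poll() and the backoff field are hypothetical names, and it
assumes something else re-arms polling once the Fill Ring is refilled.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_rx_ring {
	int hw_free;     /* hardware RX slots still waiting for a buffer */
	int fill_avail;  /* Fill Ring descriptors posted by the application */
	bool backoff;    /* hypothetical opt-in flag */
};

/*
 * Returns the value a NAPI poll would report. Returning budget asks to
 * be polled again; returning less lets NAPI complete.
 */
static int toy_rx_poll(struct toy_rx_ring *r, int budget, int total_rx_packets)
{
	/* Refill as many hardware slots as the Fill Ring allows. */
	int take = r->hw_free < r->fill_avail ? r->hw_free : r->fill_avail;
	bool failure;

	r->hw_free -= take;
	r->fill_avail -= take;
	failure = r->hw_free > 0;

	if (failure && !r->backoff)
		return budget;           /* behavior described above: keep polling */
	return total_rx_packets;         /* with the flag: report real work only */
}
```

With backoff clear and an empty Fill Ring, the model always returns
budget, which is the busy polling being discussed. With the flag set it
returns only the work done, so existing users would see no change
unless they opt in.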

> 
> Possible way to solve the issue
> -------------------------------
> 
> When the driver can't take new Fill Ring frames, it shouldn't busy poll.
> Instead, it signals the failure to the application (e.g., with POLLERR), and
> after that it's up to the application to restart polling (e.g., by calling
> sendto()) after refilling the Fill Ring. The issue with this approach is that it
> changes the API, so we either have to deal with it or to introduce some API
> version field.

See above. I like the idea here though.

> 
> TX side getting stuck
> =====================
> 
> On the TX side, there is the Completion Ring that the application has to clean.
> If it doesn't, the i40e driver stops taking descriptors from the TX Ring. If the
> application finally completes something, the driver can go on transmitting.
> However, it would require busy polling the Completion Ring (just like with the
> Fill Ring on the RX side). i40e doesn't do it, instead, it relies on the
> application to kick the TX by calling sendto(). The issue is that poll() doesn't
> return POLLOUT in this case, because the TX Ring is full, so the application
> will never call sendto(), and the ring is stuck forever (or at least until
> something triggers NAPI).
> 
> Possible way to solve the issue
> -------------------------------
> 
> When the driver can't reserve a descriptor in the Completion Ring, it should
> signal the failure to the application (e.g., with POLLERR). The application
> shouldn't call sendto() every time it sees that the number of not completed
> frames is greater than zero (like xdpsock sample does). Instead, the application
> should kick the TX only when it wants to flush the ring, and, in addition, after
> resolving the cause for POLLERR, i.e. after handling Completion Ring entries.
> The API will also have to change with this approach.
> 

+1. Seems to me this can be done without breaking the existing API, though.
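
The application-side rule you describe could be sketched roughly as
follows. This assumes the proposed POLLERR signalling, which is not what
the kernel does today; toy_should_kick_tx() and its parameters are
made-up names for illustration only.

```c
#include <poll.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether the application should kick TX
 * (e.g. by calling sendto()) on this iteration.
 *
 * revents    - events reported by poll() for the XSK fd
 * want_flush - the application has queued frames it wants sent now
 * reaped     - Completion Ring entries the application just handled
 */
static bool toy_should_kick_tx(short revents, bool want_flush,
			       unsigned int reaped)
{
	/*
	 * Assumed meaning of POLLERR: the driver could not reserve a
	 * Completion Ring slot. Kicking only helps once space was freed.
	 */
	if (revents & POLLERR)
		return reaped > 0;

	/* Otherwise kick only when the application wants to flush. */
	return want_flush;
}
```

The point is that the kick is tied to an explicit flush or to recovery
from the error condition, not to "outstanding frames > 0".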

> Triggering NAPI on a different CPU core
> =======================================
> 
> .ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU affinity,
> i40e triggers an interrupt to schedule NAPI, instead of calling napi_schedule
> directly. Scheduling NAPI on the correct CPU is what every driver would do, I
> guess, but currently it has to be implemented differently in every driver, and
> it relies on hardware features (the ability to trigger an IRQ).

Ideally the application would be pinned to a core and the traffic
steered to that core using ethtool/tc. Yes, it requires a bit of
management on the platform, but I think this is needed for the best pps
numbers.
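
For the pinning half of this, a minimal sketch using
sched_setaffinity(2) is below. pin_to_cpu() is just an illustrative
helper name, and steering the traffic to the same core with ethtool/tc
is a separate step that is not shown here.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error with errno set.
 */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* pid 0 means "the calling thread". */
	return sched_setaffinity(0, sizeof(set), &set);
}
```

The application would call this early, before it starts servicing the
rings, with the CPU that handles the queue's interrupts.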

> 
> I suggest introducing a kernel API that would allow triggering NAPI on a given
> CPU. A brief look shows that something like smp_call_function_single_async can
> be used. Advantages:

Assuming you want to avoid pinning cores/traffic for some reason, could
this be done with some combination of cpumap + AF_XDP? I haven't thought
too much about it, though. Maybe Bjorn has some ideas.

> 
> 1. It lifts the hardware requirement to be able to raise an interrupt on demand.
> 
> 2. It would allow moving common code to the kernel (.ndo_xsk_async_xmit).
> 
> 3. It is also useful in the situation where CPU affinity changes while being in
> NAPI poll. Currently, i40e and mlx5e try to stop NAPI polling by returning
> a value less than budget if CPU affinity changes. However, there are cases
> (e.g., NAPIF_STATE_MISSED) when NAPI will be rescheduled on a wrong CPU. It's a
> race between the interrupt, which will move NAPI to the correct CPU, and
> __napi_schedule from a wrong CPU. Having an API to schedule NAPI on a given CPU
> will benefit both mlx5e and i40e, because when this situation happens, it kills
> the performance.

How would we know what core to trigger NAPI on?

> 
> I would be happy to hear your thoughts about these issues.

At least the first two observations seem incrementally solvable to me
and would be nice improvements. I assume the last case can be resolved
by pinning cores/traffic, but for the general case perhaps the proposed
API is useful.

> 
> Thanks,
> Max
>