From: Maxim Mikityanskiy <maximmi@mellanox.com>
To: netdev@vger.kernel.org, Björn Töpel, Magnus Karlsson, "David S. Miller"
CC: Tariq Toukan, Saeed Mahameed, Eran Ben Elisha
Subject: AF_XDP design flaws
Date: Tue, 26 Feb 2019 14:49:01 +0000
Hi everyone,

I would like to discuss some design flaws of the AF_XDP socket (XSK) implementation in the kernel. At the moment I don't see a way to work around them without changing the API, so I would like to make sure that I'm not missing anything, and to suggest and discuss some possible improvements.

The issues I describe below are caused by the fact that the driver depends on the application doing certain things, and if the application is slow, buggy or malicious, the driver is forced to busy poll because there is no notification mechanism from the application side. I will refer to the i40e driver implementation a lot, as it is the first implementation of AF_XDP, but the issues are general and affect any driver. I already considered trying to fix it at the driver level, but it doesn't seem possible, so it looks like the behavior and implementation of AF_XDP in the kernel have to be changed.

RX side busy polling
====================

On the RX side, the driver expects the application to put some descriptors into the Fill Ring. There is no way for the application to notify the driver that there are more Fill Ring descriptors to take, so the driver is forced to busy poll the Fill Ring if it gets empty. E.g., the i40e driver does it in NAPI poll:

int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
{
...
	failure = failure ||
		  !i40e_alloc_rx_buffers_fast_zc(rx_ring, cleaned_count);
...
	return failure ? budget : (int)total_rx_packets;
}

Basically, it means that if there are no descriptors in the Fill Ring, NAPI will never stop, draining the CPU.

Possible cases when it happens
------------------------------

1. The application is slow: it received some frames in the RX Ring and is still handling the data, so it has no free frames to put into the Fill Ring.

2. The application is malicious: it opens an XSK and puts no frames into the Fill Ring. This can be used as a local DoS attack.

3. The application is buggy and stops filling the Fill Ring for whatever reason (deadlock, waiting on another blocking operation, other bugs).

Although loading an XDP program requires root access, the DoS attack can be targeted at setups that already use XDP, i.e. where an XDP program is already loaded. Even under root, userspace applications should not be able to disrupt system stability just by calling normal APIs without an intention to destroy the system, and here it happens in case 1.

Possible way to solve the issue
-------------------------------

When the driver can't take new Fill Ring frames, it shouldn't busy poll. Instead, it signals the failure to the application (e.g., with POLLERR), and after that it's up to the application to restart polling (e.g., by calling sendto()) after refilling the Fill Ring. The issue with this approach is that it changes the API, so we either have to deal with that or introduce some API version field. (A rough sketch of the resulting application loop, covering both this and the TX case, follows the TX section below.)

TX side getting stuck
=====================

On the TX side, there is the Completion Ring that the application has to clean. If it doesn't, the i40e driver stops taking descriptors from the TX Ring. If the application finally completes something, the driver can go on transmitting. However, that would require busy polling the Completion Ring (just like with the Fill Ring on the RX side). i40e doesn't do it; instead, it relies on the application to kick the TX by calling sendto(). The issue is that poll() doesn't return POLLOUT in this case, because the TX Ring is full, so the application will never call sendto(), and the ring is stuck forever (or at least until something else triggers NAPI).

Possible way to solve the issue
-------------------------------

When the driver can't reserve a descriptor in the Completion Ring, it should signal the failure to the application (e.g., with POLLERR). The application shouldn't call sendto() every time it sees that the number of uncompleted frames is greater than zero (like the xdpsock sample does). Instead, the application should kick the TX only when it wants to flush the ring and, in addition, after resolving the cause of POLLERR, i.e. after handling Completion Ring entries. The API will also have to change with this approach.
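To make the proposed semantics concrete, here is a minimal userspace sketch of how the application loop might look if the kernel signaled both conditions with POLLERR. This assumes the API changes described above (they are not current kernel behavior); refill_fill_ring(), clean_completion_ring(), process_rx_ring() and tx_ring_has_unflushed() are hypothetical application helpers operating on the umem rings, made up for illustration:

#include <poll.h>
#include <sys/socket.h>

/* Hypothetical application helpers (not an existing library API). */
void refill_fill_ring(void);      /* put free frames into the Fill Ring */
void clean_completion_ring(void); /* reclaim frames from the Completion Ring */
void process_rx_ring(void);       /* consume RX Ring descriptors */
int  tx_ring_has_unflushed(void); /* application wants to flush the TX Ring */

static void xsk_loop(int xsk_fd)
{
	struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN | POLLOUT };

	for (;;) {
		if (poll(&pfd, 1, -1) <= 0)
			continue;

		if (pfd.revents & POLLERR) {
			/* The driver stopped because the Fill Ring was empty
			 * or the Completion Ring was full: resolve the cause
			 * first, then kick the driver to resume. */
			refill_fill_ring();
			clean_completion_ring();
			sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
		}

		if (pfd.revents & POLLIN)
			process_rx_ring();

		/* Kick TX only when the application actually wants to flush,
		 * not on every iteration that has outstanding completions. */
		if ((pfd.revents & POLLOUT) && tx_ring_has_unflushed())
			sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
	}
}

The point of the sketch is that the application never spins: it sleeps in poll() until the kernel reports either traffic or a stalled ring, and the driver never busy polls on its behalf.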
Triggering NAPI on a different CPU core
=======================================

.ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU affinity, i40e triggers an interrupt to schedule NAPI, instead of calling napi_schedule directly. Scheduling NAPI on the correct CPU is what every driver would do, I guess, but currently it has to be implemented differently in every driver, and it relies on hardware features (the ability to trigger an IRQ on demand).

I suggest introducing a kernel API that would allow triggering NAPI on a given CPU. A brief look shows that something like smp_call_function_single_async can be used. Advantages:

1. It lifts the hardware requirement to be able to raise an interrupt on demand.

2. It would allow moving common code into the kernel (.ndo_xsk_async_xmit).

3. It is also useful in the situation where CPU affinity changes while in NAPI poll. Currently, i40e and mlx5e try to stop NAPI polling by returning a value less than budget if the CPU affinity changes. However, there are cases (e.g., NAPIF_STATE_MISSED) when NAPI will be rescheduled on the wrong CPU. It's a race between the interrupt, which will move NAPI to the correct CPU, and __napi_schedule from the wrong CPU. Having an API to schedule NAPI on a given CPU will benefit both mlx5e and i40e, because when this situation happens, it kills performance.

I would be happy to hear your thoughts about these issues.

Thanks,
Max
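For reference, a rough sketch of what such a helper might look like, assuming smp_call_function_single_async and a driver-owned call_single_data_t per queue; napi_remote_schedule and trigger_napi_on_cpu are made-up names for illustration, not an existing kernel API:

#include <linux/smp.h>
#include <linux/netdevice.h>

/* Runs in IPI context on the target CPU and schedules the NAPI there. */
static void napi_remote_schedule(void *info)
{
	napi_schedule((struct napi_struct *)info);
}

/*
 * Hypothetical helper: schedule NAPI on a given CPU, e.g. from
 * .ndo_xsk_async_xmit, which may be called on any CPU.  The csd must be
 * owned by the caller (e.g. one per queue); the call returns -EBUSY if
 * the previous IPI for this csd has not run yet.
 */
static void trigger_napi_on_cpu(struct napi_struct *napi,
				call_single_data_t *csd, int cpu)
{
	csd->func = napi_remote_schedule;
	csd->info = napi;
	smp_call_function_single_async(cpu, csd);
}

With something like this in the core, drivers would not need an on-demand IRQ just to move NAPI to the right core, and the common part of .ndo_xsk_async_xmit could be shared.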