Date: Mon, 18 Feb 2019 12:21:32 +0100
From: Maciej Fijalkowski
To: Magnus Karlsson
Cc: Daniel Borkmann, Magnus Karlsson, Björn Töpel, ast@kernel.org, Network Development, Jakub Kicinski, Björn Töpel, "Zhang, Qi Z", Jesper Dangaard Brouer, xiaolong.ye@intel.com
Subject: Re: [PATCH bpf-next v4 1/2] libbpf: add support for using AF_XDP sockets
Message-ID: <20190218122055.00007937@gmail.com>
References: <1549631126-29067-1-git-send-email-magnus.karlsson@intel.com> <1549631126-29067-2-git-send-email-magnus.karlsson@intel.com>
X-Mailing-List: netdev@vger.kernel.org

On Mon, 18 Feb 2019 09:59:30 +0100 Magnus
Karlsson wrote:

> On Fri, Feb 15, 2019 at 6:09 PM Daniel Borkmann wrote:
> >
> > On 02/08/2019 02:05 PM, Magnus Karlsson wrote:
> > > This commit adds AF_XDP support to libbpf. The main reason for this is
> > > to facilitate writing applications that use AF_XDP by offering
> > > higher-level APIs that hide many of the details of the AF_XDP
> > > uapi. This is in the same vein as libbpf facilitates XDP adoption by
> > > offering easy-to-use higher level interfaces of XDP
> > > functionality. Hopefully this will facilitate adoption of AF_XDP, make
> > > applications using it simpler and smaller, and finally also make it
> > > possible for applications to benefit from optimizations in the AF_XDP
> > > user space access code. Previously, people just copied and pasted the
> > > code from the sample application into their application, which is not
> > > desirable.
> > >
> > > The interface is composed of two parts:
> > >
> > > * Low-level access interface to the four rings and the packet
> > > * High-level control plane interface for creating and setting
> > >   up umems and af_xdp sockets as well as a simple XDP program.
> > >
> > > Tested-by: Björn Töpel
> > > Signed-off-by: Magnus Karlsson
> > [...]
> > > diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> > > new file mode 100644
> > > index 0000000..a982a76
> > > --- /dev/null
> > > +++ b/tools/lib/bpf/xsk.c
> > > @@ -0,0 +1,742 @@
> > > +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
> > > +
> > > +/*
> > > + * AF_XDP user-space access library.
> > > + *
> > > + * Copyright(c) 2018 - 2019 Intel Corporation.
> > > + *
> > > + * Author(s): Magnus Karlsson
> > > + */
> > > +
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +#include
> > > +
> > > +#include "bpf.h"
> > > +#include "libbpf.h"
> > > +#include "libbpf_util.h"
> > > +#include "nlattr.h"
> > > +#include "xsk.h"
> > > +
> > > +#ifndef SOL_XDP
> > > + #define SOL_XDP 283
> > > +#endif
> > > +
> > > +#ifndef AF_XDP
> > > + #define AF_XDP 44
> > > +#endif
> > > +
> > > +#ifndef PF_XDP
> > > + #define PF_XDP AF_XDP
> > > +#endif
> > > +
> > > +struct xsk_umem {
> > > +	struct xsk_ring_prod *fill;
> > > +	struct xsk_ring_cons *comp;
> > > +	char *umem_area;
> > > +	struct xsk_umem_config config;
> > > +	int fd;
> > > +	int refcount;
> > > +};
> > > +
> > > +struct xsk_socket {
> > > +	struct xsk_ring_cons *rx;
> > > +	struct xsk_ring_prod *tx;
> > > +	__u64 outstanding_tx;
> > > +	struct xsk_umem *umem;
> > > +	struct xsk_socket_config config;
> > > +	int fd;
> > > +	int xsks_map;
> > > +	int ifindex;
> > > +	int prog_fd;
> > > +	int qidconf_map_fd;
> > > +	int xsks_map_fd;
> > > +	__u32 queue_id;
> > > +};
> > > +
> > > +struct xsk_nl_info {
> > > +	bool xdp_prog_attached;
> > > +	int ifindex;
> > > +	int fd;
> > > +};
> > > +
> > > +#define MAX_QUEUES 128
> >
> > Why is this a fixed constant here, shouldn't this be dynamic due to being NIC
> > specific anyway?
>
> It was only here for simplicity. If a NIC had more queues, it would
> require a recompile of the lib. Obviously, not desirable in a distro.
> What I could do is to read the max "combined" queues (pre-set maximum
> in the ethtool output) from the same interface as ethtool uses and size
> the array after that. Or is there a simpler way?
> What to do if the NIC
> does not have a "combined", or is there no such NIC (seems the common
> HW ones set this)?
>
> > [...]
> > > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr)
> > > +{
> > > +	return &((char *)(umem->umem_area))[addr];
> > > +}
> >
> > There's also a xsk_umem__get_data_raw() doing the same. Why having both, resp.
> > when to choose which? ;)
>
> There is enough to have the xsk_umem__get_data_raw() function.
> xsk_umem__get_data() is just a convenience function for which the
> application does not have to store the beginning of the umem. But as
> the application always has to provide this anyway in the
> xsk_umem__create() function, it might as well store this pointer. I
> will delete xsk_umem__get_data() and rename xsk_umem__get_data_raw()
> to xsk_umem__get_data().
>
> > > +int xsk_umem__fd(const struct xsk_umem *umem)
> > > +{
> > > +	return umem ? umem->fd : -EINVAL;
> > > +}
> > > +
> > > +int xsk_socket__fd(const struct xsk_socket *xsk)
> > > +{
> > > +	return xsk ? xsk->fd : -EINVAL;
> > > +}
> > > +
> > > +static bool xsk_page_aligned(void *buffer)
> > > +{
> > > +	unsigned long addr = (unsigned long)buffer;
> > > +
> > > +	return !(addr & (getpagesize() - 1));
> > > +}
> > > +
> > > +static void xsk_set_umem_config(struct xsk_umem_config *cfg,
> > > +				const struct xsk_umem_config *usr_cfg)
> > > +{
> > > +	if (!usr_cfg) {
> > > +		cfg->fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> > > +		cfg->comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> > > +		cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
> > > +		cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
> > > +		return;
> > > +	}
> > > +
> > > +	cfg->fill_size = usr_cfg->fill_size;
> > > +	cfg->comp_size = usr_cfg->comp_size;
> > > +	cfg->frame_size = usr_cfg->frame_size;
> > > +	cfg->frame_headroom = usr_cfg->frame_headroom;
> >
> > Just optional nit, might be a bit nicer to have it in this form:
> >
> >   cfg->fill_size = usr_cfg ? usr_cfg->fill_size :
> >                    XSK_RING_PROD__DEFAULT_NUM_DESCS;
>
> I actually think the current form is clearer when there are multiple
> lines. If there was only one line, I would agree with you.
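
For comparison, the two forms discussed above behave the same; the difference is only in readability once several fields are involved. Below is a minimal sketch of both, assuming a hypothetical struct (umem_cfg_sketch) and made-up default values rather than the real definitions from xsk.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical defaults for illustration, not the values from xsk.h. */
#define SKETCH_DEFAULT_NUM_DESCS	2048
#define SKETCH_DEFAULT_FRAME_SIZE	2048
#define SKETCH_DEFAULT_HEADROOM		0

/* Hypothetical stand-in for struct xsk_umem_config. */
struct umem_cfg_sketch {
	uint32_t fill_size;
	uint32_t comp_size;
	uint32_t frame_size;
	uint32_t frame_headroom;
};

/* Form used in the patch: one early-return block sets all defaults. */
static void set_cfg_block(struct umem_cfg_sketch *cfg,
			  const struct umem_cfg_sketch *usr_cfg)
{
	assert(cfg != NULL);

	if (!usr_cfg) {
		cfg->fill_size = SKETCH_DEFAULT_NUM_DESCS;
		cfg->comp_size = SKETCH_DEFAULT_NUM_DESCS;
		cfg->frame_size = SKETCH_DEFAULT_FRAME_SIZE;
		cfg->frame_headroom = SKETCH_DEFAULT_HEADROOM;
		return;
	}

	cfg->fill_size = usr_cfg->fill_size;
	cfg->comp_size = usr_cfg->comp_size;
	cfg->frame_size = usr_cfg->frame_size;
	cfg->frame_headroom = usr_cfg->frame_headroom;
}

/* Form suggested in the review: one ternary per field. */
static void set_cfg_ternary(struct umem_cfg_sketch *cfg,
			    const struct umem_cfg_sketch *usr_cfg)
{
	assert(cfg != NULL);

	cfg->fill_size = usr_cfg ? usr_cfg->fill_size :
				   SKETCH_DEFAULT_NUM_DESCS;
	cfg->comp_size = usr_cfg ? usr_cfg->comp_size :
				   SKETCH_DEFAULT_NUM_DESCS;
	cfg->frame_size = usr_cfg ? usr_cfg->frame_size :
				    SKETCH_DEFAULT_FRAME_SIZE;
	cfg->frame_headroom = usr_cfg ? usr_cfg->frame_headroom :
					SKETCH_DEFAULT_HEADROOM;
}
```

The block form tests usr_cfg once, while the ternary form repeats the test per field but keeps each field's default next to its assignment.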
>
> > > +}
> > > +
> > > +static void xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
> > > +				      const struct xsk_socket_config *usr_cfg)
> > > +{
> > > +	if (!usr_cfg) {
> > > +		cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> > > +		cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> > > +		cfg->libbpf_flags = 0;
> > > +		cfg->xdp_flags = 0;
> > > +		cfg->bind_flags = 0;
> > > +		return;
> > > +	}
> > > +
> > > +	cfg->rx_size = usr_cfg->rx_size;
> > > +	cfg->tx_size = usr_cfg->tx_size;
> > > +	cfg->libbpf_flags = usr_cfg->libbpf_flags;
> > > +	cfg->xdp_flags = usr_cfg->xdp_flags;
> > > +	cfg->bind_flags = usr_cfg->bind_flags;
> >
> > (Ditto)
> >
> > > +}
> > > +
> > > +int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
> > > +		     struct xsk_ring_prod *fill, struct xsk_ring_cons *comp,
> > > +		     const struct xsk_umem_config *usr_config)
> > > +{
> > > +	struct xdp_mmap_offsets off;
> > > +	struct xdp_umem_reg mr;
> > > +	struct xsk_umem *umem;
> > > +	socklen_t optlen;
> > > +	void *map;
> > > +	int err;
> > > +
> > > +	if (!umem_area || !umem_ptr || !fill || !comp)
> > > +		return -EFAULT;
> > > +	if (!size && !xsk_page_aligned(umem_area))
> > > +		return -EINVAL;
> > > +
> > > +	umem = calloc(1, sizeof(*umem));
> > > +	if (!umem)
> > > +		return -ENOMEM;
> > > +
> > > +	umem->fd = socket(AF_XDP, SOCK_RAW, 0);
> > > +	if (umem->fd < 0) {
> > > +		err = -errno;
> > > +		goto out_umem_alloc;
> > > +	}
> > > +
> > > +	umem->umem_area = umem_area;
> > > +	xsk_set_umem_config(&umem->config, usr_config);
> > > +
> > > +	mr.addr = (uintptr_t)umem_area;
> > > +	mr.len = size;
> > > +	mr.chunk_size = umem->config.frame_size;
> > > +	mr.headroom = umem->config.frame_headroom;
> > > +
> > > +	err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
> > > +	if (err) {
> > > +		err = -errno;
> > > +		goto out_socket;
> > > +	}
> > > +	err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_FILL_RING,
> > > +			 &umem->config.fill_size,
> > > +			 sizeof(umem->config.fill_size));
> > > +	if (err) {
> > > +		err = -errno;
> > > +		goto out_socket;
> > > +	}
> > > +	err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
> > > +			 &umem->config.comp_size,
> > > +			 sizeof(umem->config.comp_size));
> > > +	if (err) {
> > > +		err = -errno;
> > > +		goto out_socket;
> > > +	}
> > > +
> > > +	optlen = sizeof(off);
> > > +	err = getsockopt(umem->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
> > > +	if (err) {
> > > +		err = -errno;
> > > +		goto out_socket;
> > > +	}
> > > +
> > > +	map = xsk_mmap(NULL, off.fr.desc +
> > > +		       umem->config.fill_size * sizeof(__u64),
> > > +		       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> > > +		       umem->fd, XDP_UMEM_PGOFF_FILL_RING);
> > > +	if (map == MAP_FAILED) {
> > > +		err = -errno;
> > > +		goto out_socket;
> > > +	}
> > > +
> > > +	umem->fill = fill;
> > > +	fill->mask = umem->config.fill_size - 1;
> > > +	fill->size = umem->config.fill_size;
> > > +	fill->producer = map + off.fr.producer;
> > > +	fill->consumer = map + off.fr.consumer;
> > > +	fill->ring = map + off.fr.desc;
> > > +	fill->cached_cons = umem->config.fill_size;
> > > +
> > > +	map = xsk_mmap(NULL,
> > > +		       off.cr.desc + umem->config.comp_size * sizeof(__u64),
> > > +		       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> > > +		       umem->fd, XDP_UMEM_PGOFF_COMPLETION_RING);
> > > +	if (map == MAP_FAILED) {
> > > +		err = -errno;
> > > +		goto out_mmap;
> > > +	}
> > > +
> > > +	umem->comp = comp;
> > > +	comp->mask = umem->config.comp_size - 1;
> > > +	comp->size = umem->config.comp_size;
> > > +	comp->producer = map + off.cr.producer;
> > > +	comp->consumer = map + off.cr.consumer;
> > > +	comp->ring = map + off.cr.desc;
> > > +
> > > +	*umem_ptr = umem;
> > > +	return 0;
> > > +
> > > +out_mmap:
> > > +	munmap(umem->fill,
> > > +	       off.fr.desc + umem->config.fill_size * sizeof(__u64));
> > > +out_socket:
> > > +	close(umem->fd);
> > > +out_umem_alloc:
> > > +	free(umem);
> > > +	return err;
> > > +}
> > > +
> > > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
> > > +{
> > > +	struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
> > > +	struct xsk_nl_info *nl_info = cookie;
> > > +	struct ifinfomsg *ifinfo = msg;
> > > +	unsigned char mode;
> > > +	int err;
> > > +
> > > +	if (nl_info->ifindex && nl_info->ifindex != ifinfo->ifi_index)
> > > +		return 0;
> > > +
> > > +	if (!tb[IFLA_XDP])
> > > +		return 0;
> > > +
> > > +	err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
> > > +				      NULL);
> > > +	if (err)
> > > +		return err;
> > > +
> > > +	if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
> > > +		return 0;
> > > +
> > > +	mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
> > > +	if (mode == XDP_ATTACHED_NONE)
> > > +		return 0;
> > > +
> > > +	nl_info->xdp_prog_attached = true;
> > > +	nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
> >
> > Hm, I don't think this works if I read the intention of this helper correctly.
> >
> > IFLA_XDP_FD is never set for retrieving the prog from the kernel. So the
> > above is a bug.
> >
> > We also have bpf_get_link_xdp_id(). This should probably just be reused in
> > this context here.
>
> If bpf_get_link_xdp_id() will fit my bill, I will happily use it. I
> will check it out and hopefully I can drop all this code. Thanks.
>

As far as I can see, all you need to know is whether an XDP program is
already attached to the interface the xsk socket is bound to, no?

If so, then within xsk_setup_xdp_prog() you could do something like:

	u32 prog_id = 0;

	bpf_get_link_xdp_id(xsk->ifindex, &prog_id, xsk->config.xdp_flags);
	if (!prog_id) {
		// create maps
		// load xdp prog
	} else {
		// the kernel reports an id, so get an fd for that program
		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
	}
	xsk_update_bpf_maps(xsk, true, xsk->fd);

If that's OK, then xsk_xdp_prog_attached() and xsk_parse_nl() could be dropped.
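
To make the control flow of that suggestion concrete, here is a minimal sketch that compiles without libbpf. Everything in it is a hypothetical stand-in: the *_stub functions only mimic the roles of bpf_get_link_xdp_id(), bpf_prog_get_fd_by_id() and the map/program setup, and struct xsk_socket_sketch carries just the fields the flow touches. It also assumes that the program id reported by the kernel has to be turned into an fd before it is stored, and that the return value of the lookup should be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct xsk_socket. */
struct xsk_socket_sketch {
	int ifindex;
	int fd;			/* AF_XDP socket fd, must not be overwritten */
	int prog_fd;		/* fd of the XDP program in use */
	uint32_t xdp_flags;
	int loaded_own_prog;	/* set when the default program was loaded */
};

/* Test knob: id of the program "attached" to the interface, 0 for none. */
static uint32_t fake_attached_prog_id;

/* Stub playing the role of bpf_get_link_xdp_id(). */
static int get_link_xdp_id_stub(int ifindex, uint32_t *prog_id, uint32_t flags)
{
	(void)ifindex;
	(void)flags;
	*prog_id = fake_attached_prog_id;
	return 0;
}

/* Stub playing the role of bpf_prog_get_fd_by_id(). */
static int prog_get_fd_by_id_stub(uint32_t id)
{
	return 100 + (int)id;
}

/* Stub for "create maps" plus "load xdp prog". */
static int load_default_prog_stub(struct xsk_socket_sketch *xsk)
{
	xsk->prog_fd = 7;
	xsk->loaded_own_prog = 1;
	return 0;
}

/* Reuse an attached program if there is one, otherwise load our own. */
static int setup_xdp_prog_sketch(struct xsk_socket_sketch *xsk)
{
	uint32_t prog_id = 0;
	int err;

	assert(xsk != NULL);

	err = get_link_xdp_id_stub(xsk->ifindex, &prog_id, xsk->xdp_flags);
	if (err)
		return err;

	if (!prog_id)
		return load_default_prog_stub(xsk);

	xsk->prog_fd = prog_get_fd_by_id_stub(prog_id);
	return xsk->prog_fd < 0 ? xsk->prog_fd : 0;
}
```

The map update with xsk->fd would follow in both branches; it is left out here because it does not depend on which branch was taken.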

> > > +	return 0;
> > > +}
> > > +
> > > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk)
> > > +{
> > > +	struct xsk_nl_info nl_info;
> > > +	unsigned int nl_pid;
> > > +	char err_buf[256];
> > > +	int sock, err;
> > > +
> > > +	sock = libbpf_netlink_open(&nl_pid);
> > > +	if (sock < 0)
> > > +		return false;
> > > +
> > > +	nl_info.xdp_prog_attached = false;
> > > +	nl_info.ifindex = xsk->ifindex;
> > > +	nl_info.fd = -1;
> > > +
> > > +	err = libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info);
> > > +	if (err) {
> > > +		libbpf_strerror(err, err_buf, sizeof(err_buf));
> > > +		pr_warning("Error:\n%s\n", err_buf);
> > > +		close(sock);
> > > +		return false;
> > > +	}
> > > +
> > > +	close(sock);
> > > +	xsk->prog_fd = nl_info.fd;
> > > +	return nl_info.xdp_prog_attached;
> > > +}
> >
> > (See bpf_get_link_xdp_id().)
> >
> > > +
> > > +static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> > > +{
> > > +	char bpf_log_buf[BPF_LOG_BUF_SIZE];
> > > +	int err, prog_fd;
> > > +
> > > +	/* This is the C-program:
> > > +	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> > > +	 * {
> > > +	 *     int *qidconf, index = ctx->rx_queue_index;
> > [...]
> > > +
> > > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> > > +		       __u32 queue_id, struct xsk_umem *umem,
> > > +		       struct xsk_ring_cons *rx, struct xsk_ring_prod *tx,
> > > +		       const struct xsk_socket_config *usr_config)
> > > +{
> > > +	struct sockaddr_xdp sxdp = {};
> > > +	struct xdp_mmap_offsets off;
> > > +	struct xsk_socket *xsk;
> > > +	socklen_t optlen;
> > > +	void *map;
> > > +	int err;
> > > +
> > > +	if (!umem || !xsk_ptr || !rx || !tx)
> > > +		return -EFAULT;
> > > +
> > > +	if (umem->refcount) {
> > > +		pr_warning("Error: shared umems not supported by libbpf.\n");
> > > +		return -EBUSY;
> > > +	}
> > > +
> > > +	xsk = calloc(1, sizeof(*xsk));
> > > +	if (!xsk)
> > > +		return -ENOMEM;
> > > +
> > > +	if (umem->refcount++ > 0) {
> >
> > Should this refcount rather be atomic actually?
>
> Neither our config nor data plane interfaces are reentrant for
> performance reasons. Any concurrency has to be handled explicitly on
> the application level. This so that it only penalizes apps that really
> need this.
>
> Thanks for all your reviews: Magnus
>
> > > +		xsk->fd = socket(AF_XDP, SOCK_RAW, 0);
> > > +		if (xsk->fd < 0) {
> > > +			err = -errno;
> > > +			goto out_xsk_alloc;
> > > +		}
> > > +	} else {
> > > +		xsk->fd = umem->fd;
> > > +	}
> > > +
> > > +	xsk->outstanding_tx = 0;
> > > +	xsk->queue_id = queue_id;
> > > +	xsk->umem = umem;
> > > +	xsk->ifindex = if_nametoindex(ifname);
> > > +	if (!xsk->ifindex) {
> > > +		err = -errno;
> > > +		goto out_socket;
> > > +	}
> > > +
> > > +	xsk_set_xdp_socket_config(&xsk->config, usr_config);
> > [...]
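
On the reentrancy point above (plain refcount, concurrency left to the application), this is roughly what an application that does set up sockets from several threads would have to do on its side. It is only a sketch under stated assumptions: socket_create_sketch() is a hypothetical stand-in for a non-reentrant library call such as xsk_socket__create(), struct umem_sketch is a made-up reduction of the umem to its refcount, and the lock lives entirely in application code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a umem with a plain, non-atomic refcount. */
struct umem_sketch {
	int refcount;
};

/*
 * Hypothetical stand-in for a non-reentrant library call.
 * Returns how many users the umem had before this call.
 */
static int socket_create_sketch(struct umem_sketch *umem)
{
	assert(umem != NULL);
	return umem->refcount++;
}

/* Application-owned lock; the library itself takes no locks. */
static pthread_mutex_t setup_lock = PTHREAD_MUTEX_INITIALIZER;

/* Application-level wrapper that serializes socket setup. */
static int app_socket_create(struct umem_sketch *umem)
{
	int ret;

	pthread_mutex_lock(&setup_lock);
	ret = socket_create_sketch(umem);
	pthread_mutex_unlock(&setup_lock);

	return ret;
}
```

Single-threaded applications call the library directly and never pay for the lock, which is the trade-off described in the reply.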