From: Magnus Karlsson
Date: Thu, 31 Jan 2019 15:23:49 +0100
Subject: Re: [PATCH bpf-next v3 2/3] libbpf: add support for using AF_XDP sockets
To: Maciej Fijalkowski
Cc: Daniel Borkmann, Magnus Karlsson, Björn Töpel, ast@kernel.org,
	Network Development, Jakub Kicinski, Björn Töpel,
	"Zhang, Qi Z", Jesper Dangaard Brouer
In-Reply-To: <20190131150555.00002898@gmail.com>
On Thu, Jan 31, 2019 at 3:06 PM Maciej Fijalkowski wrote:
>
> On Thu, 31 Jan 2019 14:55:06 +0100
> Magnus Karlsson wrote:
>
> > On Thu, Jan 31, 2019 at 11:31 AM Daniel Borkmann wrote:
> > >
> > > On 01/29/2019 04:12 PM, Magnus Karlsson wrote:
> > > > This commit adds AF_XDP support to libbpf. The main reason for this is
> > > > to facilitate writing applications that use AF_XDP by offering
> > > > higher-level APIs that hide many of the details of the AF_XDP
> > > > uapi. This is in the same vein as libbpf facilitates XDP adoption by
> > > > offering easy-to-use higher-level interfaces of XDP
> > > > functionality. Hopefully this will facilitate adoption of AF_XDP, make
> > > > applications using it simpler and smaller, and finally also make it
> > > > possible for applications to benefit from optimizations in the AF_XDP
> > > > user space access code. Previously, people just copied and pasted the
> > > > code from the sample application into their application, which is not
> > > > desirable.
> > > >
> > > > The interface is composed of two parts:
> > > >
> > > > * Low-level access interface to the four rings and the packet
> > > > * High-level control plane interface for creating and setting
> > > >   up umems and af_xdp sockets as well as a simple XDP program.
> > > >
> > > > Signed-off-by: Magnus Karlsson
> > > [...]
> > > > +
> > > > +static __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
> > > > +{
> > > > +	__u32 free_entries = r->cached_cons - r->cached_prod;
> > > > +
> > > > +	if (free_entries >= nb)
> > > > +		return free_entries;
> > > > +
> > > > +	/* Refresh the local tail pointer.
> > > > +	 * cached_cons is r->size bigger than the real consumer pointer so
> > > > +	 * that this addition can be avoided in the more frequently
> > > > +	 * executed code that computes free_entries in the beginning of
> > > > +	 * this function. Without this optimization it would have been
> > > > +	 * free_entries = r->cached_cons - r->cached_prod + r->size.
> > > > +	 */
> > > > +	r->cached_cons = *r->consumer + r->size;
> > > > +
> > > > +	return r->cached_cons - r->cached_prod;
> > > > +}
> > > > +
> > > > +static __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
> > > > +{
> > > > +	__u32 entries = r->cached_prod - r->cached_cons;
> > > > +
> > > > +	if (entries == 0) {
> > > > +		r->cached_prod = *r->producer;
> > > > +		entries = r->cached_prod - r->cached_cons;
> > > > +	}
> > > > +
> > > > +	return (entries > nb) ? nb : entries;
> > > > +}
> > > > +
> > > > +size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod, size_t nb, __u32 *idx)
> > > > +{
> > > > +	if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
> > > > +		return 0;
> > > > +
> > > > +	*idx = prod->cached_prod;
> > > > +	prod->cached_prod += nb;
> > > > +
> > > > +	return nb;
> > > > +}
> > > > +
> > > > +void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
> > > > +{
> > > > +	/* Make sure everything has been written to the ring before signalling
> > > > +	 * this to the kernel.
> > > > +	 */
> > > > +	smp_wmb();
> > > > +
> > > > +	*prod->producer += nb;
> > > > +}
> > > > +
> > > > +size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons, size_t nb, __u32 *idx)
> > > > +{
> > > > +	size_t entries = xsk_cons_nb_avail(cons, nb);
> > > > +
> > > > +	if (likely(entries > 0)) {
> > > > +		/* Make sure we do not speculatively read the data before
> > > > +		 * we have received the packet buffers from the ring.
> > > > +		 */
> > > > +		smp_rmb();
> > > > +
> > > > +		*idx = cons->cached_cons;
> > > > +		cons->cached_cons += entries;
> > > > +	}
> > > > +
> > > > +	return entries;
> > > > +}
> > > > +
> > > > +void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb)
> > > > +{
> > > > +	*cons->consumer += nb;
> > > > +}
> > > > +
> > > > +void *xsk_umem__get_data_raw(void *umem_area, __u64 addr)
> > > > +{
> > > > +	return &((char *)umem_area)[addr];
> > > > +}
> > > > +
> > > > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr)
> > > > +{
> > > > +	return &((char *)(umem->umem_area))[addr];
> > > > +}
> > >
> > > Shouldn't some of the above helpers for the critical path be exposed as
> > > static inline functions instead?
> >
> > Could be. I will run some experiments to see how much performance
> > would improve. I kept them non-static so that they could be changed
> > if someone thought of a faster way of doing these operations. But
> > that might be unnecessary, since we can change parts of them even if
> > they are static. I will measure and come back to you.
> >
> > > [...]
> > > > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
> > > > +{
> > > > +	struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
> > > > +	struct xsk_nl_info *nl_info = cookie;
> > > > +	unsigned char mode;
> > > > +	int err;
> > > > +
> > > > +	(void)msg;
> > >
> > > Unused?
> >
> > Yes.
>
> IMO we should make sure here that we actually do the parsing for
> xsk->ifindex only. So nl_info should have an ifindex field which will be
> set to xsk->ifindex, and here we will do:
>
> struct ifinfomsg *ifinfo = msg;
> if (nl_info->ifindex && nl_info->ifindex != ifinfo->ifi_index)
> 	return 0;
>
> RTM_GETLINK with the NLM_F_DUMP flag will give you information from all
> interfaces, so let's suppose that you are running xdpsock on eth0 but on
> eth1 there is an XDP prog already running. IIUC, here you might have a
> situation where xsk_xdp_prog_attached will return true because you did
> the parsing for every interface: eth1 has an XDP prog attached, so
> nl_info will be filled. Then, in xsk_setup_xdp_prog, you won't create
> the maps.
>
> A check against xsk->ifindex at the beginning of xsk_parse_nl will
> prevent this kind of behavior.

Thanks for catching this, Maciej. Will fix.

/Magnus

> > > > +	nl_info->xdp_prog_attached = false;
> > > > +	if (!tb[IFLA_XDP])
> > > > +		return 0;
> > > > +
> > > > +	err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
> > > > +				      NULL);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > > +	if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
> > > > +		return 0;
> > > > +
> > > > +	mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
> > > > +	if (mode == XDP_ATTACHED_NONE)
> > > > +		return 0;
> > >
> > > Probably good to memset and/or init the passed struct xsk_nl_info (e.g.
> > > fd to -1 etc) such that if some user did not, we won't end up with garbage
> > > values on return 0.
> >
> > Will do.
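For the record, a rough sketch of what the combined fix could look like,
folding in Maciej's ifindex check and the init Daniel asked for. The
ifindex field of struct xsk_nl_info is new here (not in the posted v3);
the rest mirrors the patch:

struct xsk_nl_info {
	bool xdp_prog_attached;
	int ifindex;	/* new: only parse links for this interface */
	int fd;
};

static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
{
	struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
	struct xsk_nl_info *nl_info = cookie;
	struct ifinfomsg *ifinfo = msg;
	unsigned char mode;
	int err;

	/* RTM_GETLINK with NLM_F_DUMP reports every interface in the
	 * system, so skip messages that are not about our interface.
	 */
	if (nl_info->ifindex && nl_info->ifindex != ifinfo->ifi_index)
		return 0;

	/* Initialize the output fields so the caller never sees garbage
	 * when we return 0 early below.
	 */
	nl_info->xdp_prog_attached = false;
	nl_info->fd = -1;

	if (!tb[IFLA_XDP])
		return 0;

	err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
				      NULL);
	if (err)
		return err;

	if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
		return 0;

	mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
	if (mode == XDP_ATTACHED_NONE)
		return 0;

	nl_info->xdp_prog_attached = true;
	nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
	return 0;
}

xsk_xdp_prog_attached() would then set nl_info.ifindex = xsk->ifindex
and nl_info.fd = -1 before calling libbpf_nl_get_link().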
> > > > +	nl_info->xdp_prog_attached = true;
> > > > +	nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk)
> > > > +{
> > > > +	struct xsk_nl_info nl_info;
> > > > +	unsigned int nl_pid;
> > > > +	char err_buf[256];
> > > > +	int sock, err;
> > > > +
> > > > +	sock = libbpf_netlink_open(&nl_pid);
> > > > +	if (sock < 0)
> > > > +		return false;
> > > > +
> > > > +	nl_info.xdp_prog_attached = false;
> > > > +	nl_info.fd = 0;
> > > > +
> > > > +	err = libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info);
> > > > +	if (err) {
> > > > +		libbpf_strerror(err, err_buf, sizeof(err_buf));
> > > > +		pr_warning("Error:\n%s\n", err_buf);
> > > > +		return false;
> > > > +	}
> > > > +
> > > > +	xsk->prog_fd = nl_info.fd;
> > > > +	return nl_info.xdp_prog_attached;
> > >
> > > Don't we leak sock here and in error above?
> >
> > Will fix the fd leaks that you have pointed out here and below, and go
> > through the code to see if there are more fd leaks. Do I ever learn
> > ;-).
> >
> > Thanks for your review. Appreciated.
> >
> > /Magnus
> >
> > > > +}
> > > > +
> > > > +static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> > > > +{
> > > > +	char bpf_log_buf[BPF_LOG_BUF_SIZE];
> > > > +	int err, prog_fd;
> > > > +
> > > > +	/* This is the C program:
> > > > +	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> > > > +	 * {
> > > > +	 *     int *qidconf, index = ctx->rx_queue_index;
> > > > +	 *
> > > > +	 *     // A set entry here means that the corresponding queue_id
> > > > +	 *     // has an active AF_XDP socket bound to it.
> > > > +	 *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
> > > > +	 *     if (!qidconf)
> > > > +	 *         return XDP_ABORTED;
> > > > +	 *
> > > > +	 *     if (*qidconf)
> > > > +	 *         return bpf_redirect_map(&xsks_map, index, 0);
> > > > +	 *
> > > > +	 *     return XDP_PASS;
> > > > +	 * }
> > > > +	 */
> > > > +	struct bpf_insn prog[] = {
> > > > +		/* r1 = *(u32 *)(r1 + 16) */
> > > > +		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> > > > +		/* *(u32 *)(r10 - 4) = r1 */
> > > > +		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> > > > +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> > > > +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> > > > +		BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> > > > +		BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> > > > +		BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> > > > +		BPF_MOV32_IMM(BPF_REG_0, 0),
> > > > +		/* if r1 == 0 goto +8 */
> > > > +		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> > > > +		BPF_MOV32_IMM(BPF_REG_0, 2),
> > > > +		/* r1 = *(u32 *)(r1 + 0) */
> > > > +		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> > > > +		/* if r1 == 0 goto +5 */
> > > > +		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> > > > +		/* r2 = *(u32 *)(r10 - 4) */
> > > > +		BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> > > > +		BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> > > > +		BPF_MOV32_IMM(BPF_REG_3, 0),
> > > > +		BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> > > > +		/* The jumps are to this instruction */
> > > > +		BPF_EXIT_INSN(),
> > > > +	};
> > > > +	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> > > > +
> > > > +	prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, prog, insns_cnt,
> > > > +				   "LGPL-2.1 or BSD-2-Clause", 0, bpf_log_buf,
> > > > +				   BPF_LOG_BUF_SIZE);
> > > > +	if (prog_fd < 0) {
> > > > +		pr_warning("BPF log buffer:\n%s", bpf_log_buf);
> > > > +		return prog_fd;
> > > > +	}
> > > > +
> > > > +	err = bpf_set_link_xdp_fd(xsk->ifindex, prog_fd, xsk->config.xdp_flags);
> > > > +	if (err)
> > > > +		return err;
> > >
> > > Leaks prog_fd on error.
> > >
> > > > +	xsk->prog_fd = prog_fd;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int xsk_create_bpf_maps(struct xsk_socket *xsk)
> > > > +{
> > > > +	int fd;
> > > > +
> > > > +	fd = bpf_create_map_name(BPF_MAP_TYPE_ARRAY, "qidconf_map",
> > > > +				 sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > > > +	if (fd < 0)
> > > > +		return fd;
> > > > +	xsk->qidconf_map_fd = fd;
> > > > +
> > > > +	fd = bpf_create_map_name(BPF_MAP_TYPE_XSKMAP, "xsks_map",
> > > > +				 sizeof(int), sizeof(int), MAX_QUEUES, 0);
> > > > +	if (fd < 0)
> > > > +		return fd;
> > >
> > > Leaks first map fd on error.
> > >
> > > > +	xsk->xsks_map_fd = fd;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
> > > > +			       int xsks_value)
> > > > +{
> > > > +	bool qidconf_map_updated = false, xsks_map_updated = false;
> > > > +	struct bpf_prog_info prog_info = {};
> > > > +	__u32 prog_len = sizeof(prog_info);
> > > > +	struct bpf_map_info map_info;
> > > > +	__u32 map_len = sizeof(map_info);
> > > > +	__u32 *map_ids;
> > > > +	int reset_value = 0;
> > > > +	__u32 num_maps;
> > > > +	unsigned int i;
> > > > +	int err;
> > > > +
> > > > +	err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > > +	num_maps = prog_info.nr_map_ids;
> > > > +
> > > > +	map_ids = malloc(prog_info.nr_map_ids * sizeof(*map_ids));
> > >
> > > calloc()?
> > >
> > > > +	if (!map_ids)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	memset(&prog_info, 0, prog_len);
> > > > +	prog_info.nr_map_ids = num_maps;
> > > > +	prog_info.map_ids = (__u64)(unsigned long)map_ids;
> > > > +
> > > > +	err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len);
> > > > +	if (err)
> > > > +		return err;
> > >
> > > Leaks map_ids on error.
> > >
> > > > +
> > > > +	for (i = 0; i < prog_info.nr_map_ids; i++) {
> > > > +		int fd;
> > > > +
> > > > +		fd = bpf_map_get_fd_by_id(map_ids[i]);
> > > > +		if (fd < 0) {
> > > > +			err = -errno;
> > > > +			goto out;
> > > > +		}
> > > > +
> > > > +		err = bpf_obj_get_info_by_fd(fd, &map_info, &map_len);
> > > > +		if (err)
> > > > +			goto out;
> > >
> > > close(fd) on error; also the case below. Please double check everything
> > > so that no fd leaks by accident into the app.
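Concretely, the lookup loop would need a close(fd) on every early exit,
including map_ids[i] entries that match neither map. A sketch against
the v3 code (untested):

	for (i = 0; i < prog_info.nr_map_ids; i++) {
		int fd;

		fd = bpf_map_get_fd_by_id(map_ids[i]);
		if (fd < 0) {
			err = -errno;
			goto out;
		}

		err = bpf_obj_get_info_by_fd(fd, &map_info, &map_len);
		if (err) {
			close(fd);	/* do not leak the map fd */
			goto out;
		}

		if (!strcmp(map_info.name, "qidconf_map")) {
			err = bpf_map_update_elem(fd, &xsk->queue_id,
						  &qidconf_value, 0);
			if (err) {
				close(fd);
				goto out;
			}
			qidconf_map_updated = true;
			xsk->qidconf_map_fd = fd;
		} else if (!strcmp(map_info.name, "xsks_map")) {
			err = bpf_map_update_elem(fd, &xsk->queue_id,
						  &xsks_value, 0);
			if (err) {
				close(fd);
				goto out;
			}
			xsks_map_updated = true;
			xsk->xsks_map_fd = fd;
		} else {
			/* Not a map we keep; close it right away. */
			close(fd);
		}

		if (qidconf_map_updated && xsks_map_updated)
			break;
	}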
> > > > +
> > > > +		if (!strcmp(map_info.name, "qidconf_map")) {
> > > > +			err = bpf_map_update_elem(fd, &xsk->queue_id,
> > > > +						  &qidconf_value, 0);
> > > > +			if (err)
> > > > +				goto out;
> > > > +			qidconf_map_updated = true;
> > > > +			xsk->qidconf_map_fd = fd;
> > > > +		} else if (!strcmp(map_info.name, "xsks_map")) {
> > > > +			err = bpf_map_update_elem(fd, &xsk->queue_id,
> > > > +						  &xsks_value, 0);
> > > > +			if (err)
> > > > +				goto out;
> > > > +			xsks_map_updated = true;
> > > > +			xsk->xsks_map_fd = fd;
> > > > +		}
> > > > +
> > > > +		if (qidconf_map_updated && xsks_map_updated)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	if (!(qidconf_map_updated && xsks_map_updated)) {
> > > > +		err = -ENOENT;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +out:
> > > > +	if (qidconf_map_updated)
> > > > +		(void)bpf_map_update_elem(xsk->qidconf_map_fd, &xsk->queue_id,
> > > > +					  &reset_value, 0);
> > > > +	if (xsks_map_updated)
> > > > +		(void)bpf_map_update_elem(xsk->xsks_map_fd, &xsk->queue_id,
> > > > +					  &reset_value, 0);
> > > > +	return err;
> > > > +}
> > > > +
> > > > +static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> > > > +{
> > > > +	bool prog_loaded = false;
> > > > +	int err;
> > > > +
> > > > +	if (!xsk_xdp_prog_attached(xsk)) {
> > > > +		err = xsk_create_bpf_maps(xsk);
> > > > +		if (err)
> > > > +			goto out_load;
> > > > +
> > > > +		err = xsk_load_xdp_prog(xsk);
> > > > +		if (err)
> > > > +			return err;
> > >
> > > Needs to undo created maps on error.
> > >
> > > > +		prog_loaded = true;
> > > > +	}
> > > > +
> > > > +	err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> > > > +	if (err)
> > > > +		goto out_load;
> > > > +
> > > > +	return 0;
> > > > +
> > > > +out_load:
> > > > +	if (prog_loaded)
> > > > +		close(xsk->prog_fd);
> > >
> > > Ditto
> > >
> > > > +	return err;
> > > > +}
> > > > +
> > > > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> > > > +		       __u32 queue_id, struct xsk_umem *umem,
> > > > +		       struct xsk_ring_cons *rx, struct xsk_ring_prod *tx,
> > > > +		       const struct xsk_socket_config *usr_config)
> > > > +{
> > > > +	struct sockaddr_xdp sxdp = {};
> > > > +	struct xdp_mmap_offsets off;
> > > > +	struct xsk_socket *xsk;
> > > > +	socklen_t optlen;
> > > > +	void *map;
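To tie off Daniel's "undo created maps" comments above, the error
unwinding in xsk_setup_xdp_prog() could look roughly like this. A sketch
only; the maps_created flag is an addition that is not in the posted
patch:

static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
{
	bool prog_loaded = false;
	bool maps_created = false;	/* new: track what we must undo */
	int err;

	if (!xsk_xdp_prog_attached(xsk)) {
		err = xsk_create_bpf_maps(xsk);
		if (err)
			return err;
		maps_created = true;

		err = xsk_load_xdp_prog(xsk);
		if (err)
			goto out_maps;	/* undo the maps created above */
		prog_loaded = true;
	}

	err = xsk_update_bpf_maps(xsk, true, xsk->fd);
	if (err)
		goto out_load;

	return 0;

out_load:
	if (prog_loaded)
		close(xsk->prog_fd);
out_maps:
	if (maps_created) {
		close(xsk->qidconf_map_fd);
		close(xsk->xsks_map_fd);
	}
	return err;
}

Nothing is closed when the program and maps were already attached by an
earlier socket, since in that case this function did not create them.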