netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Magnus Karlsson <magnus.karlsson@gmail.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: "Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	ast@kernel.org, "Network Development" <netdev@vger.kernel.org>,
	"Jakub Kicinski" <jakub.kicinski@netronome.com>,
	"Björn Töpel" <bjorn.topel@gmail.com>,
	"Zhang, Qi Z" <qi.z.zhang@intel.com>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	xiaolong.ye@intel.com
Subject: Re: [PATCH bpf-next v4 1/2] libbpf: add support for using AF_XDP sockets
Date: Mon, 18 Feb 2019 09:59:30 +0100	[thread overview]
Message-ID: <CAJ8uoz1xjv8OBAn==S3EBiGvTqp_dd_t7YQqp2TOSR6JSOdrnA@mail.gmail.com> (raw)
In-Reply-To: <f2f4134d-a01f-8d14-26c3-2ca82405a960@iogearbox.net>

On Fri, Feb 15, 2019 at 6:09 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 02/08/2019 02:05 PM, Magnus Karlsson wrote:
> > This commit adds AF_XDP support to libbpf. The main reason for this is
> > to facilitate writing applications that use AF_XDP by offering
> > higher-level APIs that hide many of the details of the AF_XDP
> > uapi. This is in the same vein as libbpf facilitates XDP adoption by
> > offering easy-to-use higher level interfaces of XDP
> > functionality. Hopefully this will facilitate adoption of AF_XDP, make
> > applications using it simpler and smaller, and finally also make it
> > possible for applications to benefit from optimizations in the AF_XDP
> > user space access code. Previously, people just copied and pasted the
> > code from the sample application into their application, which is not
> > desirable.
> >
> > The interface is composed of two parts:
> >
> > * Low-level access interface to the four rings and the packet
> > * High-level control plane interface for creating and setting
> >   up umems and af_xdp sockets as well as a simple XDP program.
> >
> > Tested-by: Björn Töpel <bjorn.topel@intel.com>
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> [...]
> > diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> > new file mode 100644
> > index 0000000..a982a76
> > --- /dev/null
> > +++ b/tools/lib/bpf/xsk.c
> > @@ -0,0 +1,742 @@
> > +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
> > +
> > +/*
> > + * AF_XDP user-space access library.
> > + *
> > + * Copyright(c) 2018 - 2019 Intel Corporation.
> > + *
> > + * Author(s): Magnus Karlsson <magnus.karlsson@intel.com>
> > + */
> > +
> > +#include <errno.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +#include <arpa/inet.h>
> > +#include <asm/barrier.h>
> > +#include <linux/compiler.h>
> > +#include <linux/filter.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_link.h>
> > +#include <linux/if_packet.h>
> > +#include <linux/if_xdp.h>
> > +#include <linux/rtnetlink.h>
> > +#include <net/if.h>
> > +#include <sys/mman.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +
> > +#include "bpf.h"
> > +#include "libbpf.h"
> > +#include "libbpf_util.h"
> > +#include "nlattr.h"
> > +#include "xsk.h"
> > +
> > +#ifndef SOL_XDP
> > + #define SOL_XDP 283
> > +#endif
> > +
> > +#ifndef AF_XDP
> > + #define AF_XDP 44
> > +#endif
> > +
> > +#ifndef PF_XDP
> > + #define PF_XDP AF_XDP
> > +#endif
> > +
> > +struct xsk_umem {
> > +     struct xsk_ring_prod *fill;
> > +     struct xsk_ring_cons *comp;
> > +     char *umem_area;
> > +     struct xsk_umem_config config;
> > +     int fd;
> > +     int refcount;
> > +};
> > +
> > +struct xsk_socket {
> > +     struct xsk_ring_cons *rx;
> > +     struct xsk_ring_prod *tx;
> > +     __u64 outstanding_tx;
> > +     struct xsk_umem *umem;
> > +     struct xsk_socket_config config;
> > +     int fd;
> > +     int xsks_map;
> > +     int ifindex;
> > +     int prog_fd;
> > +     int qidconf_map_fd;
> > +     int xsks_map_fd;
> > +     __u32 queue_id;
> > +};
> > +
> > +struct xsk_nl_info {
> > +     bool xdp_prog_attached;
> > +     int ifindex;
> > +     int fd;
> > +};
> > +
> > +#define MAX_QUEUES 128
>
> Why is this a fixed constant here, shouldn't this be dynamic due to being NIC
> specific anyway?

It was only here for simplicity. If a NIC had more queues, it would
require a recompile of the lib. Obviously, not desirable in a distro.
What I could do is to read the max "combined" queues (pre-set maximum
in the ethtool output) from the same interface as ethool uses and size
the array after that. Or is there a simpler way? What to do if the NIC
does not have a "combined", or is there no such NIC (seems the common
HW ones set this)?

> [...]
> > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr)
> > +{
> > +     return &((char *)(umem->umem_area))[addr];
> > +}
>
> There's also a xsk_umem__get_data_raw() doing the same. Why having both, resp.
> when to choose which? ;)

There is enough to have the xsk_umem__get_data_raw() function.
xsk_umem__get_data() is just a convenience function for which the
application does not have to store the beginning of the umem. But as
the application always has to provide this anyway in the
xsk_umem__create() function, it might as well store this pointer. I
will delete xsk_umem__get_data() and rename xsk_umem__get_data_raw()
to xsk_umem__get_data().

> > +int xsk_umem__fd(const struct xsk_umem *umem)
> > +{
> > +     return umem ? umem->fd : -EINVAL;
> > +}
> > +
> > +int xsk_socket__fd(const struct xsk_socket *xsk)
> > +{
> > +     return xsk ? xsk->fd : -EINVAL;
> > +}
> > +
> > +static bool xsk_page_aligned(void *buffer)
> > +{
> > +     unsigned long addr = (unsigned long)buffer;
> > +
> > +     return !(addr & (getpagesize() - 1));
> > +}
> > +
> > +static void xsk_set_umem_config(struct xsk_umem_config *cfg,
> > +                             const struct xsk_umem_config *usr_cfg)
> > +{
> > +     if (!usr_cfg) {
> > +             cfg->fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> > +             cfg->comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> > +             cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
> > +             cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
> > +             return;
> > +     }
> > +
> > +     cfg->fill_size = usr_cfg->fill_size;
> > +     cfg->comp_size = usr_cfg->comp_size;
> > +     cfg->frame_size = usr_cfg->frame_size;
> > +     cfg->frame_headroom = usr_cfg->frame_headroom;
>
> Just optional nit, might be a bit nicer to have it in this form:
>
>         cfg->fill_size = usr_cfg ? usr_cfg->fill_size :
>                          XSK_RING_PROD__DEFAULT_NUM_DESCS;

I actually think the current form is clearer when there are multiple
lines. If there was only one line, I would agree with you.

> > +}
> > +
> > +static void xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
> > +                                   const struct xsk_socket_config *usr_cfg)
> > +{
> > +     if (!usr_cfg) {
> > +             cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> > +             cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> > +             cfg->libbpf_flags = 0;
> > +             cfg->xdp_flags = 0;
> > +             cfg->bind_flags = 0;
> > +             return;
> > +     }
> > +
> > +     cfg->rx_size = usr_cfg->rx_size;
> > +     cfg->tx_size = usr_cfg->tx_size;
> > +     cfg->libbpf_flags = usr_cfg->libbpf_flags;
> > +     cfg->xdp_flags = usr_cfg->xdp_flags;
> > +     cfg->bind_flags = usr_cfg->bind_flags;
>
> (Ditto)
>
> > +}
> > +
> > +int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
> > +                  struct xsk_ring_prod *fill, struct xsk_ring_cons *comp,
> > +                  const struct xsk_umem_config *usr_config)
> > +{
> > +     struct xdp_mmap_offsets off;
> > +     struct xdp_umem_reg mr;
> > +     struct xsk_umem *umem;
> > +     socklen_t optlen;
> > +     void *map;
> > +     int err;
> > +
> > +     if (!umem_area || !umem_ptr || !fill || !comp)
> > +             return -EFAULT;
> > +     if (!size && !xsk_page_aligned(umem_area))
> > +             return -EINVAL;
> > +
> > +     umem = calloc(1, sizeof(*umem));
> > +     if (!umem)
> > +             return -ENOMEM;
> > +
> > +     umem->fd = socket(AF_XDP, SOCK_RAW, 0);
> > +     if (umem->fd < 0) {
> > +             err = -errno;
> > +             goto out_umem_alloc;
> > +     }
> > +
> > +     umem->umem_area = umem_area;
> > +     xsk_set_umem_config(&umem->config, usr_config);
> > +
> > +     mr.addr = (uintptr_t)umem_area;
> > +     mr.len = size;
> > +     mr.chunk_size = umem->config.frame_size;
> > +     mr.headroom = umem->config.frame_headroom;
> > +
> > +     err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
> > +     if (err) {
> > +             err = -errno;
> > +             goto out_socket;
> > +     }
> > +     err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_FILL_RING,
> > +                      &umem->config.fill_size,
> > +                      sizeof(umem->config.fill_size));
> > +     if (err) {
> > +             err = -errno;
> > +             goto out_socket;
> > +     }
> > +     err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
> > +                      &umem->config.comp_size,
> > +                      sizeof(umem->config.comp_size));
> > +     if (err) {
> > +             err = -errno;
> > +             goto out_socket;
> > +     }
> > +
> > +     optlen = sizeof(off);
> > +     err = getsockopt(umem->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
> > +     if (err) {
> > +             err = -errno;
> > +             goto out_socket;
> > +     }
> > +
> > +     map = xsk_mmap(NULL, off.fr.desc +
> > +                    umem->config.fill_size * sizeof(__u64),
> > +                    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> > +                    umem->fd, XDP_UMEM_PGOFF_FILL_RING);
> > +     if (map == MAP_FAILED) {
> > +             err = -errno;
> > +             goto out_socket;
> > +     }
> > +
> > +     umem->fill = fill;
> > +     fill->mask = umem->config.fill_size - 1;
> > +     fill->size = umem->config.fill_size;
> > +     fill->producer = map + off.fr.producer;
> > +     fill->consumer = map + off.fr.consumer;
> > +     fill->ring = map + off.fr.desc;
> > +     fill->cached_cons = umem->config.fill_size;
> > +
> > +     map = xsk_mmap(NULL,
> > +                    off.cr.desc + umem->config.comp_size * sizeof(__u64),
> > +                    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> > +                    umem->fd, XDP_UMEM_PGOFF_COMPLETION_RING);
> > +     if (map == MAP_FAILED) {
> > +             err = -errno;
> > +             goto out_mmap;
> > +     }
> > +
> > +     umem->comp = comp;
> > +     comp->mask = umem->config.comp_size - 1;
> > +     comp->size = umem->config.comp_size;
> > +     comp->producer = map + off.cr.producer;
> > +     comp->consumer = map + off.cr.consumer;
> > +     comp->ring = map + off.cr.desc;
> > +
> > +     *umem_ptr = umem;
> > +     return 0;
> > +
> > +out_mmap:
> > +     munmap(umem->fill,
> > +            off.fr.desc + umem->config.fill_size * sizeof(__u64));
> > +out_socket:
> > +     close(umem->fd);
> > +out_umem_alloc:
> > +     free(umem);
> > +     return err;
> > +}
> > +
> > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb)
> > +{
> > +     struct nlattr *tb_parsed[IFLA_XDP_MAX + 1];
> > +     struct xsk_nl_info *nl_info = cookie;
> > +     struct ifinfomsg *ifinfo = msg;
> > +     unsigned char mode;
> > +     int err;
> > +
> > +     if (nl_info->ifindex && nl_info->ifindex != ifinfo->ifi_index)
> > +             return 0;
> > +
> > +     if (!tb[IFLA_XDP])
> > +             return 0;
> > +
> > +     err = libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_XDP],
> > +                                   NULL);
> > +     if (err)
> > +             return err;
> > +
> > +     if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD])
> > +             return 0;
> > +
> > +     mode = libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]);
> > +     if (mode == XDP_ATTACHED_NONE)
> > +             return 0;
> > +
> > +     nl_info->xdp_prog_attached = true;
> > +     nl_info->fd = libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]);
>
> Hm, I don't think this works if I read the intention of this helper correctly.
>
> IFLA_XDP_FD is never set for retrieving the prog from the kernel. So the
> above is a bug.
>
> We also have bpf_get_link_xdp_id(). This should probably just be reused in
> this context here.

If bpf_get_link_xdp_id() will fit my bill, I will happily use it. I
will check it out and hopefully I can drop all this code. Thanks.

> > +     return 0;
> > +}
> > +
> > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk)
> > +{
> > +     struct xsk_nl_info nl_info;
> > +     unsigned int nl_pid;
> > +     char err_buf[256];
> > +     int sock, err;
> > +
> > +     sock = libbpf_netlink_open(&nl_pid);
> > +     if (sock < 0)
> > +             return false;
> > +
> > +     nl_info.xdp_prog_attached = false;
> > +     nl_info.ifindex = xsk->ifindex;
> > +     nl_info.fd = -1;
> > +
> > +     err = libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info);
> > +     if (err) {
> > +             libbpf_strerror(err, err_buf, sizeof(err_buf));
> > +             pr_warning("Error:\n%s\n", err_buf);
> > +             close(sock);
> > +             return false;
> > +     }
> > +
> > +     close(sock);
> > +     xsk->prog_fd = nl_info.fd;
> > +     return nl_info.xdp_prog_attached;
> > +}
>
> (See bpf_get_link_xdp_id().)
>
> > +
> > +static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> > +{
> > +     char bpf_log_buf[BPF_LOG_BUF_SIZE];
> > +     int err, prog_fd;
> > +
> > +     /* This is the C-program:
> > +      * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> > +      * {
> > +      *     int *qidconf, index = ctx->rx_queue_index;
> [...]
> > +
> > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> > +                    __u32 queue_id, struct xsk_umem *umem,
> > +                    struct xsk_ring_cons *rx, struct xsk_ring_prod *tx,
> > +                    const struct xsk_socket_config *usr_config)
> > +{
> > +     struct sockaddr_xdp sxdp = {};
> > +     struct xdp_mmap_offsets off;
> > +     struct xsk_socket *xsk;
> > +     socklen_t optlen;
> > +     void *map;
> > +     int err;
> > +
> > +     if (!umem || !xsk_ptr || !rx || !tx)
> > +             return -EFAULT;
> > +
> > +     if (umem->refcount) {
> > +             pr_warning("Error: shared umems not supported by libbpf.\n");
> > +             return -EBUSY;
> > +     }
> > +
> > +     xsk = calloc(1, sizeof(*xsk));
> > +     if (!xsk)
> > +             return -ENOMEM;
> > +
> > +     if (umem->refcount++ > 0) {
>
> Should this refcount rather be atomic actually?

Neither our config nor data plane interfaces are reentrant for
performance reasons. Any concurrency has to be handled explicitly on
the application level. This so that it only penalizes apps that really
need this.

Thanks for all your reviews: Magnus

> > +             xsk->fd = socket(AF_XDP, SOCK_RAW, 0);
> > +             if (xsk->fd < 0) {
> > +                     err = -errno;
> > +                     goto out_xsk_alloc;
> > +             }
> > +     } else {
> > +             xsk->fd = umem->fd;
> > +     }
> > +
> > +     xsk->outstanding_tx = 0;
> > +     xsk->queue_id = queue_id;
> > +     xsk->umem = umem;
> > +     xsk->ifindex = if_nametoindex(ifname);
> > +     if (!xsk->ifindex) {
> > +             err = -errno;
> > +             goto out_socket;
> > +     }
> > +
> > +     xsk_set_xdp_socket_config(&xsk->config, usr_config);
> [...]

  reply	other threads:[~2019-02-18  8:59 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-08 13:05 [PATCH bpf-next v4 0/2] libbpf: adding AF_XDP support Magnus Karlsson
2019-02-08 13:05 ` [PATCH bpf-next v4 1/2] libbpf: add support for using AF_XDP sockets Magnus Karlsson
2019-02-15 16:37   ` Daniel Borkmann
2019-02-18  8:59     ` Magnus Karlsson [this message]
2019-02-18 11:21       ` Maciej Fijalkowski
2019-02-08 13:05 ` [PATCH bpf-next v4 2/2] samples/bpf: convert xdpsock to use libbpf for AF_XDP access Magnus Karlsson
2019-02-11  6:33 ` [PATCH bpf-next v4 0/2] libbpf: adding AF_XDP support Jean-Mickael Guerin
2019-02-11  7:52   ` Magnus Karlsson
2019-02-11 19:48 ` Jonathan Lemon
2019-02-13 11:32   ` Magnus Karlsson
2019-02-13 11:55     ` Jesper Dangaard Brouer
2019-02-15 16:20       ` Daniel Borkmann
2019-02-18  8:20         ` Magnus Karlsson
2019-02-18  9:38           ` Daniel Borkmann
2019-02-18 10:09             ` Magnus Karlsson
2019-02-13 20:49     ` Jonathan Lemon
2019-02-14  8:25       ` Magnus Karlsson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJ8uoz1xjv8OBAn==S3EBiGvTqp_dd_t7YQqp2TOSR6JSOdrnA@mail.gmail.com' \
    --to=magnus.karlsson@gmail.com \
    --cc=ast@kernel.org \
    --cc=bjorn.topel@gmail.com \
    --cc=bjorn.topel@intel.com \
    --cc=brouer@redhat.com \
    --cc=daniel@iogearbox.net \
    --cc=jakub.kicinski@netronome.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=qi.z.zhang@intel.com \
    --cc=xiaolong.ye@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).