Re: [PATCH v2 bpf-next 2/7] libbpf: wire up USDT API and bpf_link integration

From: Dave Marchevsky <davemarchevsky@fb.com>
To: Andrii Nakryiko <andrii@kernel.org>,
	bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net
Cc: kernel-team@fb.com, Alan Maguire <alan.maguire@oracle.com>,
	Hengqi Chen <hengqi.chen@gmail.com>
Subject: Re: [PATCH v2 bpf-next 2/7] libbpf: wire up USDT API and bpf_link integration
Date: Sun, 3 Apr 2022 23:12:30 -0400	[thread overview]
Message-ID: <22359fb1-33a2-ee2c-4300-a07b175825e6@fb.com> (raw)
In-Reply-To: <20220402002944.382019-3-andrii@kernel.org>

On 4/1/22 8:29 PM, Andrii Nakryiko wrote:   
> Wire up libbpf USDT support APIs without yet implementing all the
> nitty-gritty details of USDT discovery, spec parsing, and BPF map
> initialization.
> 
> User-visible user-space API is simple and is conceptually very similar
> to uprobe API.
> 
> bpf_program__attach_usdt() API allows to programmatically attach given
> BPF program to a USDT, specified through binary path (executable or
> shared lib), USDT provider and name. Also, just like in uprobe case, PID
> filter is specified (0 - self, -1 - any process, or specific PID).
> Optionally, USDT cookie value can be specified. Such single API
> invocation will try to discover given USDT in specified binary and will
> use (potentially many) BPF uprobes to attach this program in correct
> locations.
> 
> Just like any bpf_program__attach_xxx() APIs, bpf_link is returned that
> represents this attachment. It is a virtual BPF link that doesn't have
> direct kernel object, as it can consist of multiple underlying BPF
> uprobe links. As such, attachment is not atomic operation and there can
> be brief moment when some USDT call sites are attached while others are
> still in the process of attaching. This should be taken into
> consideration by user. But bpf_program__attach_usdt() guarantees that
> in the case of success all USDT call sites are successfully attached, or
> all the successfuly attachments will be detached as soon as some USDT
> call sites failed to be attached. So, in theory, there could be cases of
> failed bpf_program__attach_usdt() call which did trigger few USDT
> program invocations. This is unavoidable due to multi-uprobe nature of
> USDT and has to be handled by user, if it's important to create an
> illusion of atomicity.

It would be useful to be able to control the behavior in response to attach
failure in bpf_program__attach_usdt. Specifically I'd like to be able to
choose between existing "all attaches succeed or entire operation fails" and
"_any_ attach succeeds or entire operation fails". Few reasons for this:

 * Tools like BOLT were not playing nicely with USDTs for some time ([0],[1])
 * BCC's logic was changed to support more granular 'attach failure' logic ([2])
 * At FB I still see some multi-probe USDTs with incorrect-looking locations on
   some of the probes

Note that my change for 2nd bullet was to handle ".so in shortlived process"
usecase, which this lib handles by properly supporting pid = -1. But it's since
come in handy to avoid 3rd bullet's issue from causing trouble.

Production tracing tools would be less brittle if they could control this attach
failure logic.

  [0]: https://github.com/facebookincubator/BOLT/commit/ea49a61463c65775aa796a9ef7a1199f20d2a698
  [1]: https://github.com/facebookincubator/BOLT/commit/93860e02a19227be4963a68aa99ea0e09771052b
  [2]: https://github.com/iovisor/bcc/pull/2476

> USDT BPF programs themselves are marked in BPF source code as either
> SEC("usdt"), in which case they won't be auto-attached through
> skeleton's <skel>__attach() method, or it can have a full definition,
> which follows the spirit of fully-specified uprobes:
> SEC("usdt/<path>:<provider>:<name>"). In the latter case skeleton's
> attach method will attempt auto-attachment. Similarly, generic
> bpf_program__attach() will have enought information to go off of for
> parameterless attachment.
> 
> USDT BPF programs are actually uprobes, and as such for kernel they are
> marked as BPF_PROG_TYPE_KPROBE.
> 
> Another part of this patch is USDT-related feature probing:
>   - BPF cookie support detection from user-space;
>   - detection of kernel support for auto-refcounting of USDT semaphore.
> 
> The latter is optional. If kernel doesn't support such feature and USDT
> doesn't rely on USDT semaphores, no error is returned. But if libbpf
> detects that USDT requires setting semaphores and kernel doesn't support
> this, libbpf errors out with explicit pr_warn() message. Libbpf doesn't
> support poking process's memory directly to increment semaphore value,
> like BCC does on legacy kernels, due to inherent raciness and danger of
> such process memory manipulation. Libbpf let's kernel take care of this
> properly or gives up.
> 
> Logistically, all the extra USDT-related infrastructure of libbpf is put
> into a separate usdt.c file and abstracted behind struct usdt_manager.
> Each bpf_object has lazily-initialized usdt_manager pointer, which is
> only instantiated if USDT programs are attempted to be attached. Closing
> BPF object frees up usdt_manager resources. usdt_manager keeps track of
> USDT spec ID assignment and few other small things.
> 
> Subsequent patches will fill out remaining missing pieces of USDT
> initialization and setup logic.
> 
> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  tools/lib/bpf/Build             |   3 +-
>  tools/lib/bpf/libbpf.c          | 100 +++++++-
>  tools/lib/bpf/libbpf.h          |  31 +++
>  tools/lib/bpf/libbpf.map        |   1 +
>  tools/lib/bpf/libbpf_internal.h |  19 ++
>  tools/lib/bpf/usdt.c            | 426 ++++++++++++++++++++++++++++++++
>  6 files changed, 571 insertions(+), 9 deletions(-)
>  create mode 100644 tools/lib/bpf/usdt.c

[...]

> +static int attach_usdt(const struct bpf_program *prog, long cookie, struct bpf_link **link)
> +{
> +	char *path = NULL, *provider = NULL, *name = NULL;
> +	const char *sec_name;
> +	int n, err;
> +
> +	sec_name = bpf_program__section_name(prog);
> +	if (strcmp(sec_name, "usdt") == 0) {
> +		/* no auto-attach for just SEC("usdt") */
> +		*link = NULL;
> +		return 0;
> +	}
> +
> +	n = sscanf(sec_name, "usdt/%m[^:]:%m[^:]:%m[^:]", &path, &provider, &name);

TIL %m

> +	if (n != 3) {
> +		pr_warn("invalid section '%s', expected SEC(\"usdt/<path>:<provider>:<name>\")\n",
> +			sec_name);
> +		err = -EINVAL;
> +	} else {
> +		*link = bpf_program__attach_usdt(prog, -1 /* any process */, path,
> +						 provider, name, NULL);
> +		err = libbpf_get_error(*link);
> +	}
> +	free(path);
> +	free(provider);
> +	free(name);
> +	return err;
> +}

[...]

> diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
> new file mode 100644
> index 000000000000..9476f7a15769
> --- /dev/null
> +++ b/tools/lib/bpf/usdt.c
> @@ -0,0 +1,426 @@
> +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +#include <ctype.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <libelf.h>
> +#include <gelf.h>
> +#include <unistd.h>
> +#include <linux/ptrace.h>
> +#include <linux/kernel.h>
> +
> +#include "bpf.h"
> +#include "libbpf.h"
> +#include "libbpf_common.h"
> +#include "libbpf_internal.h"
> +#include "hashmap.h"
> +
> +/* libbpf's USDT support consists of BPF-side state/code and user-space
> + * state/code working together in concert. BPF-side parts are defined in
> + * usdt.bpf.h header library. User-space state is encapsulated by struct
> + * usdt_manager and all the supporting code centered around usdt_manager.
> + *
> + * usdt.bpf.h defines two BPF maps that usdt_manager expects: USDT spec map
> + * and IP-to-spec-ID map, which is auxiliary map necessary for kernels that
> + * don't support BPF cookie (see below). These two maps are implicitly
> + * embedded into user's end BPF object file when user's code included
> + * usdt.bpf.h. This means that libbpf doesn't do anything special to create
> + * these USDT support maps. They are created by normal libbpf logic of
> + * instantiating BPF maps when opening and loading BPF object.
> + *
> + * As such, libbpf is basically unaware of the need to do anything
> + * USDT-related until the very first call to bpf_program__attach_usdt(), which
> + * can be called by user explicitly or happen automatically during skeleton
> + * attach (or, equivalently, through generic bpf_program__attach() call). At
> + * this point, libbpf will instantiate and initialize struct usdt_manager and
> + * store it in bpf_object. USDT manager is per-BPF object construct, as each
> + * independent BPF object might or might not have USDT programs, and thus all
> + * the expected USDT-related state. There is no coordination between two
> + * bpf_object in parts of USDT attachment, they are oblivious of each other's
> + * existence and libbpf is just oblivious, dealing with bpf_object-specific
> + * USDT state.
> + *
> + * Quick crash course on USDTs.
> + *
> + * From user-space application's point of view, USDT is essentially just
> + * a slightly special function call that normally has zero overhead, unless it

Maybe better to claim 'low overhead' instead of 'zero' here and elsewhere?
Or elaborate about the overhead more explicitly. Agreed that it's so low as to
effectively be zero in most cases, but someone somewhere is going to put the nop
in a 4096-bytes-of-instr tight loop, see changed iTLB behavior, and get
frustrated.

A contrived scenario to be sure, but no other USDT documentation I can find
claims zero overhead, rather 'low', guessing for a good reason.

> + * is being traced by some external entity (e.g, BPF-based tool). Here's how
> + * a typical application can trigger USDT probe:
> + *
> + * #include <sys/sdt.h>  // provided by systemtap-sdt-devel package
> + * // folly also provide similar functionality in folly/tracing/StaticTracepoint.h
> + *
> + * STAP_PROBE3(my_usdt_provider, my_usdt_probe_name, 123, x, &y);
> + *
> + * USDT is identified by it's <provider-name>:<probe-name> pair of names. Each
> + * individual USDT has a fixed number of arguments (3 in the above example)
> + * and specifies values of each argument as if it was a function call.
> + *
> + * USDT call is actually not a function call, but is instead replaced by
> + * a single NOP instruction (thus zero overhead, effectively). But in addition

This zero overhead claim bothers me less since NOP is directly mentioned.

> + * to that, those USDT macros generate special SHT_NOTE ELF records in
> + * .note.stapsdt ELF section. Here's an example USDT definition as emitted by
> + * `readelf -n <binary>`:

[...]

> + * Arguments is the most interesting part. This USDT specification string is
> + * providing information about all the USDT arguments and their locations. The
> + * part before @ sign defined byte size of the argument (1, 2, 4, or 8) and
> + * whether the argument is singed or unsigned (negative size means signed).
> + * The part after @ sign is assembly-like definition of argument location.
> + * Technically, assembler can provide some pretty advanced definitions, but

Perhaps it would be more precise to state that argument specifiers can be 'any
operand accepted by Gnu Asm Syntax' (per [0]).

[0]: https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

> + * libbpf is currently supporting three most common cases:
> + *   1) immediate constant, see 5th and 9th args above (-4@$5 and -4@-9);
> + *   2) register value, e.g., 8@%rdx, which means "unsigned 8-byte integer
> + *      whose value is in register %rdx";
> + *   3) memory dereference addressed by register, e.g., -4@-1204(%rbp), which
> + *      specifies signed 32-bit integer stored at offset -1204 bytes from
> + *      memory address stored in %rbp.

[...]