From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Alexei Starovoitov <ast@fb.com>
Cc: "Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>, bpf <bpf@vger.kernel.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Kernel Team" <kernel-team@fb.com>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
Date: Tue, 14 Dec 2021 10:31:38 -0800	[thread overview]
Message-ID: <CAEf4BzbmBNRB0sWAxHpSaW6fYMbgrCDm9K=8XScYGa2PEpdsPA@mail.gmail.com> (raw)
In-Reply-To: <9656836e-f9ea-f1ff-80c2-f4aba51f0d8d@fb.com>

On Tue, Dec 14, 2021 at 9:58 AM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 12/14/21 9:51 AM, Andrii Nakryiko wrote:
> > On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
> >>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
> >>> one of the first extremely frustrating gotchas that all new BPF users go
> >>> through and, in some cases, have to learn the hard way.
> >>>
> >>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
> >>> dropped the dependency on memlock and uses memcg-based memory accounting
> >>> instead. Unfortunately, detecting memcg-based BPF memory accounting is
> >>> far from trivial (as can be evidenced by this patch), so in practice
> >>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
> >>>
> >>> As we move towards libbpf 1.0, it would be good to allow users to forget
> >>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
> >>> automatically. This patch paves the way forward in this matter. Libbpf
> >>> will do feature detection of memcg-based accounting, and if detected,
> >>> will do nothing. But if the kernel is too old, just like BCC, libbpf
> >>> will automatically increase RLIMIT_MEMLOCK on behalf of user
> >>> application ([0]).
> >>>
> >>> As this is technically a breaking change, during the transition period
> >>> applications have to opt into libbpf 1.0 mode by setting
> >>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
> >>> libbpf_set_strict_mode().
> >>>
> >>> Libbpf allows controlling the exact RLIMIT_MEMLOCK limit that is set
> >>> with the libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf
> >>> do nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
> >>> called before the first bpf_prog_load(), bpf_btf_load(), or
> >>> bpf_object__load() call; otherwise it has no effect and will return
> >>> -EBUSY.
> >>>
> >>>     [0] Closes: https://github.com/libbpf/libbpf/issues/369
> >>>
> >>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >> [...]
> >>>
> >>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> >>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
> >>> + * We use the difference in reporting memlock value in BPF map's fdinfo before
> >>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
> >>> + * or not.
> >>> + *
> >>> + * Before the change, memlock value for ARRAY map would be calculated as:
> >>> + *
> >>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
> >>> + *   memlock = round_up(memlock, PAGE_SIZE);
> >>> + *
> >>> + *
> >>> + * After, memlock is approximated as:
> >>> + *
> >>> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
> >>> + *   memlock = round_up(memlock, PAGE_SIZE);
> >>> + *
> >>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
> >>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
> >>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
> >>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
> >>> + * array and doesn't make much difference given 100 byte decrement we use for
> >>> + * value_size).
> >>> + *
> >>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> >>> + */
> >>> +int probe_memcg_account(void)
> >>> +{
> >>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
> >>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
> >>> +     char buf[128];
> >>> +     union bpf_attr attr;
> >>> +     int map_fd;
> >>> +     FILE *f;
> >>> +
> >>> +     memset(&attr, 0, map_create_attr_sz);
> >>> +     attr.map_type = BPF_MAP_TYPE_ARRAY;
> >>> +     attr.key_size = 4;
> >>> +     attr.value_size = page_sz - 100;
> >>> +     attr.max_entries = 1;
> >>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
> >>> +     if (map_fd < 0)
> >>> +             return -errno;
> >>> +
> >>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
> >>> +     f = fopen(buf, "r");
> >>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
> >>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
> >>> +                     fclose(f);
> >>> +                     close(map_fd);
> >>> +                     return memlock_sz == page_sz ? 1 : 0;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
> >>> +      * we need setrlimit
> >>> +      */
> >>> +     if (f)
> >>> +             fclose(f);
> >>> +     close(map_fd);
> >>> +     return 0;
> >>> +}
> >>
> >> One other option which might be slightly more robust perhaps could be to probe
> >> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
> >> it might not work with ooo backports, but if its good with RHEL in this specific
> >> case, we should be covered for 99% of cases. Potentially, we could then still try
> >> to fallback to the above probing logic?
> >
> > Ok, I was originally thinking of probing bpf_sock_from_file() (which was
> > added after the memcg change), but it's a PITA. But I see that slightly
> > before that (though in the same 5.11 release) bpf_ktime_get_coarse_ns()
>
> Note that it had fixes after that, so in the kernel version where

You mean 5e0bc3082e2e ("bpf: Forbid bpf_ktime_get_coarse_ns and
bpf_timer_* in tracing progs"), right? This shouldn't matter if I use
BPF_PROG_TYPE_SOCKET_FILTER for probing.

The fdinfo parsing approach has an unnecessary dependency on PROCFS and
is more code (plus very detailed knowledge of the memlock approximation
formula). I like the ktime_get_coarse_ns approach due to the minimal
amount of code and no reliance on any kernel config besides
CONFIG_BPF_SYSCALL.

But in the end I care about the overall feature, not a particular
implementation of the detection. Should I send the
ktime_get_coarse_ns-based approach, or do we go with this one? I've
implemented and tested all three variants already, so no time savings
are expected either way.

> it appeared it may be detected slightly differently than in
> the newer kernels (depending on how far fixes were backported).
> imo I would stick with this array+fdinfo approach.


Thread overview: 9+ messages
2021-12-14  0:48 [PATCH v3 bpf-next 0/2] libbpf: auto-bump RLIMIT_MEMLOCK on old kernels Andrii Nakryiko
2021-12-14  0:48 ` [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF Andrii Nakryiko
2021-12-14 15:09   ` Daniel Borkmann
2021-12-14 17:51     ` Andrii Nakryiko
2021-12-14 17:58       ` Alexei Starovoitov
2021-12-14 18:31         ` Andrii Nakryiko [this message]
2021-12-14 20:51           ` Alexei Starovoitov
2021-12-14 21:09             ` Daniel Borkmann
2021-12-14  0:48 ` [PATCH v3 bpf-next 2/2] selftests/bpf: remove explicit setrlimit(RLIMIT_MEMLOCK) in main selftests Andrii Nakryiko
