From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3DE98C433EF for ; Tue, 28 Jun 2022 17:54:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229721AbiF1Rx7 (ORCPT ); Tue, 28 Jun 2022 13:53:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42180 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230292AbiF1Rx6 (ORCPT ); Tue, 28 Jun 2022 13:53:58 -0400 Received: from mail-pg1-x52e.google.com (mail-pg1-x52e.google.com [IPv6:2607:f8b0:4864:20::52e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6FBEE2E3 for ; Tue, 28 Jun 2022 10:53:57 -0700 (PDT) Received: by mail-pg1-x52e.google.com with SMTP id 23so12887182pgc.8 for ; Tue, 28 Jun 2022 10:53:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=BawOF30jreBw6h+htpf+3bv16wUJ+LKNYVml1Jeqpw0=; b=teopuH3YajUvyRSjpzuJ0w6HC+e25mYt3gIiW4Oz6o7f9migqOsNznrGpM3E/faGW9 hA954PfTeCDeExUSQWdZbedOjy/LkDa6k48rXl/Z88jdBJnhzfj68ZK6ICc7PS/Y87jz kmdCF/NlivvaIpNsM4sDUX33e1eXEvwbceF7/mzweKS/8MGvEKysJnhG/GpjJmdOPEEv cy9pKwYKYORdfxehAboJ82yTbcP+6SehWlX+yUD9IesX+mn06mLAv52SqmpK9oitYaCM F1awtQ3XNVOH0dugB3T6vC/QZT/Sg3rvLljQsQQFuaZg5ElBpdV9jGfzHu9XDgMXp8mP XEfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=BawOF30jreBw6h+htpf+3bv16wUJ+LKNYVml1Jeqpw0=; b=GQraST1qCJzUOmBAEorqjPDmhDVBXA5i2MHxh0Ocp2xBt6jmU3IiB3u1bKGcihn7p2 csiz8ppHaO9/Ytj711Ve3QEtHVpc99MqfiLPg3rVJI8OAAq/BHaqlqX+qxa/8BTY64SI XNUU4Zg3mvebL+APVvcNDwI39+A8Y3SlcBN/u4kLZYhXb+L0+D1I7sruJcggI2hWknEA qrN8v2u2UFzRFPZTlJKFSL/lo5RzfTt5neAmL2SrGS36arkM6jz25+0mm7WvnDfnPMhK WmdpMfONVS15EJRF4Kt/tqnWKZZUer37kYHKDmDmQCDLy7JF8h3VPYxaL86a/tcKzGYP y36Q== X-Gm-Message-State: AJIora8wNSAnOkUH+GRgYDsBNtX8ytJiUdlTWgnfYSi7j3g32Cr+mMwy zW5+GqpsRDoH98GdvodG2czDXJXXEkCm/O4QVe33rQ== X-Google-Smtp-Source: AGRyM1sC3ogLLDXNLb1gHLxxs5KL+4rKb/G8HX+uluA6x+uMkIcdcbQdiPIqw344QtLriRX9dDPn+fUdexdtPtQfjLA= X-Received: by 2002:a62:1582:0:b0:525:6361:85cd with SMTP id 124-20020a621582000000b00525636185cdmr4665342pfv.72.1656438836687; Tue, 28 Jun 2022 10:53:56 -0700 (PDT) MIME-Version: 1.0 References: <20220628164529.80050-1-quentin@isovalent.com> In-Reply-To: <20220628164529.80050-1-quentin@isovalent.com> From: Stanislav Fomichev Date: Tue, 28 Jun 2022 10:53:45 -0700 Message-ID: Subject: Re: [PATCH bpf-next] bpftool: Probe for memcg-based accounting before bumping rlimit To: Quentin Monnet Cc: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Yafang Shao , netdev@vger.kernel.org, bpf@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet wrote: > > Bpftool used to bump the memlock rlimit to make sure to be able to load > BPF objects. After the kernel has switched to memcg-based memory > accounting [0] in 5.11, bpftool has relied on libbpf to probe the system > for memcg-based accounting support and for raising the rlimit if > necessary [1]. But this was later reverted, because the probe would > sometimes fail, resulting in bpftool not being able to load all required > objects [2]. > > Here we add a more efficient probe, in bpftool itself. We first lower > the rlimit to 0, then we attempt to load a BPF object (and finally reset > the rlimit): if the load succeeds, then memcg-based memory accounting is > supported. > > This approach was earlier proposed for the probe in libbpf itself [3], > but given that the library may be used in multithreaded applications, > the probe could have undesirable consequences if one thread attempts to > lock kernel memory while memlock rlimit is at 0. Since bpftool is > single-threaded and the rlimit is process-based, this is fine to do in > bpftool itself. > > This probe was inspired by the similar one from the cilium/ebpf Go > library [4]. > > [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'") > [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK") > [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"") > [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/t/#u > [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > Cc: Stanislav Fomichev > Cc: Yafang Shao > Suggested-by: Daniel Borkmann > Signed-off-by: Quentin Monnet > --- > tools/bpf/bpftool/common.c | 71 ++++++++++++++++++++++++++++++++++-- > tools/include/linux/kernel.h | 5 +++ > 2 files changed, 73 insertions(+), 3 deletions(-) > > diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > index a0d4acd7c54a..e07769802f76 100644 > --- a/tools/bpf/bpftool/common.c > +++ b/tools/bpf/bpftool/common.c > @@ -13,14 +13,17 @@ > #include > #include > #include > -#include > -#include > #include > #include > #include > #include > #include > > +#include > +#include > +#include > +#include > + > #include > #include > #include /* libbpf_num_possible_cpus */ > @@ -73,11 +76,73 @@ static bool is_bpffs(char *path) > return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > } > > +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to > + * memcg-based memory accounting for BPF maps and programs. This was done in > + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory > + * accounting'"), in Linux 5.11. > + * > + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does > + * so by checking for the availability of a given BPF helper and this has > + * failed on some kernels with backports in the past, see commit 6b4384ff1088 > + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK""). > + * Instead, we can probe by lowering the process-based rlimit to 0, trying to > + * load a BPF object, and resetting the rlimit. If the load succeeds then > + * memcg-based accounting is supported. > + * > + * This would be too dangerous to do in the library, because multithreaded > + * applications might attempt to load items while the rlimit is at 0. Given > + * that bpftool is single-threaded, this is fine to do here. > + */ > +static bool known_to_need_rlimit(void) > +{ > + const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd); nit: Any specific reason you're hard coding this sz via offseofend? Why not use sizeof(bpf_attr) directly as a syscall/memset size? The kernel should handle all these cases where bpftool has extra zero padding, right? > + struct bpf_insn insns[] = { > + BPF_EXIT_INSN(), > + }; > + struct rlimit rlim_init, rlim_cur_zero = {}; > + size_t insn_cnt = ARRAY_SIZE(insns); > + union bpf_attr attr; > + int prog_fd, err; > + > + memset(&attr, 0, prog_load_attr_sz); > + attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER; > + attr.insns = ptr_to_u64(insns); > + attr.insn_cnt = insn_cnt; > + attr.license = ptr_to_u64("GPL"); > + > + if (getrlimit(RLIMIT_MEMLOCK, &rlim_init)) > + return false; > + > + /* Drop the soft limit to zero. We maintain the hard limit to its > + * current value, because lowering it would be a permanent operation > + * for unprivileged users. > + */ > + rlim_cur_zero.rlim_max = rlim_init.rlim_max; > + if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero)) > + return false; > + > + /* Do not use bpf_prog_load() from libbpf here, because it calls > + * bump_rlimit_memlock(), interfering with the current probe. > + */ > + prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, prog_load_attr_sz); > + err = errno; > + > + /* reset soft rlimit to its initial value */ > + setrlimit(RLIMIT_MEMLOCK, &rlim_init); > + > + if (prog_fd < 0) > + return err == EPERM; > + > + close(prog_fd); > + return false; > +} > + > void set_max_rlimit(void) > { > struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > > - setrlimit(RLIMIT_MEMLOCK, &rinf); > + if (known_to_need_rlimit()) > + setrlimit(RLIMIT_MEMLOCK, &rinf); > } > > static int > diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h > index 4b0673bf52c2..5c90d65cc2d3 100644 > --- a/tools/include/linux/kernel.h > +++ b/tools/include/linux/kernel.h > @@ -24,6 +24,11 @@ > #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) > #endif > > +#ifndef offsetofend > +# define offsetofend(TYPE, FIELD) \ > + (offsetof(TYPE, FIELD) + sizeof(((TYPE *)0)->FIELD)) > +#endif > + > #ifndef container_of > /** > * container_of - cast a member of a structure out to the containing structure > -- > 2.34.1 >