From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8F880C433F5 for ; Tue, 14 Dec 2021 21:09:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230481AbhLNVJX (ORCPT ); Tue, 14 Dec 2021 16:09:23 -0500 Received: from www62.your-server.de ([213.133.104.62]:57024 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230480AbhLNVJX (ORCPT ); Tue, 14 Dec 2021 16:09:23 -0500 Received: from sslproxy02.your-server.de ([78.47.166.47]) by www62.your-server.de with esmtpsa (TLSv1.3:TLS_AES_256_GCM_SHA384:256) (Exim 4.92.3) (envelope-from ) id 1mxF31-000Dxo-6W; Tue, 14 Dec 2021 22:09:19 +0100 Received: from [85.1.206.226] (helo=linux.home) by sslproxy02.your-server.de with esmtpsa (TLSv1.3:TLS_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1mxF30-0003rN-Vk; Tue, 14 Dec 2021 22:09:19 +0100 Subject: Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF To: Alexei Starovoitov , Andrii Nakryiko Cc: Andrii Nakryiko , bpf , Alexei Starovoitov , Kernel Team , =?UTF-8?Q?Toke_H=c3=b8iland-J=c3=b8rgensen?= References: <20211214004856.3785613-1-andrii@kernel.org> <20211214004856.3785613-2-andrii@kernel.org> <177504f5-c89a-a05e-8542-9c326d9a10c1@iogearbox.net> <9656836e-f9ea-f1ff-80c2-f4aba51f0d8d@fb.com> <35b54575-ede0-e9b3-ad3c-4b18ffe0089e@fb.com> From: Daniel Borkmann Message-ID: <59860cf4-2113-9e8a-3edb-1b7a19dff2c6@iogearbox.net> Date: Tue, 14 Dec 2021 22:09:18 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2 MIME-Version: 1.0 In-Reply-To: <35b54575-ede0-e9b3-ad3c-4b18ffe0089e@fb.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.103.3/26387/Tue Dec 14 10:33:30 2021) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On 12/14/21 9:51 PM, Alexei Starovoitov wrote: > On 12/14/21 10:31 AM, Andrii Nakryiko wrote: >> On Tue, Dec 14, 2021 at 9:58 AM Alexei Starovoitov wrote: >>> On 12/14/21 9:51 AM, Andrii Nakryiko wrote: >>>> On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann wrote: >>>>> On 12/14/21 1:48 AM, Andrii Nakryiko wrote: >>>>>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is >>>>>> one of the first extremely frustrating gotchas that all new BPF users go >>>>>> through and in some cases have to learn it a very hard way. >>>>>> >>>>>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem >>>>>> dropped the dependency on memlock and uses memcg-based memory accounting >>>>>> instead. Unfortunately, detecting memcg-based BPF memory accounting is >>>>>> far from trivial (as can be evidenced by this patch), so in practice >>>>>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase. >>>>>> >>>>>> As we move towards libbpf 1.0, it would be good to allow users to forget >>>>>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment >>>>>> automatically. This patch paves the way forward in this matter. Libbpf >>>>>> will do feature detection of memcg-based accounting, and if detected, >>>>>> will do nothing. But if the kernel is too old, just like BCC, libbpf >>>>>> will automatically increase RLIMIT_MEMLOCK on behalf of user >>>>>> application ([0]). >>>>>> >>>>>> As this is technically a breaking change, during the transition period >>>>>> applications have to opt into libbpf 1.0 mode by setting >>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling >>>>>> libbpf_set_strict_mode(). >>>>>> >>>>>> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit >>>>>> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do >>>>>> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be >>>>>> called before the first bpf_prog_load(), bpf_btf_load(), or >>>>>> bpf_object__load() call, otherwise it has no effect and will return >>>>>> -EBUSY. >>>>>> >>>>>>      [0] Closes: https://github.com/libbpf/libbpf/issues/369 >>>>>> >>>>>> Signed-off-by: Andrii Nakryiko >>>>> [...] >>>>>> >>>>>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to >>>>>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0]. >>>>>> + * We use the difference in reporting memlock value in BPF map's fdinfo before >>>>>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem >>>>>> + * or not. >>>>>> + * >>>>>> + * Before the change, memlock value for ARRAY map would be calculated as: >>>>>> + * >>>>>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries; >>>>>> + *   memlock = round_up(memlock, PAGE_SIZE); >>>>>> + * >>>>>> + * >>>>>> + * After, memlock is approximated as: >>>>>> + * >>>>>> + *   memlock = round_up(key_size + value_size, 8) * max_entries; >>>>>> + *   memlock = round_up(memlock, PAGE_SIZE); >>>>>> + * >>>>>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300 >>>>>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg >>>>>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with >>>>>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for >>>>>> + * array and doesn't make much difference given 100 byte decrement we use for >>>>>> + * value_size). >>>>>> + * >>>>>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/ >>>>>> + */ >>>>>> +int probe_memcg_account(void) >>>>>> +{ >>>>>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra); >>>>>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz; >>>>>> +     char buf[128]; >>>>>> +     union bpf_attr attr; >>>>>> +     int map_fd; >>>>>> +     FILE *f; >>>>>> + >>>>>> +     memset(&attr, 0, map_create_attr_sz); >>>>>> +     attr.map_type = BPF_MAP_TYPE_ARRAY; >>>>>> +     attr.key_size = 4; >>>>>> +     attr.value_size = page_sz - 100; >>>>>> +     attr.max_entries = 1; >>>>>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz); >>>>>> +     if (map_fd < 0) >>>>>> +             return -errno; >>>>>> + >>>>>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd); >>>>>> +     f = fopen(buf, "r"); >>>>>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) { >>>>>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) { >>>>>> +                     fclose(f); >>>>>> +                     close(map_fd); >>>>>> +                     return memlock_sz == page_sz ? 1 : 0; >>>>>> +             } >>>>>> +     } >>>>>> + >>>>>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume >>>>>> +      * we need setrlimit >>>>>> +      */ >>>>>> +     if (f) >>>>>> +             fclose(f); >>>>>> +     close(map_fd); >>>>>> +     return 0; >>>>>> +} >>>>> >>>>> One other option which might be slightly more robust perhaps could be to probe >>>>> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier >>>>> it might not work with ooo backports, but if its good with RHEL in this specific >>>>> case, we should be covered for 99% of cases. Potentially, we could then still try >>>>> to fallback to the above probing logic? >>>> >>>> Ok, I was originally thinking of probe bpf_sock_from_file() (which was >>>> added after memcg change), but it's PITA. But I see that slightly >>>> before that (but in the same 5.11 release) bpf_ktime_get_coarse_ns() >>> >>> Note that it had fixes after that, so in the kernel version where >> >> You mean 5e0bc3082e2e ("bpf: Forbid bpf_ktime_get_coarse_ns and >> bpf_timer_* in tracing progs"), right? This shouldn't matter if I use >> BPF_PROG_TYPE_SOCKET_FILTER for probing. > > hmm. I guess we allow it in unpriv too. > >> fdinfo parsing approach has unnecessary dependency on PROCFS and is >> more code (and very detailed knowledge of approximation and memlock >> calculation formula). I like ktime_get_coarse_ns approach due to >> minimal amount of code and no reliance on any other kernel config >> besides CONFIG_BPF_SYSCALL. >> >> But in the end I care about the overall feature, not a particular >> implementation of the detection. Should I send >> ktime_get_coarse_ns-based approach or we go with this one? I've >> implemented and tested all three variants already, so no time savings >> are expected either way. > > Either v3 or v4 are fine by me. Let Daniel pick. Same here, I can take in v4 given less dependencies. Thanks, Daniel