* [PATCH v3 bpf-next 0/2] libbpf: auto-bump RLIMIT_MEMLOCK on old kernels
@ 2021-12-14  0:48 Andrii Nakryiko
  2021-12-14  0:48 ` [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF Andrii Nakryiko
  2021-12-14  0:48 ` [PATCH v3 bpf-next 2/2] selftests/bpf: remove explicit setrlimit(RLIMIT_MEMLOCK) in main selftests Andrii Nakryiko
  0 siblings, 2 replies; 9+ messages in thread
From: Andrii Nakryiko @ 2021-12-14  0:48 UTC
  To: bpf, ast, daniel; +Cc: andrii, kernel-team

Make libbpf bump RLIMIT_MEMLOCK, similarly to BCC, if the kernel is too old
to use memcg-based memory accounting for BPF. Patch #2 drops the explicit
setrlimit(RLIMIT_MEMLOCK) calls in test_progs, test_maps, and test_verifier.

v2->v3:
  - use the difference in fdinfo's memlock reporting to detect memcg (worked example below);
v1->v2:
  - fix up out-of-sync comments (Toke).
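
For reference, the arithmetic behind the fdinfo check (formulas are detailed
in patch #1), assuming 4096-byte pages and sizeof(struct bpf_array) of
roughly 300 bytes:

  value_size  = 4096 - 100 = 3996
  pre-memcg:  round_up(300 + round_up(3996, 8) * 1, 4096) = round_up(4300, 4096) = 8192 (two pages)
  with memcg: round_up(round_up(4 + 3996, 8) * 1, 4096)   = round_up(4000, 4096) = 4096 (one page)

A reported memlock of exactly one page therefore signals memcg-based
accounting.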

Andrii Nakryiko (2):
  libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  selftests/bpf: remove explicit setrlimit(RLIMIT_MEMLOCK) in main
    selftests

 tools/lib/bpf/bpf.c                           | 109 ++++++++++++++++++
 tools/lib/bpf/bpf.h                           |   2 +
 tools/lib/bpf/libbpf.c                        |  47 ++------
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/lib/bpf/libbpf_internal.h               |  39 +++++++
 tools/lib/bpf/libbpf_legacy.h                 |  12 +-
 tools/testing/selftests/bpf/bench.c           |  16 ---
 tools/testing/selftests/bpf/prog_tests/btf.c  |   1 -
 .../bpf/prog_tests/select_reuseport.c         |   1 -
 .../selftests/bpf/prog_tests/sk_lookup.c      |   1 -
 .../selftests/bpf/prog_tests/sock_fields.c    |   1 -
 tools/testing/selftests/bpf/test_maps.c       |   1 -
 tools/testing/selftests/bpf/test_progs.c      |   2 -
 tools/testing/selftests/bpf/test_verifier.c   |   4 +-
 14 files changed, 174 insertions(+), 63 deletions(-)

-- 
2.30.2



* [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14  0:48 [PATCH v3 bpf-next 0/2] libbpf: auto-bump RLIMIT_MEMLOCK on old kernels Andrii Nakryiko
@ 2021-12-14  0:48 ` Andrii Nakryiko
  2021-12-14 15:09   ` Daniel Borkmann
  2021-12-14  0:48 ` [PATCH v3 bpf-next 2/2] selftests/bpf: remove explicit setrlimit(RLIMIT_MEMLOCK) in main selftests Andrii Nakryiko
  1 sibling, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2021-12-14  0:48 UTC
  To: bpf, ast, daniel; +Cc: andrii, kernel-team

The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
one of the first extremely frustrating gotchas that all new BPF users go
through and, in some cases, have to learn the hard way.

Luckily, starting with upstream Linux kernel version 5.11, the BPF
subsystem dropped the dependency on memlock and uses memcg-based memory
accounting instead. Unfortunately, detecting memcg-based BPF memory
accounting is far from trivial (as can be evidenced by this patch), so in
practice most BPF applications still do an unconditional RLIMIT_MEMLOCK
increase.

As we move towards libbpf 1.0, it would be good to allow users to forget
about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
automatically. This patch paves the way forward in this matter. Libbpf
will do feature detection of memcg-based accounting, and if detected,
will do nothing. But if the kernel is too old, just like BCC, libbpf
will automatically increase RLIMIT_MEMLOCK on behalf of the user
application ([0]).

As this is technically a breaking change, during the transition period
applications have to opt into libbpf 1.0 mode by setting
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
libbpf_set_strict_mode().

Libbpf allows controlling the exact RLIMIT_MEMLOCK limit that is set,
via the libbpf_set_memlock_rlim() API. Passing 0 will make libbpf do
nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim() has to be
called before the first bpf_prog_load(), bpf_btf_load(), or
bpf_object__load() call; otherwise it has no effect and will return
-EBUSY.

  [0] Closes: https://github.com/libbpf/libbpf/issues/369
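
For illustration, a minimal sketch of how an application would opt in
during the transition period (the 256 MiB cap is an arbitrary example
value, not part of this patch):

  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>
  #include <bpf/libbpf_legacy.h>

  int main(void)
  {
  	/* opt into libbpf 1.0 behavior: auto-bump RLIMIT_MEMLOCK on old kernels */
  	libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK);

  	/* optionally cap the bump instead of the default RLIM_INFINITY;
  	 * must happen before the first prog/map/BTF load, otherwise it
  	 * returns -EBUSY
  	 */
  	libbpf_set_memlock_rlim(256UL * 1024 * 1024);

  	/* ... bpf_object__open()/bpf_object__load() etc. as usual ... */
  	return 0;
  }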

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c             | 109 ++++++++++++++++++++++++++++++++
 tools/lib/bpf/bpf.h             |   2 +
 tools/lib/bpf/libbpf.c          |  47 +++-----------
 tools/lib/bpf/libbpf.map        |   1 +
 tools/lib/bpf/libbpf_internal.h |  39 ++++++++++++
 tools/lib/bpf/libbpf_legacy.h   |  12 +++-
 6 files changed, 171 insertions(+), 39 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6b2407e12060..008460491321 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -29,6 +29,7 @@
 #include <errno.h>
 #include <linux/bpf.h>
 #include <limits.h>
+#include <sys/resource.h>
 #include "bpf.h"
 #include "libbpf.h"
 #include "libbpf_internal.h"
@@ -94,6 +95,106 @@ static inline int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int
 	return fd;
 }
 
+/* Probe whether the kernel switched from memlock-based (RLIMIT_MEMLOCK) to
+ * memcg-based memory accounting for BPF maps and progs. This was done in [0].
+ * We use the difference in the memlock value reported in a BPF map's fdinfo
+ * before and after [0] to detect whether memcg-based accounting is used for
+ * the BPF subsystem or not.
+ *
+ * Before the change, memlock value for ARRAY map would be calculated as:
+ *
+ *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
+ *   memlock = round_up(memlock, PAGE_SIZE);
+ *
+ *
+ * After, memlock is approximated as:
+ *
+ *   memlock = round_up(key_size + value_size, 8) * max_entries;
+ *   memlock = round_up(memlock, PAGE_SIZE);
+ *
+ * In this check we use the fact that sizeof(struct bpf_array) is about 300
+ * bytes, so if we use value_size = (PAGE_SIZE - 100), the pre-memcg
+ * calculation rounds memlock up to 2 * PAGE_SIZE, while the memcg-era
+ * approximation stays at a single PAGE_SIZE (key_size is 4 for an array
+ * and doesn't make much difference given the 100-byte decrement we use
+ * for value_size).
+ *
+ *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
+ */
+int probe_memcg_account(void)
+{
+	const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
+	long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
+	char buf[128];
+	union bpf_attr attr;
+	int map_fd;
+	FILE *f;
+
+	memset(&attr, 0, map_create_attr_sz);
+	attr.map_type = BPF_MAP_TYPE_ARRAY;
+	attr.key_size = 4;
+	attr.value_size = page_sz - 100;
+	attr.max_entries = 1;
+	map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
+	if (map_fd < 0)
+		return -errno;
+
+	sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
+	f = fopen(buf, "r");
+	while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
+		if (sscanf(buf, "memlock: %ld", &memlock_sz) == 1) {
+			fclose(f);
+			close(map_fd);
+			return memlock_sz == page_sz ? 1 : 0;
+		}
+	}
+
+	/* proc FS is disabled or we failed to parse fdinfo properly, assume
+	 * we need setrlimit
+	 */
+	if (f)
+		fclose(f);
+	close(map_fd);
+	return 0;
+}
+
+static bool memlock_bumped;
+static rlim_t memlock_rlim = RLIM_INFINITY;
+
+int libbpf_set_memlock_rlim(size_t memlock_bytes)
+{
+	if (memlock_bumped)
+		return libbpf_err(-EBUSY);
+
+	memlock_rlim = memlock_bytes;
+	return 0;
+}
+
+int bump_rlimit_memlock(void)
+{
+	struct rlimit rlim;
+
+	/* this is the default in libbpf 1.0, but for now the user has to opt in explicitly */
+	if (!(libbpf_mode & LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK))
+		return 0;
+
+	/* if kernel supports memcg-based accounting, skip bumping RLIMIT_MEMLOCK */
+	if (memlock_bumped || kernel_supports(NULL, FEAT_MEMCG_ACCOUNT))
+		return 0;
+
+	memlock_bumped = true;
+
+	/* zero memlock_rlim disables auto-bumping RLIMIT_MEMLOCK */
+	if (memlock_rlim == 0)
+		return 0;
+
+	rlim.rlim_cur = rlim.rlim_max = memlock_rlim;
+	if (setrlimit(RLIMIT_MEMLOCK, &rlim))
+		return -errno;
+
+	return 0;
+}
+
 int bpf_map_create(enum bpf_map_type map_type,
 		   const char *map_name,
 		   __u32 key_size,
@@ -105,6 +206,8 @@ int bpf_map_create(enum bpf_map_type map_type,
 	union bpf_attr attr;
 	int fd;
 
+	bump_rlimit_memlock();
+
 	memset(&attr, 0, attr_sz);
 
 	if (!OPTS_VALID(opts, bpf_map_create_opts))
@@ -251,6 +354,8 @@ int bpf_prog_load_v0_6_0(enum bpf_prog_type prog_type,
 	union bpf_attr attr;
 	char *log_buf;
 
+	bump_rlimit_memlock();
+
 	if (!OPTS_VALID(opts, bpf_prog_load_opts))
 		return libbpf_err(-EINVAL);
 
@@ -456,6 +561,8 @@ int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 	union bpf_attr attr;
 	int fd;
 
+	bump_rlimit_memlock();
+
 	memset(&attr, 0, sizeof(attr));
 	attr.prog_type = type;
 	attr.insn_cnt = (__u32)insns_cnt;
@@ -1056,6 +1163,8 @@ int bpf_btf_load(const void *btf_data, size_t btf_size, const struct bpf_btf_loa
 	__u32 log_level;
 	int fd;
 
+	bump_rlimit_memlock();
+
 	memset(&attr, 0, attr_sz);
 
 	if (!OPTS_VALID(opts, bpf_btf_load_opts))
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 94e553a0ff9d..00619f64a040 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -35,6 +35,8 @@
 extern "C" {
 #endif
 
+LIBBPF_API int libbpf_set_memlock_rlim(size_t memlock_bytes);
+
 struct bpf_map_create_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 902f1ad5b7e6..efd7c2570118 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -187,42 +187,6 @@ const char *libbpf_version_string(void)
 #undef __S
 }
 
-enum kern_feature_id {
-	/* v4.14: kernel support for program & map names. */
-	FEAT_PROG_NAME,
-	/* v5.2: kernel support for global data sections. */
-	FEAT_GLOBAL_DATA,
-	/* BTF support */
-	FEAT_BTF,
-	/* BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO support */
-	FEAT_BTF_FUNC,
-	/* BTF_KIND_VAR and BTF_KIND_DATASEC support */
-	FEAT_BTF_DATASEC,
-	/* BTF_FUNC_GLOBAL is supported */
-	FEAT_BTF_GLOBAL_FUNC,
-	/* BPF_F_MMAPABLE is supported for arrays */
-	FEAT_ARRAY_MMAP,
-	/* kernel support for expected_attach_type in BPF_PROG_LOAD */
-	FEAT_EXP_ATTACH_TYPE,
-	/* bpf_probe_read_{kernel,user}[_str] helpers */
-	FEAT_PROBE_READ_KERN,
-	/* BPF_PROG_BIND_MAP is supported */
-	FEAT_PROG_BIND_MAP,
-	/* Kernel support for module BTFs */
-	FEAT_MODULE_BTF,
-	/* BTF_KIND_FLOAT support */
-	FEAT_BTF_FLOAT,
-	/* BPF perf link support */
-	FEAT_PERF_LINK,
-	/* BTF_KIND_DECL_TAG support */
-	FEAT_BTF_DECL_TAG,
-	/* BTF_KIND_TYPE_TAG support */
-	FEAT_BTF_TYPE_TAG,
-	__FEAT_CNT,
-};
-
-static bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id feat_id);
-
 enum reloc_type {
 	RELO_LD64,
 	RELO_CALL,
@@ -4354,6 +4318,10 @@ bpf_object__probe_loading(struct bpf_object *obj)
 	if (obj->gen_loader)
 		return 0;
 
+	ret = bump_rlimit_memlock();
+	if (ret)
+		pr_warn("Failed to bump RLIMIT_MEMLOCK (err = %d), you might need to do it explicitly!\n", ret);
+
 	/* make sure basic loading works */
 	ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL", insns, insn_cnt, NULL);
 	if (ret < 0)
@@ -4720,14 +4688,17 @@ static struct kern_feature_desc {
 	[FEAT_BTF_TYPE_TAG] = {
 		"BTF_KIND_TYPE_TAG support", probe_kern_btf_type_tag,
 	},
+	[FEAT_MEMCG_ACCOUNT] = {
+		"memcg-based memory accounting", probe_memcg_account,
+	},
 };
 
-static bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id feat_id)
+bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id feat_id)
 {
 	struct kern_feature_desc *feat = &feature_probes[feat_id];
 	int ret;
 
-	if (obj->gen_loader)
+	if (obj && obj->gen_loader)
 		/* To generate loader program assume the latest kernel
 		 * to avoid doing extra prog_load, map_create syscalls.
 		 */
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 4d483af7dba6..b3938b3f8fc9 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -427,4 +427,5 @@ LIBBPF_0.7.0 {
 		bpf_program__log_level;
 		bpf_program__set_log_buf;
 		bpf_program__set_log_level;
+		libbpf_set_memlock_rlim;
 };
diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
index 355c41019aed..0a6754606234 100644
--- a/tools/lib/bpf/libbpf_internal.h
+++ b/tools/lib/bpf/libbpf_internal.h
@@ -272,6 +272,45 @@ static inline bool libbpf_validate_opts(const char *opts,
 					(opts)->sz - __off);		      \
 })
 
+enum kern_feature_id {
+	/* v4.14: kernel support for program & map names. */
+	FEAT_PROG_NAME,
+	/* v5.2: kernel support for global data sections. */
+	FEAT_GLOBAL_DATA,
+	/* BTF support */
+	FEAT_BTF,
+	/* BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO support */
+	FEAT_BTF_FUNC,
+	/* BTF_KIND_VAR and BTF_KIND_DATASEC support */
+	FEAT_BTF_DATASEC,
+	/* BTF_FUNC_GLOBAL is supported */
+	FEAT_BTF_GLOBAL_FUNC,
+	/* BPF_F_MMAPABLE is supported for arrays */
+	FEAT_ARRAY_MMAP,
+	/* kernel support for expected_attach_type in BPF_PROG_LOAD */
+	FEAT_EXP_ATTACH_TYPE,
+	/* bpf_probe_read_{kernel,user}[_str] helpers */
+	FEAT_PROBE_READ_KERN,
+	/* BPF_PROG_BIND_MAP is supported */
+	FEAT_PROG_BIND_MAP,
+	/* Kernel support for module BTFs */
+	FEAT_MODULE_BTF,
+	/* BTF_KIND_FLOAT support */
+	FEAT_BTF_FLOAT,
+	/* BPF perf link support */
+	FEAT_PERF_LINK,
+	/* BTF_KIND_DECL_TAG support */
+	FEAT_BTF_DECL_TAG,
+	/* BTF_KIND_TYPE_TAG support */
+	FEAT_BTF_TYPE_TAG,
+	/* memcg-based accounting for BPF maps and progs */
+	FEAT_MEMCG_ACCOUNT,
+	__FEAT_CNT,
+};
+
+int probe_memcg_account(void);
+bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id feat_id);
+int bump_rlimit_memlock(void);
 
 int parse_cpu_mask_str(const char *s, bool **mask, int *mask_sz);
 int parse_cpu_mask_file(const char *fcpu, bool **mask, int *mask_sz);
diff --git a/tools/lib/bpf/libbpf_legacy.h b/tools/lib/bpf/libbpf_legacy.h
index bb03c568af7b..79131f761a27 100644
--- a/tools/lib/bpf/libbpf_legacy.h
+++ b/tools/lib/bpf/libbpf_legacy.h
@@ -45,7 +45,6 @@ enum libbpf_strict_mode {
 	 * (positive) error code.
 	 */
 	LIBBPF_STRICT_DIRECT_ERRS = 0x02,
-
 	/*
 	 * Enforce strict BPF program section (SEC()) names.
 	 * E.g., while previously SEC("xdp_whatever") or SEC("perf_event_blah") were
@@ -63,6 +62,17 @@ enum libbpf_strict_mode {
 	 * Clients can maintain it on their own if it is valuable for them.
 	 */
 	LIBBPF_STRICT_NO_OBJECT_LIST = 0x08,
+	/*
+	 * Automatically bump RLIMIT_MEMLOCK using setrlimit() before the
+	 * first BPF program or map creation operation. This is done only if
+	 * the kernel is too old to support memcg-based memory accounting for
+	 * the BPF subsystem. By default, the RLIMIT_MEMLOCK limit is set to
+	 * RLIM_INFINITY, but it can be overridden with the
+	 * libbpf_set_memlock_rlim() API. Note that libbpf_set_memlock_rlim()
+	 * needs to be called before the very first bpf_prog_load(),
+	 * bpf_map_create(), or bpf_object__load() operation.
+	 */
+	LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK = 0x10,
 
 	__LIBBPF_STRICT_LAST,
 };
-- 
2.30.2



* [PATCH v3 bpf-next 2/2] selftests/bpf: remove explicit setrlimit(RLIMIT_MEMLOCK) in main selftests
  2021-12-14  0:48 [PATCH v3 bpf-next 0/2] libbpf: auto-bump RLIMIT_MEMLOCK on old kernels Andrii Nakryiko
  2021-12-14  0:48 ` [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF Andrii Nakryiko
@ 2021-12-14  0:48 ` Andrii Nakryiko
  1 sibling, 0 replies; 9+ messages in thread
From: Andrii Nakryiko @ 2021-12-14  0:48 UTC
  To: bpf, ast, daniel; +Cc: andrii, kernel-team

As libbpf is now able to automatically take care of the RLIMIT_MEMLOCK
increase (or skip it altogether on recent enough kernels), remove the
explicit setrlimit() invocations in bench, test_maps, test_verifier, and
test_progs.
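
Concretely, a test's libbpf setup now reduces to the following sketch
(LIBBPF_STRICT_ALL already includes the new
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit), mirroring the bench.c hunk below:

  void setup_libbpf(void)
  {
  	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
  	libbpf_set_print(libbpf_print_fn);
  }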

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/testing/selftests/bpf/bench.c              | 16 ----------------
 tools/testing/selftests/bpf/prog_tests/btf.c     |  1 -
 .../selftests/bpf/prog_tests/select_reuseport.c  |  1 -
 .../testing/selftests/bpf/prog_tests/sk_lookup.c |  1 -
 .../selftests/bpf/prog_tests/sock_fields.c       |  1 -
 tools/testing/selftests/bpf/test_maps.c          |  1 -
 tools/testing/selftests/bpf/test_progs.c         |  2 --
 tools/testing/selftests/bpf/test_verifier.c      |  4 +++-
 8 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index bbb42e2cee0c..f973320e6dbf 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -29,26 +29,10 @@ static int libbpf_print_fn(enum libbpf_print_level level,
 	return vfprintf(stderr, format, args);
 }
 
-static int bump_memlock_rlimit(void)
-{
-	struct rlimit rlim_new = {
-		.rlim_cur	= RLIM_INFINITY,
-		.rlim_max	= RLIM_INFINITY,
-	};
-
-	return setrlimit(RLIMIT_MEMLOCK, &rlim_new);
-}
-
 void setup_libbpf(void)
 {
-	int err;
-
 	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
 	libbpf_set_print(libbpf_print_fn);
-
-	err = bump_memlock_rlimit();
-	if (err)
-		fprintf(stderr, "failed to increase RLIMIT_MEMLOCK: %d", err);
 }
 
 void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns)
diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c
index 01b776a7beeb..8ba53acf9eb4 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf.c
@@ -22,7 +22,6 @@
 #include <bpf/libbpf.h>
 #include <bpf/btf.h>
 
-#include "bpf_rlimit.h"
 #include "bpf_util.h"
 #include "../test_btf.h"
 #include "test_progs.h"
diff --git a/tools/testing/selftests/bpf/prog_tests/select_reuseport.c b/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
index 980ac0f2c0bb..1cbd8cd64044 100644
--- a/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
+++ b/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
@@ -18,7 +18,6 @@
 #include <netinet/in.h>
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
-#include "bpf_rlimit.h"
 #include "bpf_util.h"
 
 #include "test_progs.h"
diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
index 57846cc7ce36..597d0467a926 100644
--- a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
@@ -30,7 +30,6 @@
 #include <bpf/bpf.h>
 
 #include "test_progs.h"
-#include "bpf_rlimit.h"
 #include "bpf_util.h"
 #include "cgroup_helpers.h"
 #include "network_helpers.h"
diff --git a/tools/testing/selftests/bpf/prog_tests/sock_fields.c b/tools/testing/selftests/bpf/prog_tests/sock_fields.c
index fae40db4d81f..9fc040eaa482 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_fields.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_fields.c
@@ -15,7 +15,6 @@
 #include "network_helpers.h"
 #include "cgroup_helpers.h"
 #include "test_progs.h"
-#include "bpf_rlimit.h"
 #include "test_sock_fields.skel.h"
 
 enum bpf_linum_array_idx {
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index f4cd658bbe00..50f7e74ca0b9 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -23,7 +23,6 @@
 #include <bpf/libbpf.h>
 
 #include "bpf_util.h"
-#include "bpf_rlimit.h"
 #include "test_maps.h"
 #include "testing_helpers.h"
 
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 296928948bb9..2ecb73a65206 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -4,7 +4,6 @@
 #define _GNU_SOURCE
 #include "test_progs.h"
 #include "cgroup_helpers.h"
-#include "bpf_rlimit.h"
 #include <argp.h>
 #include <pthread.h>
 #include <sched.h>
@@ -1342,7 +1341,6 @@ int main(int argc, char **argv)
 
 	/* Use libbpf 1.0 API mode */
 	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
-
 	libbpf_set_print(libbpf_print_fn);
 
 	srand(time(NULL));
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index ad5d30bafd93..b0bd2a1f6d52 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -41,7 +41,6 @@
 #  define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS 1
 # endif
 #endif
-#include "bpf_rlimit.h"
 #include "bpf_rand.h"
 #include "bpf_util.h"
 #include "test_btf.h"
@@ -1395,6 +1394,9 @@ int main(int argc, char **argv)
 		return EXIT_FAILURE;
 	}
 
+	/* Use libbpf 1.0 API mode */
+	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
 	bpf_semi_rand_init();
 	return do_test(unpriv, from, to);
 }
-- 
2.30.2



* Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14  0:48 ` [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF Andrii Nakryiko
@ 2021-12-14 15:09   ` Daniel Borkmann
  2021-12-14 17:51     ` Andrii Nakryiko
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Borkmann @ 2021-12-14 15:09 UTC
  To: Andrii Nakryiko, bpf, ast; +Cc: kernel-team, toke

On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
> one of the first extremely frustrating gotchas that all new BPF users go
> through and in some cases have to learn it a very hard way.
> 
> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
> dropped the dependency on memlock and uses memcg-based memory accounting
> instead. Unfortunately, detecting memcg-based BPF memory accounting is
> far from trivial (as can be evidenced by this patch), so in practice
> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
> 
> As we move towards libbpf 1.0, it would be good to allow users to forget
> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
> automatically. This patch paves the way forward in this matter. Libbpf
> will do feature detection of memcg-based accounting, and if detected,
> will do nothing. But if the kernel is too old, just like BCC, libbpf
> will automatically increase RLIMIT_MEMLOCK on behalf of user
> application ([0]).
> 
> As this is technically a breaking change, during the transition period
> applications have to opt into libbpf 1.0 mode by setting
> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
> libbpf_set_strict_mode().
> 
> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
> called before the first bpf_prog_load(), bpf_btf_load(), or
> bpf_object__load() call, otherwise it has no effect and will return
> -EBUSY.
> 
>    [0] Closes: https://github.com/libbpf/libbpf/issues/369
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
[...]
>   
> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
> + * We use the difference in reporting memlock value in BPF map's fdinfo before
> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
> + * or not.
> + *
> + * Before the change, memlock value for ARRAY map would be calculated as:
> + *
> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
> + *   memlock = round_up(memlock, PAGE_SIZE);
> + *
> + *
> + * After, memlock is approximated as:
> + *
> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
> + *   memlock = round_up(memlock, PAGE_SIZE);
> + *
> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
> + * array and doesn't make much difference given 100 byte decrement we use for
> + * value_size).
> + *
> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> + */
> +int probe_memcg_account(void)
> +{
> +	const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
> +	long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
> +	char buf[128];
> +	union bpf_attr attr;
> +	int map_fd;
> +	FILE *f;
> +
> +	memset(&attr, 0, map_create_attr_sz);
> +	attr.map_type = BPF_MAP_TYPE_ARRAY;
> +	attr.key_size = 4;
> +	attr.value_size = page_sz - 100;
> +	attr.max_entries = 1;
> +	map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
> +	if (map_fd < 0)
> +		return -errno;
> +
> +	sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
> +	f = fopen(buf, "r");
> +	while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
> +		if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
> +			fclose(f);
> +			close(map_fd);
> +			return memlock_sz == page_sz ? 1 : 0;
> +		}
> +	}
> +
> +	/* proc FS is disabled or we failed to parse fdinfo properly, assume
> +	 * we need setrlimit
> +	 */
> +	if (f)
> +		fclose(f);
> +	close(map_fd);
> +	return 0;
> +}

One other option, which might perhaps be slightly more robust, could be to probe
for a BPF helper that was added along with the 5.11 kernel. As Toke noted earlier,
it might not work with out-of-order backports, but if it's good with RHEL in this
specific case, we should be covered for 99% of cases. Potentially, we could then
still try to fall back to the above probing logic?
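
For concreteness, a minimal sketch of such a helper probe, assuming
BPF_FUNC_ktime_get_coarse_ns (added in the same 5.11 release as memcg
accounting) as the marker helper and libbpf's bpf_prog_load() wrapper;
this is not what the patch currently implements:

  /* needs <unistd.h>, <linux/bpf.h> and <bpf/bpf.h> */
  static int probe_memcg_via_helper(void)
  {
  	struct bpf_insn insns[] = {
  		{ .code = BPF_JMP | BPF_CALL, .imm = BPF_FUNC_ktime_get_coarse_ns },
  		{ .code = BPF_JMP | BPF_EXIT },
  	};
  	int fd;

  	fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL",
  			   insns, 2, NULL);
  	if (fd < 0)
  		return 0; /* unknown helper: pre-5.11, fall back to setrlimit() */
  	close(fd);
  	return 1; /* helper exists: memcg-based accounting is available */
  }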

Thanks,
Daniel


* Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14 15:09   ` Daniel Borkmann
@ 2021-12-14 17:51     ` Andrii Nakryiko
  2021-12-14 17:58       ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2021-12-14 17:51 UTC
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Kernel Team,
	Toke Høiland-Jørgensen

On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
> > The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
> > one of the first extremely frustrating gotchas that all new BPF users go
> > through and in some cases have to learn it a very hard way.
> >
> > Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
> > dropped the dependency on memlock and uses memcg-based memory accounting
> > instead. Unfortunately, detecting memcg-based BPF memory accounting is
> > far from trivial (as can be evidenced by this patch), so in practice
> > most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
> >
> > As we move towards libbpf 1.0, it would be good to allow users to forget
> > about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
> > automatically. This patch paves the way forward in this matter. Libbpf
> > will do feature detection of memcg-based accounting, and if detected,
> > will do nothing. But if the kernel is too old, just like BCC, libbpf
> > will automatically increase RLIMIT_MEMLOCK on behalf of user
> > application ([0]).
> >
> > As this is technically a breaking change, during the transition period
> > applications have to opt into libbpf 1.0 mode by setting
> > LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
> > libbpf_set_strict_mode().
> >
> > Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
> > with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
> > nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
> > called before the first bpf_prog_load(), bpf_btf_load(), or
> > bpf_object__load() call, otherwise it has no effect and will return
> > -EBUSY.
> >
> >    [0] Closes: https://github.com/libbpf/libbpf/issues/369
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> [...]
> >
> > +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> > + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
> > + * We use the difference in reporting memlock value in BPF map's fdinfo before
> > + * and after [0] to detect whether memcg accounting is done for BPF subsystem
> > + * or not.
> > + *
> > + * Before the change, memlock value for ARRAY map would be calculated as:
> > + *
> > + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
> > + *   memlock = round_up(memlock, PAGE_SIZE);
> > + *
> > + *
> > + * After, memlock is approximated as:
> > + *
> > + *   memlock = round_up(key_size + value_size, 8) * max_entries;
> > + *   memlock = round_up(memlock, PAGE_SIZE);
> > + *
> > + * In this check we use the fact that sizeof(struct bpf_array) is about 300
> > + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
> > + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
> > + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
> > + * array and doesn't make much difference given 100 byte decrement we use for
> > + * value_size).
> > + *
> > + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> > + */
> > +int probe_memcg_account(void)
> > +{
> > +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
> > +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
> > +     char buf[128];
> > +     union bpf_attr attr;
> > +     int map_fd;
> > +     FILE *f;
> > +
> > +     memset(&attr, 0, map_create_attr_sz);
> > +     attr.map_type = BPF_MAP_TYPE_ARRAY;
> > +     attr.key_size = 4;
> > +     attr.value_size = page_sz - 100;
> > +     attr.max_entries = 1;
> > +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
> > +     if (map_fd < 0)
> > +             return -errno;
> > +
> > +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
> > +     f = fopen(buf, "r");
> > +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
> > +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
> > +                     fclose(f);
> > +                     close(map_fd);
> > +                     return memlock_sz == page_sz ? 1 : 0;
> > +             }
> > +     }
> > +
> > +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
> > +      * we need setrlimit
> > +      */
> > +     if (f)
> > +             fclose(f);
> > +     close(map_fd);
> > +     return 0;
> > +}
>
> One other option which might be slightly more robust perhaps could be to probe
> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
> it might not work with ooo backports, but if its good with RHEL in this specific
> case, we should be covered for 99% of cases. Potentially, we could then still try
> to fallback to the above probing logic?

Ok, I was originally thinking of probing bpf_sock_from_file() (which was
added after the memcg change), but it's a PITA. But I see that slightly
before that (yet in the same 5.11 release) the bpf_ktime_get_coarse_ns()
helper was added, which is the simplest helper to test for. Let me test
that instead; it should be very reliable (apart from out-of-order
backports, but I personally can live with that, given that we fall back
to a safe setrlimit() default).

But I'm not going to do a fallback; the helper probe doesn't add any
extra dependencies and should be very reliable.

>
> Thanks,
> Daniel


* Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14 17:51     ` Andrii Nakryiko
@ 2021-12-14 17:58       ` Alexei Starovoitov
  2021-12-14 18:31         ` Andrii Nakryiko
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2021-12-14 17:58 UTC
  To: Andrii Nakryiko, Daniel Borkmann
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Kernel Team,
	Toke Høiland-Jørgensen

On 12/14/21 9:51 AM, Andrii Nakryiko wrote:
> On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
>>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
>>> one of the first extremely frustrating gotchas that all new BPF users go
>>> through and in some cases have to learn it a very hard way.
>>>
>>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
>>> dropped the dependency on memlock and uses memcg-based memory accounting
>>> instead. Unfortunately, detecting memcg-based BPF memory accounting is
>>> far from trivial (as can be evidenced by this patch), so in practice
>>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
>>>
>>> As we move towards libbpf 1.0, it would be good to allow users to forget
>>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
>>> automatically. This patch paves the way forward in this matter. Libbpf
>>> will do feature detection of memcg-based accounting, and if detected,
>>> will do nothing. But if the kernel is too old, just like BCC, libbpf
>>> will automatically increase RLIMIT_MEMLOCK on behalf of user
>>> application ([0]).
>>>
>>> As this is technically a breaking change, during the transition period
>>> applications have to opt into libbpf 1.0 mode by setting
>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
>>> libbpf_set_strict_mode().
>>>
>>> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
>>> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
>>> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
>>> called before the first bpf_prog_load(), bpf_btf_load(), or
>>> bpf_object__load() call, otherwise it has no effect and will return
>>> -EBUSY.
>>>
>>>     [0] Closes: https://github.com/libbpf/libbpf/issues/369
>>>
>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>> [...]
>>>
>>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
>>> + * We use the difference in reporting memlock value in BPF map's fdinfo before
>>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
>>> + * or not.
>>> + *
>>> + * Before the change, memlock value for ARRAY map would be calculated as:
>>> + *
>>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>> + *
>>> + *
>>> + * After, memlock is approximated as:
>>> + *
>>> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>> + *
>>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
>>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
>>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
>>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
>>> + * array and doesn't make much difference given 100 byte decrement we use for
>>> + * value_size).
>>> + *
>>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
>>> + */
>>> +int probe_memcg_account(void)
>>> +{
>>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
>>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
>>> +     char buf[128];
>>> +     union bpf_attr attr;
>>> +     int map_fd;
>>> +     FILE *f;
>>> +
>>> +     memset(&attr, 0, map_create_attr_sz);
>>> +     attr.map_type = BPF_MAP_TYPE_ARRAY;
>>> +     attr.key_size = 4;
>>> +     attr.value_size = page_sz - 100;
>>> +     attr.max_entries = 1;
>>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
>>> +     if (map_fd < 0)
>>> +             return -errno;
>>> +
>>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
>>> +     f = fopen(buf, "r");
>>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
>>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
>>> +                     fclose(f);
>>> +                     close(map_fd);
>>> +                     return memlock_sz == page_sz ? 1 : 0;
>>> +             }
>>> +     }
>>> +
>>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
>>> +      * we need setrlimit
>>> +      */
>>> +     if (f)
>>> +             fclose(f);
>>> +     close(map_fd);
>>> +     return 0;
>>> +}
>>
>> One other option which might be slightly more robust perhaps could be to probe
>> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
>> it might not work with ooo backports, but if its good with RHEL in this specific
>> case, we should be covered for 99% of cases. Potentially, we could then still try
>> to fallback to the above probing logic?
> 
> Ok, I was originally thinking of probe bpf_sock_from_file() (which was
> added after memcg change), but it's PITA. But I see that slightly
> before that (but in the same 5.11 release) bpf_ktime_get_coarse_ns()

Note that it had fixes after that, so in the kernel version where
it appeared, it may be detected slightly differently than in
the newer kernels (depending on how far the fixes were backported).
IMO I would stick with this array+fdinfo approach.


* Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14 17:58       ` Alexei Starovoitov
@ 2021-12-14 18:31         ` Andrii Nakryiko
  2021-12-14 20:51           ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2021-12-14 18:31 UTC
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Kernel Team, Toke Høiland-Jørgensen

On Tue, Dec 14, 2021 at 9:58 AM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 12/14/21 9:51 AM, Andrii Nakryiko wrote:
> > On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
> >>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
> >>> one of the first extremely frustrating gotchas that all new BPF users go
> >>> through and in some cases have to learn it a very hard way.
> >>>
> >>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
> >>> dropped the dependency on memlock and uses memcg-based memory accounting
> >>> instead. Unfortunately, detecting memcg-based BPF memory accounting is
> >>> far from trivial (as can be evidenced by this patch), so in practice
> >>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
> >>>
> >>> As we move towards libbpf 1.0, it would be good to allow users to forget
> >>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
> >>> automatically. This patch paves the way forward in this matter. Libbpf
> >>> will do feature detection of memcg-based accounting, and if detected,
> >>> will do nothing. But if the kernel is too old, just like BCC, libbpf
> >>> will automatically increase RLIMIT_MEMLOCK on behalf of user
> >>> application ([0]).
> >>>
> >>> As this is technically a breaking change, during the transition period
> >>> applications have to opt into libbpf 1.0 mode by setting
> >>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
> >>> libbpf_set_strict_mode().
> >>>
> >>> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
> >>> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
> >>> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
> >>> called before the first bpf_prog_load(), bpf_btf_load(), or
> >>> bpf_object__load() call, otherwise it has no effect and will return
> >>> -EBUSY.
> >>>
> >>>     [0] Closes: https://github.com/libbpf/libbpf/issues/369
> >>>
> >>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >> [...]
> >>>
> >>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> >>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
> >>> + * We use the difference in reporting memlock value in BPF map's fdinfo before
> >>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
> >>> + * or not.
> >>> + *
> >>> + * Before the change, memlock value for ARRAY map would be calculated as:
> >>> + *
> >>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
> >>> + *   memlock = round_up(memlock, PAGE_SIZE);
> >>> + *
> >>> + *
> >>> + * After, memlock is approximated as:
> >>> + *
> >>> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
> >>> + *   memlock = round_up(memlock, PAGE_SIZE);
> >>> + *
> >>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
> >>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
> >>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
> >>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
> >>> + * array and doesn't make much difference given 100 byte decrement we use for
> >>> + * value_size).
> >>> + *
> >>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> >>> + */
> >>> +int probe_memcg_account(void)
> >>> +{
> >>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
> >>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
> >>> +     char buf[128];
> >>> +     union bpf_attr attr;
> >>> +     int map_fd;
> >>> +     FILE *f;
> >>> +
> >>> +     memset(&attr, 0, map_create_attr_sz);
> >>> +     attr.map_type = BPF_MAP_TYPE_ARRAY;
> >>> +     attr.key_size = 4;
> >>> +     attr.value_size = page_sz - 100;
> >>> +     attr.max_entries = 1;
> >>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
> >>> +     if (map_fd < 0)
> >>> +             return -errno;
> >>> +
> >>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
> >>> +     f = fopen(buf, "r");
> >>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
> >>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
> >>> +                     fclose(f);
> >>> +                     close(map_fd);
> >>> +                     return memlock_sz == page_sz ? 1 : 0;
> >>> +             }
> >>> +     }
> >>> +
> >>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
> >>> +      * we need setrlimit
> >>> +      */
> >>> +     if (f)
> >>> +             fclose(f);
> >>> +     close(map_fd);
> >>> +     return 0;
> >>> +}
> >>
> >> One other option which might be slightly more robust perhaps could be to probe
> >> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
> >> it might not work with ooo backports, but if its good with RHEL in this specific
> >> case, we should be covered for 99% of cases. Potentially, we could then still try
> >> to fallback to the above probing logic?
> >
> > Ok, I was originally thinking of probe bpf_sock_from_file() (which was
> > added after memcg change), but it's PITA. But I see that slightly
> > before that (but in the same 5.11 release) bpf_ktime_get_coarse_ns()
>
> Note that it had fixes after that, so in the kernel version where

You mean 5e0bc3082e2e ("bpf: Forbid bpf_ktime_get_coarse_ns and
bpf_timer_* in tracing progs"), right? This shouldn't matter if I use
BPF_PROG_TYPE_SOCKET_FILTER for probing.

The fdinfo parsing approach has an unnecessary dependency on PROCFS and
is more code (and requires very detailed knowledge of the approximation
and memlock calculation formulas). I like the ktime_get_coarse_ns
approach due to the minimal amount of code and no reliance on any other
kernel config besides CONFIG_BPF_SYSCALL.

But in the end I care about the overall feature, not a particular
implementation of the detection. Should I send the
ktime_get_coarse_ns-based approach, or do we go with this one? I've
implemented and tested all three variants already, so no time savings
are expected either way.

> it appeared it may be detected slightly differently than in
> the newer kernels (depending on how far fixes were backported).
> imo I would stick with this array+fdinfo approach.


* Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14 18:31         ` Andrii Nakryiko
@ 2021-12-14 20:51           ` Alexei Starovoitov
  2021-12-14 21:09             ` Daniel Borkmann
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2021-12-14 20:51 UTC
  To: Andrii Nakryiko
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Kernel Team, Toke Høiland-Jørgensen

On 12/14/21 10:31 AM, Andrii Nakryiko wrote:
> On Tue, Dec 14, 2021 at 9:58 AM Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 12/14/21 9:51 AM, Andrii Nakryiko wrote:
>>> On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>
>>>> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
>>>>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
>>>>> one of the first extremely frustrating gotchas that all new BPF users go
>>>>> through and in some cases have to learn it a very hard way.
>>>>>
>>>>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
>>>>> dropped the dependency on memlock and uses memcg-based memory accounting
>>>>> instead. Unfortunately, detecting memcg-based BPF memory accounting is
>>>>> far from trivial (as can be evidenced by this patch), so in practice
>>>>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
>>>>>
>>>>> As we move towards libbpf 1.0, it would be good to allow users to forget
>>>>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
>>>>> automatically. This patch paves the way forward in this matter. Libbpf
>>>>> will do feature detection of memcg-based accounting, and if detected,
>>>>> will do nothing. But if the kernel is too old, just like BCC, libbpf
>>>>> will automatically increase RLIMIT_MEMLOCK on behalf of user
>>>>> application ([0]).
>>>>>
>>>>> As this is technically a breaking change, during the transition period
>>>>> applications have to opt into libbpf 1.0 mode by setting
>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
>>>>> libbpf_set_strict_mode().
>>>>>
>>>>> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
>>>>> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
>>>>> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
>>>>> called before the first bpf_prog_load(), bpf_btf_load(), or
>>>>> bpf_object__load() call, otherwise it has no effect and will return
>>>>> -EBUSY.
>>>>>
>>>>>      [0] Closes: https://github.com/libbpf/libbpf/issues/369
>>>>>
>>>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>>> [...]
>>>>>
>>>>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>>>>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
>>>>> + * We use the difference in reporting memlock value in BPF map's fdinfo before
>>>>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
>>>>> + * or not.
>>>>> + *
>>>>> + * Before the change, memlock value for ARRAY map would be calculated as:
>>>>> + *
>>>>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
>>>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>>>> + *
>>>>> + *
>>>>> + * After, memlock is approximated as:
>>>>> + *
>>>>> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
>>>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>>>> + *
>>>>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
>>>>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
>>>>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
>>>>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
>>>>> + * array and doesn't make much difference given 100 byte decrement we use for
>>>>> + * value_size).
>>>>> + *
>>>>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
>>>>> + */
>>>>> +int probe_memcg_account(void)
>>>>> +{
>>>>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
>>>>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
>>>>> +     char buf[128];
>>>>> +     union bpf_attr attr;
>>>>> +     int map_fd;
>>>>> +     FILE *f;
>>>>> +
>>>>> +     memset(&attr, 0, map_create_attr_sz);
>>>>> +     attr.map_type = BPF_MAP_TYPE_ARRAY;
>>>>> +     attr.key_size = 4;
>>>>> +     attr.value_size = page_sz - 100;
>>>>> +     attr.max_entries = 1;
>>>>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
>>>>> +     if (map_fd < 0)
>>>>> +             return -errno;
>>>>> +
>>>>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
>>>>> +     f = fopen(buf, "r");
>>>>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
>>>>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
>>>>> +                     fclose(f);
>>>>> +                     close(map_fd);
>>>>> +                     return memlock_sz == page_sz ? 1 : 0;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
>>>>> +      * we need setrlimit
>>>>> +      */
>>>>> +     if (f)
>>>>> +             fclose(f);
>>>>> +     close(map_fd);
>>>>> +     return 0;
>>>>> +}
>>>>
>>>> One other option which might be slightly more robust perhaps could be to probe
>>>> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
>>>> it might not work with ooo backports, but if its good with RHEL in this specific
>>>> case, we should be covered for 99% of cases. Potentially, we could then still try
>>>> to fallback to the above probing logic?
>>>
>>> Ok, I was originally thinking of probe bpf_sock_from_file() (which was
>>> added after memcg change), but it's PITA. But I see that slightly
>>> before that (but in the same 5.11 release) bpf_ktime_get_coarse_ns()
>>
>> Note that it had fixes after that, so in the kernel version where
> 
> You mean 5e0bc3082e2e ("bpf: Forbid bpf_ktime_get_coarse_ns and
> bpf_timer_* in tracing progs"), right? This shouldn't matter if I use
> BPF_PROG_TYPE_SOCKET_FILTER for probing.

hmm. I guess we allow it in unpriv too.

> fdinfo parsing approach has unnecessary dependency on PROCFS and is
> more code (and very detailed knowledge of approximation and memlock
> calculation formula). I like ktime_get_coarse_ns approach due to
> minimal amount of code and no reliance on any other kernel config
> besides CONFIG_BPF_SYSCALL.
> 
> But in the end I care about the overall feature, not a particular
> implementation of the detection. Should I send
> ktime_get_coarse_ns-based approach or we go with this one? I've
> implemented and tested all three variants already, so no time savings
> are expected either way.

Either v3 or v4 is fine by me. Let Daniel pick.


* Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
  2021-12-14 20:51           ` Alexei Starovoitov
@ 2021-12-14 21:09             ` Daniel Borkmann
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Borkmann @ 2021-12-14 21:09 UTC
  To: Alexei Starovoitov, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Kernel Team,
	Toke Høiland-Jørgensen

On 12/14/21 9:51 PM, Alexei Starovoitov wrote:
> On 12/14/21 10:31 AM, Andrii Nakryiko wrote:
>> On Tue, Dec 14, 2021 at 9:58 AM Alexei Starovoitov <ast@fb.com> wrote:
>>> On 12/14/21 9:51 AM, Andrii Nakryiko wrote:
>>>> On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
>>>>>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
>>>>>> one of the first extremely frustrating gotchas that all new BPF users go
>>>>>> through and in some cases have to learn it a very hard way.
>>>>>>
>>>>>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
>>>>>> dropped the dependency on memlock and uses memcg-based memory accounting
>>>>>> instead. Unfortunately, detecting memcg-based BPF memory accounting is
>>>>>> far from trivial (as can be evidenced by this patch), so in practice
>>>>>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
>>>>>>
>>>>>> As we move towards libbpf 1.0, it would be good to allow users to forget
>>>>>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
>>>>>> automatically. This patch paves the way forward in this matter. Libbpf
>>>>>> will do feature detection of memcg-based accounting, and if detected,
>>>>>> will do nothing. But if the kernel is too old, just like BCC, libbpf
>>>>>> will automatically increase RLIMIT_MEMLOCK on behalf of user
>>>>>> application ([0]).
>>>>>>
>>>>>> As this is technically a breaking change, during the transition period
>>>>>> applications have to opt into libbpf 1.0 mode by setting
>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
>>>>>> libbpf_set_strict_mode().
>>>>>>
>>>>>> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
>>>>>> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
>>>>>> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
>>>>>> called before the first bpf_prog_load(), bpf_btf_load(), or
>>>>>> bpf_object__load() call, otherwise it has no effect and will return
>>>>>> -EBUSY.
>>>>>>
>>>>>>      [0] Closes: https://github.com/libbpf/libbpf/issues/369
>>>>>>
>>>>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>>>> [...]
>>>>>>
>>>>>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>>>>>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
>>>>>> + * We use the difference in reporting memlock value in BPF map's fdinfo before
>>>>>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
>>>>>> + * or not.
>>>>>> + *
>>>>>> + * Before the change, memlock value for ARRAY map would be calculated as:
>>>>>> + *
>>>>>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
>>>>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>>>>> + *
>>>>>> + *
>>>>>> + * After, memlock is approximated as:
>>>>>> + *
>>>>>> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
>>>>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>>>>> + *
>>>>>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
>>>>>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
>>>>>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
>>>>>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
>>>>>> + * array and doesn't make much difference given 100 byte decrement we use for
>>>>>> + * value_size).
>>>>>> + *
>>>>>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
>>>>>> + */
>>>>>> +int probe_memcg_account(void)
>>>>>> +{
>>>>>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
>>>>>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
>>>>>> +     char buf[128];
>>>>>> +     union bpf_attr attr;
>>>>>> +     int map_fd;
>>>>>> +     FILE *f;
>>>>>> +
>>>>>> +     memset(&attr, 0, map_create_attr_sz);
>>>>>> +     attr.map_type = BPF_MAP_TYPE_ARRAY;
>>>>>> +     attr.key_size = 4;
>>>>>> +     attr.value_size = page_sz - 100;
>>>>>> +     attr.max_entries = 1;
>>>>>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
>>>>>> +     if (map_fd < 0)
>>>>>> +             return -errno;
>>>>>> +
>>>>>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
>>>>>> +     f = fopen(buf, "r");
>>>>>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
>>>>>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
>>>>>> +                     fclose(f);
>>>>>> +                     close(map_fd);
>>>>>> +                     return memlock_sz == page_sz ? 1 : 0;
>>>>>> +             }
>>>>>> +     }
>>>>>> +
>>>>>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
>>>>>> +      * we need setrlimit
>>>>>> +      */
>>>>>> +     if (f)
>>>>>> +             fclose(f);
>>>>>> +     close(map_fd);
>>>>>> +     return 0;
>>>>>> +}
>>>>>
>>>>> One other option which might be slightly more robust perhaps could be to probe
>>>>> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
>>>>> it might not work with ooo backports, but if its good with RHEL in this specific
>>>>> case, we should be covered for 99% of cases. Potentially, we could then still try
>>>>> to fallback to the above probing logic?
>>>>
>>>> Ok, I was originally thinking of probe bpf_sock_from_file() (which was
>>>> added after memcg change), but it's PITA. But I see that slightly
>>>> before that (but in the same 5.11 release) bpf_ktime_get_coarse_ns()
>>>
>>> Note that it had fixes after that, so in the kernel version where
>>
>> You mean 5e0bc3082e2e ("bpf: Forbid bpf_ktime_get_coarse_ns and
>> bpf_timer_* in tracing progs"), right? This shouldn't matter if I use
>> BPF_PROG_TYPE_SOCKET_FILTER for probing.
> 
> hmm. I guess we allow it in unpriv too.
> 
>> fdinfo parsing approach has unnecessary dependency on PROCFS and is
>> more code (and very detailed knowledge of approximation and memlock
>> calculation formula). I like ktime_get_coarse_ns approach due to
>> minimal amount of code and no reliance on any other kernel config
>> besides CONFIG_BPF_SYSCALL.
>>
>> But in the end I care about the overall feature, not a particular
>> implementation of the detection. Should I send
>> ktime_get_coarse_ns-based approach or we go with this one? I've
>> implemented and tested all three variants already, so no time savings
>> are expected either way.
> 
> Either v3 or v4 are fine by me. Let Daniel pick.

Same here, I can take in v4, given its fewer dependencies.

Thanks,
Daniel
