* [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs
  2022-04-28 16:53 [PATCH bpf-next 0/5] sleepable uprobe support Delyan Kratunov
@ 2022-04-28 16:53 ` Delyan Kratunov
  2022-04-28 18:33   ` Andrii Nakryiko
  2022-04-28 16:53 ` [PATCH bpf-next 5/5] selftests/bpf: add tests for sleepable kprobes and uprobes Delyan Kratunov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 16:53 UTC (permalink / raw)
  To: daniel, ast, andrii, bpf

Add section mappings for uprobe.s and kprobe.s programs. The latter
cannot currently attach but they're still useful to open and load in
order to validate that prohibition.
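
For reference, a minimal sketch of what a program using the new section
names would look like (the library path and symbols below are made up;
kprobe.s/ programs load but cannot attach, per patch 3):

#include <linux/ptrace.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("uprobe.s//usr/lib/libc.so.6:malloc")
int handle_malloc_sleepable(struct pt_regs *ctx)
{
	/* sleepable context: helpers like bpf_copy_from_user() are allowed */
	return 0;
}

SEC("uretprobe.s//usr/lib/libc.so.6:malloc")
int handle_malloc_ret_sleepable(struct pt_regs *ctx)
{
	return 0;
}

char _license[] SEC("license") = "GPL";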

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 tools/lib/bpf/libbpf.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 9a213aaaac8a..9e89a478d40e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8692,9 +8692,12 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("sk_reuseport/migrate",	SK_REUSEPORT, BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
 	SEC_DEF("sk_reuseport",		SK_REUSEPORT, BPF_SK_REUSEPORT_SELECT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
 	SEC_DEF("kprobe/",		KPROBE,	0, SEC_NONE, attach_kprobe),
+	SEC_DEF("kprobe.s/",		KPROBE,	0, SEC_SLEEPABLE, attach_kprobe),
 	SEC_DEF("uprobe+",		KPROBE,	0, SEC_NONE, attach_uprobe),
+	SEC_DEF("uprobe.s+",		KPROBE,	0, SEC_SLEEPABLE, attach_uprobe),
 	SEC_DEF("kretprobe/",		KPROBE, 0, SEC_NONE, attach_kprobe),
 	SEC_DEF("uretprobe+",		KPROBE, 0, SEC_NONE, attach_uprobe),
+	SEC_DEF("uretprobe.s+",		KPROBE, 0, SEC_SLEEPABLE, attach_uprobe),
 	SEC_DEF("kprobe.multi/",	KPROBE,	BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
 	SEC_DEF("kretprobe.multi/",	KPROBE,	BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
 	SEC_DEF("usdt+",		KPROBE,	0, SEC_NONE, attach_usdt),
@@ -10432,13 +10435,18 @@ static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf
 	const char *func_name;
 	char *func;
 	int n;
+	bool sleepable = false;
 
 	opts.retprobe = str_has_pfx(prog->sec_name, "kretprobe/");
+	sleepable = str_has_pfx(prog->sec_name, "kprobe.s/");
 	if (opts.retprobe)
 		func_name = prog->sec_name + sizeof("kretprobe/") - 1;
+	else if (sleepable)
+		func_name = prog->sec_name + sizeof("kprobe.s/") - 1;
 	else
 		func_name = prog->sec_name + sizeof("kprobe/") - 1;
 
+
 	n = sscanf(func_name, "%m[a-zA-Z0-9_.]+%li", &func, &offset);
 	if (n < 1) {
 		pr_warn("kprobe name is invalid: %s\n", func_name);
@@ -10957,7 +10965,7 @@ static int attach_uprobe(const struct bpf_program *prog, long cookie, struct bpf
 		break;
 	case 3:
 	case 4:
-		opts.retprobe = strcmp(probe_type, "uretprobe") == 0;
+		opts.retprobe = str_has_pfx(probe_type, "uretprobe");
 		if (opts.retprobe && offset != 0) {
 			pr_warn("prog '%s': uretprobes do not support offset specification\n",
 				prog->name);
-- 
2.35.1


* [PATCH bpf-next 5/5] selftests/bpf: add tests for sleepable kprobes and uprobes
  2022-04-28 16:53 [PATCH bpf-next 0/5] sleepable uprobe support Delyan Kratunov
  2022-04-28 16:53 ` [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs Delyan Kratunov
@ 2022-04-28 16:53 ` Delyan Kratunov
  2022-04-28 18:41   ` Andrii Nakryiko
  2022-04-28 16:53 ` [PATCH bpf-next 1/5] bpf: move bpf_prog to bpf.h Delyan Kratunov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 16:53 UTC (permalink / raw)
  To: daniel, ast, andrii, bpf

Add tests that ensure sleepable kprobe programs cannot attach.

Also attach both sleepable and non-sleepable uprobe programs to the same
location (i.e. same bpf_prog_array).

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 .../selftests/bpf/prog_tests/attach_probe.c   | 35 +++++++++++++++
 .../selftests/bpf/progs/test_attach_probe.c   | 44 +++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index c0c6d410751d..c5c601537eea 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -17,6 +17,12 @@ static void trigger_func2(void)
 	asm volatile ("");
 }
 
+/* attach point for byname sleepable uprobe */
+static void trigger_func3(void)
+{
+	asm volatile ("");
+}
+
 void test_attach_probe(void)
 {
 	DECLARE_LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts);
@@ -143,6 +149,28 @@ void test_attach_probe(void)
 	if (!ASSERT_OK_PTR(skel->links.handle_uretprobe_byname2, "attach_uretprobe_byname2"))
 		goto cleanup;
 
+	/* sleepable kprobes should not attach successfully */
+	skel->links.handle_kprobe_sleepable = bpf_program__attach(skel->progs.handle_kprobe_sleepable);
+	if (!ASSERT_NULL(skel->links.handle_kprobe_sleepable, "attach_kprobe_sleepable"))
+		goto cleanup;
+
+	/* test sleepable uprobe and uretprobe variants */
+	skel->links.handle_uprobe_byname3_sleepable = bpf_program__attach(skel->progs.handle_uprobe_byname3_sleepable);
+	if (!ASSERT_OK_PTR(skel->links.handle_uprobe_byname3_sleepable, "attach_uprobe_byname3_sleepable"))
+		goto cleanup;
+
+	skel->links.handle_uprobe_byname3 = bpf_program__attach(skel->progs.handle_uprobe_byname3);
+	if (!ASSERT_OK_PTR(skel->links.handle_uprobe_byname3, "attach_uprobe_byname3"))
+		goto cleanup;
+
+	skel->links.handle_uretprobe_byname3_sleepable = bpf_program__attach(skel->progs.handle_uretprobe_byname3_sleepable);
+	if (!ASSERT_OK_PTR(skel->links.handle_uretprobe_byname3_sleepable, "attach_uretprobe_byname3_sleepable"))
+		goto cleanup;
+
+	skel->links.handle_uretprobe_byname3 = bpf_program__attach(skel->progs.handle_uretprobe_byname3);
+	if (!ASSERT_OK_PTR(skel->links.handle_uretprobe_byname3, "attach_uretprobe_byname3"))
+		goto cleanup;
+
 	/* trigger & validate kprobe && kretprobe */
 	usleep(1);
 
@@ -156,6 +184,9 @@ void test_attach_probe(void)
 	/* trigger & validate uprobe attached by name */
 	trigger_func2();
 
+	/* trigger & validate sleepable uprobe attached by name */
+	trigger_func3();
+
 	ASSERT_EQ(skel->bss->kprobe_res, 1, "check_kprobe_res");
 	ASSERT_EQ(skel->bss->kretprobe_res, 2, "check_kretprobe_res");
 	ASSERT_EQ(skel->bss->uprobe_res, 3, "check_uprobe_res");
@@ -164,6 +195,10 @@ void test_attach_probe(void)
 	ASSERT_EQ(skel->bss->uretprobe_byname_res, 6, "check_uretprobe_byname_res");
 	ASSERT_EQ(skel->bss->uprobe_byname2_res, 7, "check_uprobe_byname2_res");
 	ASSERT_EQ(skel->bss->uretprobe_byname2_res, 8, "check_uretprobe_byname2_res");
+	ASSERT_EQ(skel->bss->uprobe_byname3_sleepable_res, 9, "check_uprobe_byname3_sleepable_res");
+	ASSERT_EQ(skel->bss->uprobe_byname3_res, 10, "check_uprobe_byname3_res");
+	ASSERT_EQ(skel->bss->uretprobe_byname3_sleepable_res, 11, "check_uretprobe_byname3_sleepable_res");
+	ASSERT_EQ(skel->bss->uretprobe_byname3_res, 12, "check_uretprobe_byname3_res");
 
 cleanup:
 	test_attach_probe__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/test_attach_probe.c b/tools/testing/selftests/bpf/progs/test_attach_probe.c
index af994d16bb10..265a23af74d4 100644
--- a/tools/testing/selftests/bpf/progs/test_attach_probe.c
+++ b/tools/testing/selftests/bpf/progs/test_attach_probe.c
@@ -14,6 +14,10 @@ int uprobe_byname_res = 0;
 int uretprobe_byname_res = 0;
 int uprobe_byname2_res = 0;
 int uretprobe_byname2_res = 0;
+int uprobe_byname3_sleepable_res = 0;
+int uprobe_byname3_res = 0;
+int uretprobe_byname3_sleepable_res = 0;
+int uretprobe_byname3_res = 0;
 
 SEC("kprobe/sys_nanosleep")
 int handle_kprobe(struct pt_regs *ctx)
@@ -22,6 +26,13 @@ int handle_kprobe(struct pt_regs *ctx)
 	return 0;
 }
 
+SEC("kprobe.s/sys_nanosleep")
+int handle_kprobe_sleepable(struct pt_regs *ctx)
+{
+	kprobe_res = 2;
+	return 0;
+}
+
 SEC("kretprobe/sys_nanosleep")
 int BPF_KRETPROBE(handle_kretprobe)
 {
@@ -76,4 +87,37 @@ int handle_uretprobe_byname2(struct pt_regs *ctx)
 	return 0;
 }
 
+SEC("uprobe.s//proc/self/exe:trigger_func3")
+int handle_uprobe_byname3_sleepable(struct pt_regs *ctx)
+{
+	uprobe_byname3_sleepable_res = 9;
+	return 0;
+}
+
+/**
+ * same target as the uprobe.s above to force sleepable and non-sleepable
+ * programs in the same bpf_prog_array
+ */
+SEC("uprobe//proc/self/exe:trigger_func3")
+int handle_uprobe_byname3(struct pt_regs *ctx)
+{
+	uprobe_byname3_res = 10;
+	return 0;
+}
+
+SEC("uretprobe.s//proc/self/exe:trigger_func3")
+int handle_uretprobe_byname3_sleepable(struct pt_regs *ctx)
+{
+	uretprobe_byname3_sleepable_res = 11;
+	return 0;
+}
+
+SEC("uretprobe//proc/self/exe:trigger_func3")
+int handle_uretprobe_byname3(struct pt_regs *ctx)
+{
+	uretprobe_byname3_res = 12;
+	return 0;
+}
+
+
 char _license[] SEC("license") = "GPL";
-- 
2.35.1


* [PATCH bpf-next 0/5] sleepable uprobe support
@ 2022-04-28 16:53 Delyan Kratunov
  2022-04-28 16:53 ` [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs Delyan Kratunov
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 16:53 UTC (permalink / raw)
  To: daniel, ast, andrii, bpf; +Cc: paulmck

This series implements support for sleepable uprobe programs.
Key work is in patches 2 and 3, the rest is plumbing and tests.

The main observation is that the only obstacle in the way of sleepable uprobe
programs is not the uprobe infrastructure, which already runs in a user context,
but the rcu usage around bpf_prog_array.

Details are in patch 2 but the tl;dr is that we chain tasks_trace and normal rcu
grace periods when releasing the array to accommodate users of either rcu type.
This introduces latency on non-sleepable users but that's deemed acceptable, given
recent benchmarks by Andrii [1]. We're a couple of orders of magnitude under
the rate of bpf_prog_array churn that would raise flags (~1MM/s per Paul).

  [1]: https://lore.kernel.org/bpf/CAEf4BzbpjN6ca7D9KOTiFPOoBYkciYvTz0UJNp5c-_3ptm=Mrg@mail.gmail.com/
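
For quick reference, the shape of the chained free path from patch 2:

static void __bpf_prog_array_free_rcu(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	/* normal grace period chained after the tasks_trace one */
	kfree_rcu(progs, rcu);
}

void bpf_prog_array_free(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_rcu);
}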

Delyan Kratunov (5):
  bpf: move bpf_prog to bpf.h
  bpf: implement sleepable uprobes by chaining tasks and normal rcu
  bpf: allow sleepable uprobe programs to attach
  libbpf: add support for sleepable kprobe and uprobe programs
  selftests/bpf: add tests for sleepable kprobes and uprobes

 include/linux/bpf.h                           | 96 +++++++++++++++++++
 include/linux/filter.h                        | 34 -------
 include/linux/trace_events.h                  |  1 +
 kernel/bpf/core.c                             | 10 +-
 kernel/bpf/syscall.c                          |  8 ++
 kernel/bpf/verifier.c                         |  4 +-
 kernel/trace/bpf_trace.c                      | 23 +++++
 kernel/trace/trace_uprobe.c                   |  4 +-
 tools/lib/bpf/libbpf.c                        | 10 +-
 .../selftests/bpf/prog_tests/attach_probe.c   | 35 +++++++
 .../selftests/bpf/progs/test_attach_probe.c   | 44 +++++++++
 11 files changed, 228 insertions(+), 41 deletions(-)

--
2.35.1


* [PATCH bpf-next 1/5] bpf: move bpf_prog to bpf.h
  2022-04-28 16:53 [PATCH bpf-next 0/5] sleepable uprobe support Delyan Kratunov
  2022-04-28 16:53 ` [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs Delyan Kratunov
  2022-04-28 16:53 ` [PATCH bpf-next 5/5] selftests/bpf: add tests for sleepable kprobes and uprobes Delyan Kratunov
@ 2022-04-28 16:53 ` Delyan Kratunov
  2022-04-28 16:54 ` [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu Delyan Kratunov
  2022-04-28 16:54 ` [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach Delyan Kratunov
  4 siblings, 0 replies; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 16:53 UTC (permalink / raw)
  To: daniel, ast, andrii, bpf

In order to add a version of bpf_prog_run_array which accesses the
bpf_prog->aux member, we need bpf_prog to be more than a forward
declaration inside bpf.h.

Given that filter.h already includes bpf.h, this merely reorders
the type declarations for filter.h users. bpf.h users now have access to
bpf_prog internals.

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 include/linux/bpf.h    | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/filter.h | 34 ----------------------------------
 2 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7bf441563ffc..7d7f4806f5fb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -5,6 +5,7 @@
 #define _LINUX_BPF_H 1

 #include <uapi/linux/bpf.h>
+#include <uapi/linux/filter.h>

 #include <linux/workqueue.h>
 #include <linux/file.h>
@@ -22,6 +23,7 @@
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
+#include <linux/stddef.h>
 #include <linux/bpfptr.h>

 struct bpf_verifier_env;
@@ -1019,6 +1021,40 @@ struct bpf_prog_aux {
 	};
 };

+struct bpf_prog {
+	u16			pages;		/* Number of allocated pages */
+	u16			jited:1,	/* Is our filter JIT'ed? */
+				jit_requested:1,/* archs need to JIT the prog */
+				gpl_compatible:1, /* Is filter GPL compatible? */
+				cb_access:1,	/* Is control block accessed? */
+				dst_needed:1,	/* Do we need dst entry? */
+				blinding_requested:1, /* needs constant blinding */
+				blinded:1,	/* Was blinded */
+				is_func:1,	/* program is a bpf function */
+				kprobe_override:1, /* Do we override a kprobe? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
+				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
+				call_get_func_ip:1, /* Do we call get_func_ip() */
+				tstamp_type_access:1; /* Accessed __sk_buff->tstamp_type */
+	enum bpf_prog_type	type;		/* Type of BPF program */
+	enum bpf_attach_type	expected_attach_type; /* For some prog types */
+	u32			len;		/* Number of filter blocks */
+	u32			jited_len;	/* Size of jited insns in bytes */
+	u8			tag[BPF_TAG_SIZE];
+	struct bpf_prog_stats __percpu *stats;
+	int __percpu		*active;
+	unsigned int		(*bpf_func)(const void *ctx,
+					    const struct bpf_insn *insn);
+	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
+	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
+	/* Instructions for interpreter */
+	union {
+		DECLARE_FLEX_ARRAY(struct sock_filter, insns);
+		DECLARE_FLEX_ARRAY(struct bpf_insn, insnsi);
+	};
+};
+
 struct bpf_array_aux {
 	/* Programs with direct jumps into programs part of this array. */
 	struct list_head poke_progs;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ed0c0ff42ad5..d0cbb31b1b4d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -559,40 +559,6 @@ struct bpf_prog_stats {
 	struct u64_stats_sync syncp;
 } __aligned(2 * sizeof(u64));

-struct bpf_prog {
-	u16			pages;		/* Number of allocated pages */
-	u16			jited:1,	/* Is our filter JIT'ed? */
-				jit_requested:1,/* archs need to JIT the prog */
-				gpl_compatible:1, /* Is filter GPL compatible? */
-				cb_access:1,	/* Is control block accessed? */
-				dst_needed:1,	/* Do we need dst entry? */
-				blinding_requested:1, /* needs constant blinding */
-				blinded:1,	/* Was blinded */
-				is_func:1,	/* program is a bpf function */
-				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1, /* callchain buffer allocated? */
-				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
-				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
-				call_get_func_ip:1, /* Do we call get_func_ip() */
-				tstamp_type_access:1; /* Accessed __sk_buff->tstamp_type */
-	enum bpf_prog_type	type;		/* Type of BPF program */
-	enum bpf_attach_type	expected_attach_type; /* For some prog types */
-	u32			len;		/* Number of filter blocks */
-	u32			jited_len;	/* Size of jited insns in bytes */
-	u8			tag[BPF_TAG_SIZE];
-	struct bpf_prog_stats __percpu *stats;
-	int __percpu		*active;
-	unsigned int		(*bpf_func)(const void *ctx,
-					    const struct bpf_insn *insn);
-	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
-	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
-	/* Instructions for interpreter */
-	union {
-		DECLARE_FLEX_ARRAY(struct sock_filter, insns);
-		DECLARE_FLEX_ARRAY(struct bpf_insn, insnsi);
-	};
-};
-
 struct sk_filter {
 	refcount_t	refcnt;
 	struct rcu_head	rcu;
--
2.35.1


* [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 16:53 [PATCH bpf-next 0/5] sleepable uprobe support Delyan Kratunov
                   ` (2 preceding siblings ...)
  2022-04-28 16:53 ` [PATCH bpf-next 1/5] bpf: move bpf_prog to bpf.h Delyan Kratunov
@ 2022-04-28 16:54 ` Delyan Kratunov
  2022-04-28 18:19   ` Andrii Nakryiko
  2022-04-28 16:54 ` [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach Delyan Kratunov
  4 siblings, 1 reply; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 16:54 UTC (permalink / raw)
  To: daniel, ast, andrii, bpf; +Cc: paulmck

uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
user context. The primary obstacle to sleepable bpf uprobe programs is
therefore on the bpf side.

Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu and runs with disabled preemption. In order for uprobe bpf programs
to become actually sleepable, we need it to be protected by the tasks_trace
rcu flavor instead (and kfree() called after a corresponding grace period).

One way to achieve this is by tracking an array-has-contained-sleepable-prog
flag in bpf_prog_array and switching rcu flavors based on it. However, this
is deemed somewhat unwieldy and the rcu flavor transition would be hard
to reason about.

Instead, based on Alexei's proposal, we change the free path for
bpf_prog_array to chain tasks_trace and normal grace periods
one after the other. Users who iterate under a tasks_trace read section would
be safe, as would users who iterate under normal read sections (from
non-sleepable locations). The downside is that we take the tasks_trace latency
unconditionally but that's deemed acceptable under expected workloads.

The other interesting implication is wrt non-sleepable uprobe
programs. Because they need access to dynamically sized rcu-protected
maps, we conditionally disable preemption and take an rcu read section
around them, in addition to the overarching tasks_trace section.
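
For context, the kind of program this enables looks roughly like the
sketch below (the traced binary, symbol, and argument layout are
illustrative only; built with -D__TARGET_ARCH_* so the PT_REGS macros
resolve, as the selftests do):

#include <linux/ptrace.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char prompt[64];

SEC("uprobe.s//usr/bin/bash:readline")
int handle_readline(struct pt_regs *ctx)
{
	/* bpf_copy_from_user() may fault and sleep, so it is only
	 * allowed from sleepable programs
	 */
	bpf_copy_from_user(prompt, sizeof(prompt),
			   (const void *)PT_REGS_PARM1(ctx));
	return 0;
}

char _license[] SEC("license") = "GPL";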

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
 include/linux/trace_events.h |  1 +
 kernel/bpf/core.c            | 10 +++++-
 kernel/trace/bpf_trace.c     | 23 ++++++++++++++
 kernel/trace/trace_uprobe.c  |  4 +--
 5 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7d7f4806f5fb..d8692d7176ce 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -25,6 +25,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/stddef.h>
 #include <linux/bpfptr.h>
+#include <linux/rcupdate_trace.h>

 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -1379,6 +1380,65 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
 	return ret;
 }

+/**
+ * Notes on RCU design for bpf_prog_arrays containing sleepable programs:
+ *
+ * We use the trace RCU flavor read section to protect the bpf_prog_array overall.
+ * Because there are users of bpf_prog_array that only use the normal
+ * rcu flavor, to avoid tracking a used-for-sleepable-programs bit in the
+ * array, we chain call_rcu_tasks_trace and kfree_rcu on the free path.
+ * This is suboptimal if the array is never used for sleepable programs
+ * but not a cause of concern under expected workloads.
+ *
+ * In the case where a non-sleepable program is inside the array,
+ * we take the rcu read section and disable preemption for that program
+ * alone, so it can access rcu-protected dynamically sized maps.
+ */
+static __always_inline u32
+bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
+			     const void *ctx, bpf_prog_run_fn run_prog)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	const struct bpf_prog_array *array;
+	struct bpf_run_ctx *old_run_ctx;
+	struct bpf_trace_run_ctx run_ctx;
+	u32 ret = 1;
+
+	might_fault();
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "shouldn't be holding rcu lock");
+
+	migrate_disable();
+	rcu_read_lock_trace();
+
+	array = rcu_dereference(array_rcu);
+	if (unlikely(!array))
+		goto out;
+	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
+	item = &array->items[0];
+	while ((prog = READ_ONCE(item->prog))) {
+		if (!prog->aux->sleepable) {
+			preempt_disable();
+			rcu_read_lock();
+		}
+
+		run_ctx.bpf_cookie = item->bpf_cookie;
+		ret &= run_prog(prog, ctx);
+		item++;
+
+		if (!prog->aux->sleepable) {
+			rcu_read_unlock();
+			preempt_enable();
+		}
+	}
+	bpf_reset_run_ctx(old_run_ctx);
+out:
+	rcu_read_unlock_trace();
+	migrate_enable();
+	return ret;
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 extern struct mutex bpf_stats_enabled_mutex;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index e6e95a9f07a5..d45889f1210d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -736,6 +736,7 @@ trace_trigger_soft_disabled(struct trace_event_file *file)

 #ifdef CONFIG_BPF_EVENTS
 unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
+unsigned int uprobe_call_bpf(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 13e9dbeeedf3..a4c301ef2ba1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2261,11 +2261,19 @@ struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags)
 	return &bpf_empty_prog_array.hdr;
 }

+static void __bpf_prog_array_free_rcu(struct rcu_head *rcu)
+{
+	struct bpf_prog_array *progs;
+
+	progs = container_of(rcu, struct bpf_prog_array, rcu);
+	kfree_rcu(progs, rcu);
+}
+
 void bpf_prog_array_free(struct bpf_prog_array *progs)
 {
 	if (!progs || progs == &bpf_empty_prog_array.hdr)
 		return;
-	kfree_rcu(progs, rcu);
+	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_rcu);
 }

 int bpf_prog_array_length(struct bpf_prog_array *array)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f15b826f9899..6a2291d4b5a0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -140,6 +140,29 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 	return ret;
 }

+unsigned int uprobe_call_bpf(struct trace_event_call *call, void *ctx)
+{
+	unsigned int ret;
+
+	/*
+	 * Instead of moving rcu_read_lock/rcu_dereference/rcu_read_unlock
+	 * to all call sites, we did a bpf_prog_array_valid() there to check
+	 * whether call->prog_array is empty or not, which is
+	 * a heuristic to speed up execution.
+	 *
+	 * If bpf_prog_array_valid() fetched prog_array was
+	 * non-NULL, we go into uprobe_call_bpf() and do the actual
+	 * proper rcu_dereference() under RCU trace lock.
+	 * If it turns out that prog_array is NULL then, we bail out.
+	 * For the opposite, if the bpf_prog_array_valid() fetched pointer
+	 * was NULL, you'll skip the prog_array with the risk of missing
+	 * out of events when it was updated in between this and the
+	 * rcu_dereference() which is accepted risk.
+	 */
+	ret = bpf_prog_run_array_sleepable(call->prog_array, ctx, bpf_prog_run);
+	return ret;
+}
+
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
 BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
 {
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9711589273cd..3eb48897d15b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1346,9 +1346,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	if (bpf_prog_array_valid(call)) {
 		u32 ret;

-		preempt_disable();
-		ret = trace_call_bpf(call, regs);
-		preempt_enable();
+		ret = uprobe_call_bpf(call, regs);
 		if (!ret)
 			return;
 	}
--
2.35.1


* [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach
  2022-04-28 16:53 [PATCH bpf-next 0/5] sleepable uprobe support Delyan Kratunov
                   ` (3 preceding siblings ...)
  2022-04-28 16:54 ` [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu Delyan Kratunov
@ 2022-04-28 16:54 ` Delyan Kratunov
  2022-04-28 18:22   ` Andrii Nakryiko
  4 siblings, 1 reply; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 16:54 UTC (permalink / raw)
  To: daniel, ast, andrii, bpf

uprobe and kprobe programs have the same program type, KPROBE, which is
currently not allowed to load sleepable programs.

To avoid adding a new UPROBE type, we instead allow sleepable KPROBE
programs to load and defer the is-it-actually-a-uprobe-program check
to attachment time, where we're already validating the corresponding
perf_event.

A corollary of this patch is that you can now load a sleepable kprobe
program but cannot attach it.

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 kernel/bpf/syscall.c  | 8 ++++++++
 kernel/bpf/verifier.c | 4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e9e3e49c0eb7..3ce923f489d7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3009,6 +3009,14 @@ static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *pro
 	}

 	event = perf_file->private_data;
+	if (prog->aux->sleepable) {
+		if (!event->tp_event || (event->tp_event->flags & TRACE_EVENT_FL_UPROBE) == 0) {
+			err = -EINVAL;
+			bpf_link_cleanup(&link_primer);
+			goto out_put_file;
+		}
+	}
+
 	err = perf_event_set_bpf_prog(event, prog, attr->link_create.perf_event.bpf_cookie);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 71827d14724a..c6258118dd75 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14313,8 +14313,8 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	}

 	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
-	    prog->type != BPF_PROG_TYPE_LSM) {
-		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
+	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_KPROBE) {
+		verbose(env, "Only fentry/fexit/fmod_ret, lsm, and kprobe/uprobe programs can be sleepable\n");
 		return -EINVAL;
 	}

--
2.35.1


* Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 16:54 ` [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu Delyan Kratunov
@ 2022-04-28 18:19   ` Andrii Nakryiko
  2022-04-28 19:15     ` Delyan Kratunov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2022-04-28 18:19 UTC (permalink / raw)
  To: Delyan Kratunov; +Cc: daniel, ast, andrii, bpf, paulmck

On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
>
> uprobes work by raising a trap, setting a task flag from within the
> interrupt handler, and processing the actual work for the uprobe on the
> way back to userspace. As a result, uprobe handlers already execute in a
> user context. The primary obstacle to sleepable bpf uprobe programs is
> therefore on the bpf side.
>
> Namely, the bpf_prog_array attached to the uprobe is protected by normal
> rcu and runs with disabled preemption. In order for uprobe bpf programs
> to become actually sleepable, we need it to be protected by the tasks_trace
> rcu flavor instead (and kfree() called after a corresponding grace period).
>
> One way to achieve this is by tracking an array-has-contained-sleepable-prog
> flag in bpf_prog_array and switching rcu flavors based on it. However, this
> is deemed somewhat unwieldly and the rcu flavor transition would be hard
> to reason about.
>
> Instead, based on Alexei's proposal, we change the free path for
> bpf_prog_array to chain a tasks_trace and normal grace periods
> one after the other. Users who iterate under tasks_trace read section would
> be safe, as would users who iterate under normal read sections (from
> non-sleepable locations). The downside is that we take the tasks_trace latency
> unconditionally but that's deemed acceptable under expected workloads.

One example where this actually can become a problem is cgroup BPF
programs. There you can make a single attachment to the root cgroup, but it
will create one "effective" prog_array for each descendant (and will
keep destroying and creating them as child cgroups are created). So
there is this invisible multiplier which we can't really control.

So paying the (however small, but) price of call_rcu_tasks_trace() in
bpf_prog_array_free() which is used for purely non-sleepable cases
seems unfortunate. But I think an alternative to tracking this "has
sleepable" bit on a per bpf_prog_array case is to have
bpf_prog_array_free_sleepable() implementation independent of
bpf_prog_array_free() itself and call that sleepable variant from
uprobe detach handler, limiting the impact to things that actually
might be running as sleepable and which most likely won't churn
through a huge amount of arrays. WDYT?
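
(For illustration, the split could look roughly like the below, keeping
bpf_prog_array_free() exactly as it is today; sketch only:)

/* sleepable-aware variant, called only from the uprobe attach/detach path */
void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_rcu);
}

/* existing helper keeps the plain kfree_rcu() behavior */
void bpf_prog_array_free(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	kfree_rcu(progs, rcu);
}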

Otherwise all looks good and surprisingly straightforward thanks to
the fact that uprobes already run in a sleepable context, awesome!

>
> The other interesting implication is wrt non-sleepable uprobe
> programs. Because they need access to dynamically sized rcu-protected
> maps, we conditionally disable preemption and take an rcu read section
> around them, in addition to the overarching tasks_trace section.
>
> Signed-off-by: Delyan Kratunov <delyank@fb.com>
> ---
>  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
>  include/linux/trace_events.h |  1 +
>  kernel/bpf/core.c            | 10 +++++-
>  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
>  kernel/trace/trace_uprobe.c  |  4 +--
>  5 files changed, 94 insertions(+), 4 deletions(-)
>

[...]


* Re: [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach
  2022-04-28 16:54 ` [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach Delyan Kratunov
@ 2022-04-28 18:22   ` Andrii Nakryiko
  2022-04-28 18:40     ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2022-04-28 18:22 UTC (permalink / raw)
  To: Delyan Kratunov; +Cc: daniel, ast, andrii, bpf

On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
>
> uprobe and kprobe programs have the same program type, KPROBE, which is
> currently not allowed to load sleepable programs.
>
> To avoid adding a new UPROBE type, we instead allow sleepable KPROBE
> programs to load and defer the is-it-actually-a-uprobe-program check
> to attachment time, where we're already validating the corresponding
> perf_event.
>
> A corollary of this patch is that you can now load a sleepable kprobe
> program but cannot attach it.
>
> Signed-off-by: Delyan Kratunov <delyank@fb.com>
> ---
>  kernel/bpf/syscall.c  | 8 ++++++++
>  kernel/bpf/verifier.c | 4 ++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e9e3e49c0eb7..3ce923f489d7 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3009,6 +3009,14 @@ static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *pro
>         }
>
>         event = perf_file->private_data;
> +       if (prog->aux->sleepable) {
> +               if (!event->tp_event || (event->tp_event->flags & TRACE_EVENT_FL_UPROBE) == 0) {

So far TRACE_EVENT_FL_UPROBE has been contained within kernel/trace;
maybe it's better to instead expose a helper function to check whether
a perf_event represents a uprobe?
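
(Something along these lines; the helper name and placement below are
hypothetical:)

/* hypothetical helper, e.g. in include/linux/trace_events.h */
static inline bool perf_event_is_uprobe(struct perf_event *event)
{
	return event->tp_event &&
	       (event->tp_event->flags & TRACE_EVENT_FL_UPROBE);
}

so the attach path would read roughly:

	if (prog->aux->sleepable && !perf_event_is_uprobe(event)) {
		err = -EINVAL;
		bpf_link_cleanup(&link_primer);
		goto out_put_file;
	}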

> +                       err = -EINVAL;
> +                       bpf_link_cleanup(&link_primer);
> +                       goto out_put_file;
> +               }
> +       }
> +
>         err = perf_event_set_bpf_prog(event, prog, attr->link_create.perf_event.bpf_cookie);
>         if (err) {
>                 bpf_link_cleanup(&link_primer);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 71827d14724a..c6258118dd75 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -14313,8 +14313,8 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>         }
>
>         if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> -           prog->type != BPF_PROG_TYPE_LSM) {
> -               verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> +           prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_KPROBE) {
> +               verbose(env, "Only fentry/fexit/fmod_ret, lsm, and kprobe/uprobe programs can be sleepable\n");
>                 return -EINVAL;
>         }
>
> --
> 2.35.1


* Re: [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs
  2022-04-28 16:53 ` [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs Delyan Kratunov
@ 2022-04-28 18:33   ` Andrii Nakryiko
  2022-04-28 19:11     ` Delyan Kratunov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2022-04-28 18:33 UTC (permalink / raw)
  To: Delyan Kratunov; +Cc: daniel, ast, andrii, bpf

On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
>
> Add section mappings for uprobe.s and kprobe.s programs. The latter
> cannot currently attach but they're still useful to open and load in
> order to validate that prohibition.
>

This patch made me realize that some changes I did a few weeks ago
haven't landed ([0]). I'm going to rebase and resubmit and I'll ask you
to rebase on top of those changes.

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=630550&state=*

> Signed-off-by: Delyan Kratunov <delyank@fb.com>
> ---
>  tools/lib/bpf/libbpf.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 9a213aaaac8a..9e89a478d40e 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -8692,9 +8692,12 @@ static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("sk_reuseport/migrate", SK_REUSEPORT, BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
>         SEC_DEF("sk_reuseport",         SK_REUSEPORT, BPF_SK_REUSEPORT_SELECT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
>         SEC_DEF("kprobe/",              KPROBE, 0, SEC_NONE, attach_kprobe),
> +       SEC_DEF("kprobe.s/",            KPROBE, 0, SEC_SLEEPABLE, attach_kprobe),

but do we really have sleepable kprobes supported in the kernel? I
don't think so yet; let's not advertise SEC("kprobe.s") as if it's a thing
until we do

>         SEC_DEF("uprobe+",              KPROBE, 0, SEC_NONE, attach_uprobe),
> +       SEC_DEF("uprobe.s+",            KPROBE, 0, SEC_SLEEPABLE, attach_uprobe),
>         SEC_DEF("kretprobe/",           KPROBE, 0, SEC_NONE, attach_kprobe),
>         SEC_DEF("uretprobe+",           KPROBE, 0, SEC_NONE, attach_uprobe),
> +       SEC_DEF("uretprobe.s+",         KPROBE, 0, SEC_SLEEPABLE, attach_uprobe),
>         SEC_DEF("kprobe.multi/",        KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
>         SEC_DEF("kretprobe.multi/",     KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
>         SEC_DEF("usdt+",                KPROBE, 0, SEC_NONE, attach_usdt),
> @@ -10432,13 +10435,18 @@ static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf
>         const char *func_name;
>         char *func;
>         int n;
> +       bool sleepable = false;
>
>         opts.retprobe = str_has_pfx(prog->sec_name, "kretprobe/");
> +       sleepable = str_has_pfx(prog->sec_name, "kprobe.s/");
>         if (opts.retprobe)
>                 func_name = prog->sec_name + sizeof("kretprobe/") - 1;
> +       else if (sleepable)
> +               func_name = prog->sec_name + sizeof("kprobe.s/") - 1;
>         else
>                 func_name = prog->sec_name + sizeof("kprobe/") - 1;
>
> +
>         n = sscanf(func_name, "%m[a-zA-Z0-9_.]+%li", &func, &offset);
>         if (n < 1) {
>                 pr_warn("kprobe name is invalid: %s\n", func_name);
> @@ -10957,7 +10965,7 @@ static int attach_uprobe(const struct bpf_program *prog, long cookie, struct bpf
>                 break;
>         case 3:
>         case 4:
> -               opts.retprobe = strcmp(probe_type, "uretprobe") == 0;
> +               opts.retprobe = str_has_pfx(probe_type, "uretprobe");

it's a total nit but I'd feel a bit more comfortable with an explicit
check for "uretprobe" and "uretprobe.s" instead of a prefix match, if
you don't mind.

>                 if (opts.retprobe && offset != 0) {
>                         pr_warn("prog '%s': uretprobes do not support offset specification\n",
>                                 prog->name);
> --
> 2.35.1


* Re: [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach
  2022-04-28 18:22   ` Andrii Nakryiko
@ 2022-04-28 18:40     ` Alexei Starovoitov
  0 siblings, 0 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2022-04-28 18:40 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Delyan Kratunov, daniel, ast, andrii, bpf

On Thu, Apr 28, 2022 at 11:22 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > uprobe and kprobe programs have the same program type, KPROBE, which is
> > currently not allowed to load sleepable programs.
> >
> > To avoid adding a new UPROBE type, we instead allow sleepable KPROBE
> > programs to load and defer the is-it-actually-a-uprobe-program check
> > to attachment time, where we're already validating the corresponding
> > perf_event.
> >
> > A corollary of this patch is that you can now load a sleepable kprobe
> > program but cannot attach it.
> >
> > Signed-off-by: Delyan Kratunov <delyank@fb.com>
> > ---
> >  kernel/bpf/syscall.c  | 8 ++++++++
> >  kernel/bpf/verifier.c | 4 ++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index e9e3e49c0eb7..3ce923f489d7 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3009,6 +3009,14 @@ static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *pro
> >         }
> >
> >         event = perf_file->private_data;
> > +       if (prog->aux->sleepable) {
> > +               if (!event->tp_event || (event->tp_event->flags & TRACE_EVENT_FL_UPROBE) == 0) {
>
> so far TRACE_EVENT_FL_UPROBE was contained to within kernel/trace so
> far, maybe it's better to instead expose a helper function to check if
> perf_event represents uprobe?

or we can move prog->aux->sleepable check down into
perf_event_set_bpf_prog().
Which is probably cleaner.
We check other prog flags there.
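
(Roughly, i.e. a check along these lines inside perf_event_set_bpf_prog(),
next to the existing event-type checks; sketch only:)

	if (prog->aux->sleepable &&
	    !(event->tp_event->flags & TRACE_EVENT_FL_UPROBE))
		return -EINVAL;	/* only uprobes may run sleepable programs */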


* Re: [PATCH bpf-next 5/5] selftests/bpf: add tests for sleepable kprobes and uprobes
  2022-04-28 16:53 ` [PATCH bpf-next 5/5] selftests/bpf: add tests for sleepable kprobes and uprobes Delyan Kratunov
@ 2022-04-28 18:41   ` Andrii Nakryiko
  0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2022-04-28 18:41 UTC (permalink / raw)
  To: Delyan Kratunov; +Cc: daniel, ast, andrii, bpf

On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
>
> Add tests that ensure sleepable kprobe programs cannot attach.
>
> Also attach both sleepable and non-sleepable uprobe programs to the same
> location (i.e. same bpf_prog_array).
>

Yep, great thinking! I left a few comments below, otherwise looks good.

> Signed-off-by: Delyan Kratunov <delyank@fb.com>
> ---
>  .../selftests/bpf/prog_tests/attach_probe.c   | 35 +++++++++++++++
>  .../selftests/bpf/progs/test_attach_probe.c   | 44 +++++++++++++++++++
>  2 files changed, 79 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
> index c0c6d410751d..c5c601537eea 100644
> --- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
> +++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
> @@ -17,6 +17,12 @@ static void trigger_func2(void)
>         asm volatile ("");
>  }
>
> +/* attach point for byname sleepable uprobe */
> +static void trigger_func3(void)
> +{
> +       asm volatile ("");
> +}
> +
>  void test_attach_probe(void)
>  {
>         DECLARE_LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts);
> @@ -143,6 +149,28 @@ void test_attach_probe(void)
>         if (!ASSERT_OK_PTR(skel->links.handle_uretprobe_byname2, "attach_uretprobe_byname2"))
>                 goto cleanup;
>
> +       /* sleepable kprobes should not attach successfully */
> +       skel->links.handle_kprobe_sleepable = bpf_program__attach(skel->progs.handle_kprobe_sleepable);
> +       if (!ASSERT_NULL(skel->links.handle_kprobe_sleepable, "attach_kprobe_sleepable"))

we have ASSERT_ERR_PTR() which is more in line with ASSERT_OK_PTR(),
let's use that one.

With SEC("kprobe.s") dropped, you'll have to do one extra step to make
sure that handle_kprobe_sleepable is actually a sleepable program during
BPF verification. Please use bpf_program__set_flags() before the load step
for that.
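
(Roughly, assuming the skeleton open and load steps are split so the
flag can be set in between; sketch only:)

	skel = test_attach_probe__open();
	if (!ASSERT_OK_PTR(skel, "skel_open"))
		return;

	bpf_program__set_flags(skel->progs.handle_kprobe_sleepable,
			       bpf_program__flags(skel->progs.handle_kprobe_sleepable) |
			       BPF_F_SLEEPABLE);

	if (!ASSERT_OK(test_attach_probe__load(skel), "skel_load"))
		goto cleanup;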

> +               goto cleanup;
> +
> +       /* test sleepable uprobe and uretprobe variants */
> +       skel->links.handle_uprobe_byname3_sleepable = bpf_program__attach(skel->progs.handle_uprobe_byname3_sleepable);
> +       if (!ASSERT_OK_PTR(skel->links.handle_uprobe_byname3_sleepable, "attach_uprobe_byname3_sleepable"))
> +               goto cleanup;
> +

[...]

> diff --git a/tools/testing/selftests/bpf/progs/test_attach_probe.c b/tools/testing/selftests/bpf/progs/test_attach_probe.c
> index af994d16bb10..265a23af74d4 100644
> --- a/tools/testing/selftests/bpf/progs/test_attach_probe.c
> +++ b/tools/testing/selftests/bpf/progs/test_attach_probe.c
> @@ -14,6 +14,10 @@ int uprobe_byname_res = 0;
>  int uretprobe_byname_res = 0;
>  int uprobe_byname2_res = 0;
>  int uretprobe_byname2_res = 0;
> +int uprobe_byname3_sleepable_res = 0;
> +int uprobe_byname3_res = 0;
> +int uretprobe_byname3_sleepable_res = 0;
> +int uretprobe_byname3_res = 0;
>
>  SEC("kprobe/sys_nanosleep")
>  int handle_kprobe(struct pt_regs *ctx)
> @@ -22,6 +26,13 @@ int handle_kprobe(struct pt_regs *ctx)
>         return 0;
>  }
>
> +SEC("kprobe.s/sys_nanosleep")

can you please leave a comment here that this one is supposed to fail to
attach? It took me a bit to notice that you do a negative test with
this program

> +int handle_kprobe_sleepable(struct pt_regs *ctx)
> +{
> +       kprobe_res = 2;
> +       return 0;
> +}
> +

[...]


* Re: [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs
  2022-04-28 18:33   ` Andrii Nakryiko
@ 2022-04-28 19:11     ` Delyan Kratunov
  0 siblings, 0 replies; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 19:11 UTC (permalink / raw)
  To: andrii.nakryiko; +Cc: daniel, ast, andrii, bpf

On Thu, 2022-04-28 at 11:33 -0700, Andrii Nakryiko wrote:
> On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > 
> > Add section mappings for uprobe.s and kprobe.s programs. The latter
> > cannot currently attach but they're still useful to open and load in
> > order to validate that prohibition.
> > 
> 
> This patch made me realize that some changes I did few weeks ago
> hasn't landed ([0]). I'm going to rebase and resubmit and I'll ask you
> to rebase on top of those changes.
> 
>   [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=630550&state=*
> 
> > Signed-off-by: Delyan Kratunov <delyank@fb.com>
> > ---
> >  tools/lib/bpf/libbpf.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 9a213aaaac8a..9e89a478d40e 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -8692,9 +8692,12 @@ static const struct bpf_sec_def section_defs[] = {
> >         SEC_DEF("sk_reuseport/migrate", SK_REUSEPORT, BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
> >         SEC_DEF("sk_reuseport",         SK_REUSEPORT, BPF_SK_REUSEPORT_SELECT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
> >         SEC_DEF("kprobe/",              KPROBE, 0, SEC_NONE, attach_kprobe),
> > +       SEC_DEF("kprobe.s/",            KPROBE, 0, SEC_SLEEPABLE, attach_kprobe),
> 
> but do we really have sleepable kprobes supported in the kernel? I
> don't think yet, let's not advertise as if SEC("kprobe.s") is a thing
> until we do

Fair enough, I was being lazy. I'll have the test explicitly set flags/type, it's not
that hard anyway.

> >         SEC_DEF("uprobe+",              KPROBE, 0, SEC_NONE, attach_uprobe),
> > +       SEC_DEF("uprobe.s+",            KPROBE, 0, SEC_SLEEPABLE, attach_uprobe),
> >         SEC_DEF("kretprobe/",           KPROBE, 0, SEC_NONE, attach_kprobe),
> >         SEC_DEF("uretprobe+",           KPROBE, 0, SEC_NONE, attach_uprobe),
> > +       SEC_DEF("uretprobe.s+",         KPROBE, 0, SEC_SLEEPABLE, attach_uprobe),
> >         SEC_DEF("kprobe.multi/",        KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
> >         SEC_DEF("kretprobe.multi/",     KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi),
> >         SEC_DEF("usdt+",                KPROBE, 0, SEC_NONE, attach_usdt),
> > @@ -10432,13 +10435,18 @@ static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf
> >         const char *func_name;
> >         char *func;
> >         int n;
> > +       bool sleepable = false;
> > 
> >         opts.retprobe = str_has_pfx(prog->sec_name, "kretprobe/");
> > +       sleepable = str_has_pfx(prog->sec_name, "kprobe.s/");
> >         if (opts.retprobe)
> >                 func_name = prog->sec_name + sizeof("kretprobe/") - 1;
> > +       else if (sleepable)
> > +               func_name = prog->sec_name + sizeof("kprobe.s/") - 1;
> >         else
> >                 func_name = prog->sec_name + sizeof("kprobe/") - 1;
> > 
> > +
> >         n = sscanf(func_name, "%m[a-zA-Z0-9_.]+%li", &func, &offset);
> >         if (n < 1) {
> >                 pr_warn("kprobe name is invalid: %s\n", func_name);
> > @@ -10957,7 +10965,7 @@ static int attach_uprobe(const struct bpf_program *prog, long cookie, struct bpf
> >                 break;
> >         case 3:
> >         case 4:
> > -               opts.retprobe = strcmp(probe_type, "uretprobe") == 0;
> > +               opts.retprobe = str_has_pfx(probe_type, "uretprobe");
> 
> it's a total nit but I'd feel a bit more comfortable with explicit
> check for "uretprobe" and "uretprobe.s" instead of a prefix match, if
> you don't mind.

Sure.

> 
> >                 if (opts.retprobe && offset != 0) {
> >                         pr_warn("prog '%s': uretprobes do not support offset specification\n",
> >                                 prog->name);
> > --
> > 2.35.1



* Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 18:19   ` Andrii Nakryiko
@ 2022-04-28 19:15     ` Delyan Kratunov
  2022-04-28 20:58       ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 19:15 UTC (permalink / raw)
  To: andrii.nakryiko; +Cc: daniel, paulmck, ast, andrii, bpf

On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > 
> > uprobes work by raising a trap, setting a task flag from within the
> > interrupt handler, and processing the actual work for the uprobe on the
> > way back to userspace. As a result, uprobe handlers already execute in a
> > user context. The primary obstacle to sleepable bpf uprobe programs is
> > therefore on the bpf side.
> > 
> > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > to become actually sleepable, we need it to be protected by the tasks_trace
> > rcu flavor instead (and kfree() called after a corresponding grace period).
> > 
> > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > is deemed somewhat unwieldly and the rcu flavor transition would be hard
> > to reason about.
> > 
> > Instead, based on Alexei's proposal, we change the free path for
> > bpf_prog_array to chain a tasks_trace and normal grace periods
> > one after the other. Users who iterate under tasks_trace read section would
> > be safe, as would users who iterate under normal read sections (from
> > non-sleepable locations). The downside is that we take the tasks_trace latency
> > unconditionally but that's deemed acceptable under expected workloads.
> 
> One example where this actually can become a problem is cgroup BPF
> programs. There you can make single attachment to root cgroup, but it
> will create one "effective" prog_array for each descendant (and will
> keep destroying and creating them as child cgroups are created). So
> there is this invisible multiplier which we can't really control.
> 
> So paying the (however small, but) price of call_rcu_tasks_trace() in
> bpf_prog_array_free() which is used for purely non-sleepable cases
> seems unfortunate. But I think an alternative to tracking this "has
> sleepable" bit on a per bpf_prog_array case is to have
> bpf_prog_array_free_sleepable() implementation independent of
> bpf_prog_array_free() itself and call that sleepable variant from
> uprobe detach handler, limiting the impact to things that actually
> might be running as sleepable and which most likely won't churn
> through a huge amount of arrays. WDYT?

Honestly, I don't like the idea of having two different APIs, where if you use the
wrong one, the program would only fail in rare and undebuggable circumstances. 

If we need specialization (and I'm not convinced we do - what's the rate of cgroup
creation that we can sustain?), we should track that in the bpf_prog_array itself. We
can have the allocation path set a flag and branch on it in free() to determine the
necessary grace periods.
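
(I.e. something like the below, purely as a sketch of that alternative;
the flag name is made up:)

struct bpf_prog_array {
	struct rcu_head rcu;
	bool sleepable_users;	/* hypothetical: set when sleepable progs can be attached */
	struct bpf_prog_array_item items[];
};

void bpf_prog_array_free(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	if (progs->sleepable_users)
		call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_rcu);
	else
		kfree_rcu(progs, rcu);
}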

> 
> Otherwise all looks good and surprisingly straightforward thanks to
> the fact uprobe is already running in a sleepable context, awesome!
> 
> > 
> > The other interesting implication is wrt non-sleepable uprobe
> > programs. Because they need access to dynamically sized rcu-protected
> > maps, we conditionally disable preemption and take an rcu read section
> > around them, in addition to the overarching tasks_trace section.
> > 
> > Signed-off-by: Delyan Kratunov <delyank@fb.com>
> > ---
> >  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
> >  include/linux/trace_events.h |  1 +
> >  kernel/bpf/core.c            | 10 +++++-
> >  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
> >  kernel/trace/trace_uprobe.c  |  4 +--
> >  5 files changed, 94 insertions(+), 4 deletions(-)
> > 
> 
> [...]



* Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 19:15     ` Delyan Kratunov
@ 2022-04-28 20:58       ` Alexei Starovoitov
  2022-04-28 21:35         ` Delyan Kratunov
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2022-04-28 20:58 UTC (permalink / raw)
  To: Delyan Kratunov; +Cc: andrii.nakryiko, daniel, paulmck, ast, andrii, bpf

On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
>
> On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > uprobes work by raising a trap, setting a task flag from within the
> > > interrupt handler, and processing the actual work for the uprobe on the
> > > way back to userspace. As a result, uprobe handlers already execute in a
> > > user context. The primary obstacle to sleepable bpf uprobe programs is
> > > therefore on the bpf side.
> > >
> > > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > > to become actually sleepable, we need it to be protected by the tasks_trace
> > > rcu flavor instead (and kfree() called after a corresponding grace period).
> > >
> > > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > > is deemed somewhat unwieldly and the rcu flavor transition would be hard
> > > to reason about.
> > >
> > > Instead, based on Alexei's proposal, we change the free path for
> > > bpf_prog_array to chain a tasks_trace and normal grace periods
> > > one after the other. Users who iterate under tasks_trace read section would
> > > be safe, as would users who iterate under normal read sections (from
> > > non-sleepable locations). The downside is that we take the tasks_trace latency
> > > unconditionally but that's deemed acceptable under expected workloads.
> >
> > One example where this actually can become a problem is cgroup BPF
> > programs. There you can make single attachment to root cgroup, but it
> > will create one "effective" prog_array for each descendant (and will
> > keep destroying and creating them as child cgroups are created). So
> > there is this invisible multiplier which we can't really control.
> >
> > So paying the (however small, but) price of call_rcu_tasks_trace() in
> > bpf_prog_array_free() which is used for purely non-sleepable cases
> > seems unfortunate. But I think an alternative to tracking this "has
> > sleepable" bit on a per bpf_prog_array case is to have
> > bpf_prog_array_free_sleepable() implementation independent of
> > bpf_prog_array_free() itself and call that sleepable variant from
> > uprobe detach handler, limiting the impact to things that actually
> > might be running as sleepable and which most likely won't churn
> > through a huge amount of arrays. WDYT?
>
> Honestly, I don't like the idea of having two different APIs, where if you use the
> wrong one, the program would only fail in rare and undebuggable circumstances.
>
> If we need specialization (and I'm not convinced we do - what's the rate of cgroup
> creation that we can sustain?), we should track that in the bpf_prog_array itself. We
> can have the allocation path set a flag and branch on it in free() to determine the
> necessary grace periods.

I think what Andrii is proposing is to leave bpf_prog_array_free() as-is
and introduce a new bpf_prog_array_free_sleepable() (the way it is in
this patch) and call it from 2 places in kernel/trace/bpf_trace.c only.
This way cgroup won't be affected.

The rcu_trace protection here applies to prog_array itself.
Normal progs are still rcu, sleepable progs are rcu_trace.
Regardless of whether they're called via a trampoline or this new prog_array.


* Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 20:58       ` Alexei Starovoitov
@ 2022-04-28 21:35         ` Delyan Kratunov
  2022-04-28 22:52           ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Delyan Kratunov @ 2022-04-28 21:35 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: daniel, paulmck, ast, andrii, andrii.nakryiko, bpf

On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
> > 
> > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > 
> > > > uprobes work by raising a trap, setting a task flag from within the
> > > > interrupt handler, and processing the actual work for the uprobe on the
> > > > way back to userspace. As a result, uprobe handlers already execute in a
> > > > user context. The primary obstacle to sleepable bpf uprobe programs is
> > > > therefore on the bpf side.
> > > > 
> > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > > > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > > > to become actually sleepable, we need it to be protected by the tasks_trace
> > > > rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > 
> > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > > > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > > > is deemed somewhat unwieldy and the rcu flavor transition would be hard
> > > > to reason about.
> > > > 
> > > > Instead, based on Alexei's proposal, we change the free path for
> > > > bpf_prog_array to chain a tasks_trace and normal grace periods
> > > > one after the other. Users who iterate under tasks_trace read section would
> > > > be safe, as would users who iterate under normal read sections (from
> > > > non-sleepable locations). The downside is that we take the tasks_trace latency
> > > > unconditionally but that's deemed acceptable under expected workloads.
> > > 
> > > One example where this actually can become a problem is cgroup BPF
> > > programs. There you can make single attachment to root cgroup, but it
> > > will create one "effective" prog_array for each descendant (and will
> > > keep destroying and creating them as child cgroups are created). So
> > > there is this invisible multiplier which we can't really control.
> > > 
> > > So paying the (however small, but) price of call_rcu_tasks_trace() in
> > > bpf_prog_array_free() which is used for purely non-sleepable cases
> > > seems unfortunate. But I think an alternative to tracking this "has
> > > sleepable" bit on a per bpf_prog_array case is to have
> > > bpf_prog_array_free_sleepable() implementation independent of
> > > bpf_prog_array_free() itself and call that sleepable variant from
> > > uprobe detach handler, limiting the impact to things that actually
> > > might be running as sleepable and which most likely won't churn
> > > through a huge amount of arrays. WDYT?
> > 
> > Honestly, I don't like the idea of having two different APIs, where if you use the
> > wrong one, the program would only fail in rare and undebuggable circumstances.
> > 
> > If we need specialization (and I'm not convinced we do - what's the rate of cgroup
> > creation that we can sustain?), we should track that in the bpf_prog_array itself. We
> > can have the allocation path set a flag and branch on it in free() to determine the
> > necessary grace periods.
> 
> I think what Andrii is proposing is to leave bpf_prog_array_free() as-is
> and introduce new bpf_prog_array_free_sleepable() (the way it is in
> this patch) and call it from 2 places in kernel/trace/bpf_trace.c only.
> This way cgroup won't be affected.
> 
> The rcu_trace protection here applies to prog_array itself.
> Normal progs are still rcu, sleepable progs are rcu_trace.
> Regardless whether they're called via trampoline or this new prog_array.

Right, I understand the proposal. My objection is that if tomorrow someone mistakenly
keeps using bpf_prog_array_free when they actually need
bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and
reason about. I can make it correct in this patch series but *keeping* things correct
going forward will be harder when there's two free variants.

Instead, we can have an ARRAY_USE_TRACE_RCU flag which automatically chains the grace
periods inside bpf_prog_array_free. That way we eliminate potential future confusion
and instead of introducing subtle rcu bugs, at worst you can incur a performance
penalty by using the flag where you don't need it. If we spend the time to one-way
flip the flag only when you actually insert a sleepable program into the array, even
that penalty is eliminated. 
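
To make it concrete, I'm thinking of something like this (purely a sketch;
the flag bit, the extra field and the callback are all hypothetical):

struct bpf_prog_array {
        struct rcu_head rcu;
        u32 flags;                      /* hypothetical ARRAY_USE_TRACE_RCU bit */
        struct bpf_prog_array_item items[];
};

static void bpf_prog_array_chain_gp_cb(struct rcu_head *rcu)
{
        struct bpf_prog_array *progs;

        /* tasks_trace GP is done, now chain a normal RCU GP */
        progs = container_of(rcu, struct bpf_prog_array, rcu);
        kfree_rcu(progs, rcu);
}

void bpf_prog_array_free(struct bpf_prog_array *progs)
{
        if (!progs)
                return;
        if (READ_ONCE(progs->flags) & ARRAY_USE_TRACE_RCU)
                call_rcu_tasks_trace(&progs->rcu, bpf_prog_array_chain_gp_cb);
        else
                kfree_rcu(progs, rcu);
}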

The memory cost of this maintainability win is 4 more bytes per array object (less
if we pack the flag somewhere, but that's a worse idea performance-wise). Given how
much we already allocate, I think that's fair, but I'm happy to discuss other, less
foot-gun-y ideas if memory usage is a constraint.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 21:35         ` Delyan Kratunov
@ 2022-04-28 22:52           ` Alexei Starovoitov
  2022-05-02 17:31             ` Delyan Kratunov
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2022-04-28 22:52 UTC (permalink / raw)
  To: Delyan Kratunov; +Cc: daniel, paulmck, ast, andrii, andrii.nakryiko, bpf

On Thu, Apr 28, 2022 at 2:35 PM Delyan Kratunov <delyank@fb.com> wrote:
>
> On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> > On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > >
> > > > > uprobes work by raising a trap, setting a task flag from within the
> > > > > interrupt handler, and processing the actual work for the uprobe on the
> > > > > way back to userspace. As a result, uprobe handlers already execute in a
> > > > > user context. The primary obstacle to sleepable bpf uprobe programs is
> > > > > therefore on the bpf side.
> > > > >
> > > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > > > > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > > > > to become actually sleepable, we need it to be protected by the tasks_trace
> > > > > rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > >
> > > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > > > > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > > > > is deemed somewhat unwieldy and the rcu flavor transition would be hard
> > > > > to reason about.
> > > > >
> > > > > Instead, based on Alexei's proposal, we change the free path for
> > > > > bpf_prog_array to chain a tasks_trace and normal grace periods
> > > > > one after the other. Users who iterate under tasks_trace read section would
> > > > > be safe, as would users who iterate under normal read sections (from
> > > > > non-sleepable locations). The downside is that we take the tasks_trace latency
> > > > > unconditionally but that's deemed acceptable under expected workloads.
> > > >
> > > > One example where this actually can become a problem is cgroup BPF
> > > > programs. There you can make single attachment to root cgroup, but it
> > > > will create one "effective" prog_array for each descendant (and will
> > > > keep destroying and creating them as child cgroups are created). So
> > > > there is this invisible multiplier which we can't really control.
> > > >
> > > > So paying the (however small, but) price of call_rcu_tasks_trace() in
> > > > bpf_prog_array_free() which is used for purely non-sleepable cases
> > > > seems unfortunate. But I think an alternative to tracking this "has
> > > > sleepable" bit on a per bpf_prog_array case is to have
> > > > bpf_prog_array_free_sleepable() implementation independent of
> > > > bpf_prog_array_free() itself and call that sleepable variant from
> > > > uprobe detach handler, limiting the impact to things that actually
> > > > might be running as sleepable and which most likely won't churn
> > > > through a huge amount of arrays. WDYT?
> > >
> > > Honestly, I don't like the idea of having two different APIs, where if you use the
> > > wrong one, the program would only fail in rare and undebuggable circumstances.
> > >
> > > If we need specialization (and I'm not convinced we do - what's the rate of cgroup
> > > creation that we can sustain?), we should track that in the bpf_prog_array itself. We
> > > can have the allocation path set a flag and branch on it in free() to determine the
> > > necessary grace periods.
> >
> > I think what Andrii is proposing is to leave bpf_prog_array_free() as-is
> > and introduce new bpf_prog_array_free_sleepable() (the way it is in
> > this patch) and call it from 2 places in kernel/trace/bpf_trace.c only.
> > This way cgroup won't be affected.
> >
> > The rcu_trace protection here applies to prog_array itself.
> > Normal progs are still rcu, sleepable progs are rcu_trace.
> > Regardless whether they're called via trampoline or this new prog_array.
>
> Right, I understand the proposal. My objection is that if tomorrow someone mistakenly
> keeps using bpf_prog_array_free when they actually need
> bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and
> reason about. I can make it correct in this patch series but *keeping* things correct
> going forward will be harder when there's two free variants.

This is the kind of mistake code reviews are supposed to catch.

> Instead, we can have an ARRAY_USE_TRACE_RCU flag which automatically chains the grace
> periods inside bpf_prog_array_free. That way we eliminate potential future confusion
> and instead of introducing subtle rcu bugs, at worst you can incur a performance
> penalty by using the flag where you don't need it. If we spend the time to one-way
> flip the flag only when you actually insert a sleepable program into the array, even
> that penalty is eliminated.

Are you suggesting to add such a flag automatically when
sleepable bpf progs are added to the prog_array?
That would not be correct.
The presence of sleepable progs has nothing to do with the prog_array itself.
bpf_prog_array_free[_sleepable]() frees that array,
not the progs inside it.

If you mean setting this flag when the prog_array is allocated
for uprobe attachment, then I don't see how it helps code reviews.
Nothing is automatic here. It's one place and a single-purpose flag.
Instead of dynamically checking it in free(),
we can directly call the correct function.
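
In diff terms it's just something like this at the array teardown spots in
kernel/trace/bpf_trace.c (illustrative, not the exact hunk from the series):

-	bpf_prog_array_free(old_array);
+	bpf_prog_array_free_sleepable(old_array);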

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
  2022-04-28 22:52           ` Alexei Starovoitov
@ 2022-05-02 17:31             ` Delyan Kratunov
  0 siblings, 0 replies; 17+ messages in thread
From: Delyan Kratunov @ 2022-05-02 17:31 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: daniel, paulmck, ast, andrii, andrii.nakryiko, bpf

On Thu, 2022-04-28 at 15:52 -0700, Alexei Starovoitov wrote:
> On Thu, Apr 28, 2022 at 2:35 PM Delyan Kratunov <delyank@fb.com> wrote:
> > 
> > On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> > > On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > 
> > > > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > > 
> > > > > > uprobes work by raising a trap, setting a task flag from within the
> > > > > > interrupt handler, and processing the actual work for the uprobe on the
> > > > > > way back to userspace. As a result, uprobe handlers already execute in a
> > > > > > user context. The primary obstacle to sleepable bpf uprobe programs is
> > > > > > therefore on the bpf side.
> > > > > > 
> > > > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal
> > > > > > rcu and runs with disabled preemption. In order for uprobe bpf programs
> > > > > > to become actually sleepable, we need it to be protected by the tasks_trace
> > > > > > rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > > > 
> > > > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog
> > > > > > flag in bpf_prog_array and switching rcu flavors based on it. However, this
> > > > > > is deemed somewhat unwieldy and the rcu flavor transition would be hard
> > > > > > to reason about.
> > > > > > 
> > > > > > Instead, based on Alexei's proposal, we change the free path for
> > > > > > bpf_prog_array to chain a tasks_trace and normal grace periods
> > > > > > one after the other. Users who iterate under tasks_trace read section would
> > > > > > be safe, as would users who iterate under normal read sections (from
> > > > > > non-sleepable locations). The downside is that we take the tasks_trace latency
> > > > > > unconditionally but that's deemed acceptable under expected workloads.
> > > > > 
> > > > > One example where this actually can become a problem is cgroup BPF
> > > > > programs. There you can make single attachment to root cgroup, but it
> > > > > will create one "effective" prog_array for each descendant (and will
> > > > > keep destroying and creating them as child cgroups are created). So
> > > > > there is this invisible multiplier which we can't really control.
> > > > > 
> > > > > So paying the (however small, but) price of call_rcu_tasks_trace() in
> > > > > bpf_prog_array_free() which is used for purely non-sleepable cases
> > > > > seems unfortunate. But I think an alternative to tracking this "has
> > > > > sleepable" bit on a per bpf_prog_array case is to have
> > > > > bpf_prog_array_free_sleepable() implementation independent of
> > > > > bpf_prog_array_free() itself and call that sleepable variant from
> > > > > uprobe detach handler, limiting the impact to things that actually
> > > > > might be running as sleepable and which most likely won't churn
> > > > > through a huge amount of arrays. WDYT?
> > > > 
> > > > Honestly, I don't like the idea of having two different APIs, where if you use the
> > > > wrong one, the program would only fail in rare and undebuggable circumstances.
> > > > 
> > > > If we need specialization (and I'm not convinced we do - what's the rate of cgroup
> > > > creation that we can sustain?), we should track that in the bpf_prog_array itself. We
> > > > can have the allocation path set a flag and branch on it in free() to determine the
> > > > necessary grace periods.
> > > 
> > > I think what Andrii is proposing is to leave bpf_prog_array_free() as-is
> > > and introduce new bpf_prog_array_free_sleepable() (the way it is in
> > > this patch) and call it from 2 places in kernel/trace/bpf_trace.c only.
> > > This way cgroup won't be affected.
> > > 
> > > The rcu_trace protection here applies to prog_array itself.
> > > Normal progs are still rcu, sleepable progs are rcu_trace.
> > > Regardless whether they're called via trampoline or this new prog_array.
> > 
> > Right, I understand the proposal. My objection is that if tomorrow someone mistakenly
> > keeps using bpf_prog_array_free when they actually need
> > bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and
> > reason about. I can make it correct in this patch series but *keeping* things correct
> > going forward will be harder when there's two free variants.
> 
> This is the kind of mistake code reviews are supposed to catch.

You definitely trust code review more than I do! :) 

> > Instead, we can have an ARRAY_USE_TRACE_RCU flag which automatically chains the grace
> > periods inside bpf_prog_array_free. That way we eliminate potential future confusion
> > and instead of introducing subtle rcu bugs, at worst you can incur a performance
> > penalty by using the flag where you don't need it. If we spend the time to one-way
> > flip the flag only when you actually insert a sleepable program into the array, even
> > that penalty is eliminated.
> 
> Are you suggesting to add such a flag automatically when
> sleepable bpf progs are added to the prog_array?
> That would not be correct.
> The presence of sleepable progs has nothing to do with the prog_array itself.
> bpf_prog_array_free[_sleepable]() frees that array,
> not the progs inside it.

The only reason the array is under the tasks_trace rcu is the presence of sleepable
programs inside, so I'd say it is related. I'm reasonably certain we could
orchestrate the transition from one rcu flavor to the other safely. Alternatively, we
can have two rcu_heads and only take the tasks_trace one conditionally.
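
E.g. (again purely hypothetical, just to show the shape):

struct bpf_prog_array {
        struct rcu_head rcu;            /* normal RCU, used by kfree_rcu() */
        struct rcu_head rcu_trace;      /* hypothetical: only used when the
                                         * tasks_trace flavor is needed
                                         */
        u32 flags;
        struct bpf_prog_array_item items[];
};

with free() doing call_rcu_tasks_trace(&progs->rcu_trace, ...) and chaining
into kfree_rcu(progs, rcu) from that callback when the flag is set, or
calling kfree_rcu(progs, rcu) directly otherwise.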

> If you mean setting this flag when the prog_array is allocated
> for uprobe attachment, then I don't see how it helps code reviews.
> Nothing is automatic here. It's one place and a single-purpose flag.
> Instead of dynamically checking it in free(),
> we can directly call the correct function.

Sure, I'll send a v2 with free_sleepable later today.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-05-02 17:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-28 16:53 [PATCH bpf-next 0/5] sleepable uprobe support Delyan Kratunov
2022-04-28 16:53 ` [PATCH bpf-next 4/5] libbpf: add support for sleepable kprobe and uprobe programs Delyan Kratunov
2022-04-28 18:33   ` Andrii Nakryiko
2022-04-28 19:11     ` Delyan Kratunov
2022-04-28 16:53 ` [PATCH bpf-next 5/5] selftests/bpf: add tests for sleepable kprobes and uprobes Delyan Kratunov
2022-04-28 18:41   ` Andrii Nakryiko
2022-04-28 16:53 ` [PATCH bpf-next 1/5] bpf: move bpf_prog to bpf.h Delyan Kratunov
2022-04-28 16:54 ` [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu Delyan Kratunov
2022-04-28 18:19   ` Andrii Nakryiko
2022-04-28 19:15     ` Delyan Kratunov
2022-04-28 20:58       ` Alexei Starovoitov
2022-04-28 21:35         ` Delyan Kratunov
2022-04-28 22:52           ` Alexei Starovoitov
2022-05-02 17:31             ` Delyan Kratunov
2022-04-28 16:54 ` [PATCH bpf-next 3/5] bpf: allow sleepable uprobe programs to attach Delyan Kratunov
2022-04-28 18:22   ` Andrii Nakryiko
2022-04-28 18:40     ` Alexei Starovoitov
