BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs
@ 2020-06-30  4:33 Alexei Starovoitov
  2020-06-30  4:33 ` [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30  4:33 UTC (permalink / raw)
  To: davem; +Cc: daniel, paulmck, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

v4->v5:
- addressed Andrii's feedback.

v3->v4:
- fixed synchronize_rcu_tasks_trace() usage and accelerated with synchronize_rcu_mult().
- removed redundant synchronize_rcu(). Otherwise it won't be clear why
  synchronize_rcu_tasks_trace() is not needed in free_map callbacks.
- improved test coverage.

v2->v3:
- switched to rcu_trace
- added bpf_copy_from_user

Here is 'perf report' differences:
sleepable with SRCU:
   3.86%  bench     [k] __srcu_read_unlock
   3.22%  bench     [k] __srcu_read_lock
   0.92%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
   0.50%  bench     [k] bpf_trampoline_10297
   0.26%  bench     [k] __bpf_prog_exit_sleepable
   0.21%  bench     [k] __bpf_prog_enter_sleepable

sleepable with RCU_TRACE:
   0.79%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
   0.72%  bench     [k] bpf_trampoline_10381
   0.31%  bench     [k] __bpf_prog_exit_sleepable
   0.29%  bench     [k] __bpf_prog_enter_sleepable

non-sleepable with RCU:
   0.88%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry
   0.84%  bench     [k] bpf_trampoline_10297
   0.13%  bench     [k] __bpf_prog_enter
   0.12%  bench     [k] __bpf_prog_exit

Happy to confirm that rcu_trace overhead is negligible.

v1->v2:
- split fmod_ret fix into separate patch
- added blacklist

v1:
This patch set introduces the minimal viable support for sleepable bpf programs.
In this patch only fentry/fexit/fmod_ret and lsm progs can be sleepable.
Only array and pre-allocated hash and lru maps allowed.

Alexei Starovoitov (5):
  bpf: Remove redundant synchronize_rcu.
  bpf: Introduce sleepable BPF programs
  bpf: Add bpf_copy_from_user() helper.
  libbpf: support sleepable progs
  selftests/bpf: Add sleepable tests

 arch/x86/net/bpf_jit_comp.c                   | 32 +++++----
 include/linux/bpf.h                           |  4 ++
 include/uapi/linux/bpf.h                      | 19 +++++-
 init/Kconfig                                  |  1 +
 kernel/bpf/arraymap.c                         | 10 +--
 kernel/bpf/hashtab.c                          | 20 +++---
 kernel/bpf/helpers.c                          | 22 +++++++
 kernel/bpf/lpm_trie.c                         |  5 --
 kernel/bpf/queue_stack_maps.c                 |  7 --
 kernel/bpf/reuseport_array.c                  |  2 -
 kernel/bpf/ringbuf.c                          |  7 --
 kernel/bpf/stackmap.c                         |  3 -
 kernel/bpf/syscall.c                          | 13 +++-
 kernel/bpf/trampoline.c                       | 28 +++++++-
 kernel/bpf/verifier.c                         | 62 ++++++++++++++++-
 kernel/trace/bpf_trace.c                      |  2 +
 tools/include/uapi/linux/bpf.h                | 19 +++++-
 tools/lib/bpf/libbpf.c                        | 25 ++++++-
 tools/testing/selftests/bpf/bench.c           |  2 +
 .../selftests/bpf/benchs/bench_trigger.c      | 17 +++++
 .../selftests/bpf/prog_tests/test_lsm.c       |  9 +++
 tools/testing/selftests/bpf/progs/lsm.c       | 66 ++++++++++++++++++-
 .../selftests/bpf/progs/trigger_bench.c       |  7 ++
 23 files changed, 315 insertions(+), 67 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu.
  2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
@ 2020-06-30  4:33 ` Alexei Starovoitov
  2020-06-30 20:55   ` Paul E. McKenney
  2020-06-30  4:33 ` [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30  4:33 UTC (permalink / raw)
  To: davem; +Cc: daniel, paulmck, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_free_used_maps() or close(map_fd) will trigger map_free callback.
bpf_free_used_maps() is called after bpf prog is no longer executing:
bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps.
Hence there is no need to call synchronize_rcu() to protect map elements.

Note that hash_of_maps and array_of_maps update/delete inner maps via
sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
---
 kernel/bpf/arraymap.c         | 9 ---------
 kernel/bpf/hashtab.c          | 8 +++-----
 kernel/bpf/lpm_trie.c         | 5 -----
 kernel/bpf/queue_stack_maps.c | 7 -------
 kernel/bpf/reuseport_array.c  | 2 --
 kernel/bpf/ringbuf.c          | 7 -------
 kernel/bpf/stackmap.c         | 3 ---
 7 files changed, 3 insertions(+), 38 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index ec5cd11032aa..c66e8273fccd 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -386,13 +386,6 @@ static void array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 
-	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
-	 * so the programs (can be more than one that used this map) were
-	 * disconnected from events. Wait for outstanding programs to complete
-	 * and free the array
-	 */
-	synchronize_rcu();
-
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
@@ -546,8 +539,6 @@ static void fd_array_map_free(struct bpf_map *map)
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
 
-	synchronize_rcu();
-
 	/* make sure it's empty */
 	for (i = 0; i < array->map.max_entries; i++)
 		BUG_ON(array->ptrs[i] != NULL);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index acd06081d81d..d4378d7d442b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1290,12 +1290,10 @@ static void htab_map_free(struct bpf_map *map)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 
-	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
-	 * so the programs (can be more than one that used this map) were
-	 * disconnected from events. Wait for outstanding critical sections in
-	 * these programs to complete
+	/* bpf_free_used_maps() or close(map_fd) will trigger this map_free callback.
+	 * bpf_free_used_maps() is called after bpf prog is no longer executing.
+	 * There is no need to synchronize_rcu() here to protect map elements.
 	 */
-	synchronize_rcu();
 
 	/* some of free_htab_elem() callbacks for elements of this map may
 	 * not have executed. Wait for them.
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 1abd4e3f906d..44474bf3ab7a 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -589,11 +589,6 @@ static void trie_free(struct bpf_map *map)
 	struct lpm_trie_node __rcu **slot;
 	struct lpm_trie_node *node;
 
-	/* Wait for outstanding programs to complete
-	 * update/lookup/delete/get_next_key and free the trie.
-	 */
-	synchronize_rcu();
-
 	/* Always start at the root and walk down to a node that has no
 	 * children. Then free that node, nullify its reference in the parent
 	 * and start over.
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 80c66a6d7c54..44184f82916a 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -101,13 +101,6 @@ static void queue_stack_map_free(struct bpf_map *map)
 {
 	struct bpf_queue_stack *qs = bpf_queue_stack(map);
 
-	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
-	 * so the programs (can be more than one that used this map) were
-	 * disconnected from events. Wait for outstanding critical sections in
-	 * these programs to complete
-	 */
-	synchronize_rcu();
-
 	bpf_map_area_free(qs);
 }
 
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index a09922f656e4..3625c4fcc65c 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -96,8 +96,6 @@ static void reuseport_array_free(struct bpf_map *map)
 	struct sock *sk;
 	u32 i;
 
-	synchronize_rcu();
-
 	/*
 	 * ops->map_*_elem() will not be able to access this
 	 * array now. Hence, this function only races with
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index dbf37aff4827..13a8d3967e07 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -215,13 +215,6 @@ static void ringbuf_map_free(struct bpf_map *map)
 {
 	struct bpf_ringbuf_map *rb_map;
 
-	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
-	 * so the programs (can be more than one that used this map) were
-	 * disconnected from events. Wait for outstanding critical sections in
-	 * these programs to complete
-	 */
-	synchronize_rcu();
-
 	rb_map = container_of(map, struct bpf_ringbuf_map, map);
 	bpf_ringbuf_free(rb_map->rb);
 	kfree(rb_map);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 27dc9b1b08a5..071f98d0f7c6 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -604,9 +604,6 @@ static void stack_map_free(struct bpf_map *map)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
 
-	/* wait for bpf programs to complete before freeing stack map */
-	synchronize_rcu();
-
 	bpf_map_area_free(smap->elems);
 	pcpu_freelist_destroy(&smap->freelist);
 	bpf_map_area_free(smap);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
  2020-06-30  4:33 ` [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu Alexei Starovoitov
@ 2020-06-30  4:33 ` Alexei Starovoitov
  2020-06-30 22:04   ` Paul E. McKenney
  2020-06-30 23:26   ` Daniel Borkmann
  2020-06-30  4:33 ` [PATCH v5 bpf-next 3/5] bpf: Add bpf_copy_from_user() helper Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30  4:33 UTC (permalink / raw)
  To: davem; +Cc: daniel, paulmck, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.

The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.

There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is running
from bpf trampoline. Those programs can use bpf arrays and preallocated hash/lru
maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();

Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.

This is the first step of introducing sleepable progs.

After that dynamically allocated hash maps can be allowed. All map elements
would have to be rcu_trace protected instead of normal rcu.
per-cpu maps will be allowed. Either via the following pattern:
void *elem = bpf_map_lookup_elem(map, key);
if (elem) {
   // access elem
   bpf_map_release_elem(map, elem);
}
where modified lookup() helper will do migrate_disable() and
new bpf_map_release_elem() will do corresponding migrate_enable().
Or explicit bpf_migrate_disable/enable() helpers will be introduced.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
---
 arch/x86/net/bpf_jit_comp.c    | 32 ++++++++++++------
 include/linux/bpf.h            |  3 ++
 include/uapi/linux/bpf.h       |  8 +++++
 init/Kconfig                   |  1 +
 kernel/bpf/arraymap.c          |  1 +
 kernel/bpf/hashtab.c           | 12 +++----
 kernel/bpf/syscall.c           | 13 +++++--
 kernel/bpf/trampoline.c        | 28 +++++++++++++--
 kernel/bpf/verifier.c          | 62 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |  8 +++++
 10 files changed, 144 insertions(+), 24 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 42b6709e6dc7..7d9ea7b41c71 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1379,10 +1379,15 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
 	u8 *prog = *pprog;
 	int cnt = 0;
 
-	if (emit_call(&prog, __bpf_prog_enter, prog))
-		return -EINVAL;
-	/* remember prog start time returned by __bpf_prog_enter */
-	emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0);
+	if (p->aux->sleepable) {
+		if (emit_call(&prog, __bpf_prog_enter_sleepable, prog))
+			return -EINVAL;
+	} else {
+		if (emit_call(&prog, __bpf_prog_enter, prog))
+			return -EINVAL;
+		/* remember prog start time returned by __bpf_prog_enter */
+		emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0);
+	}
 
 	/* arg1: lea rdi, [rbp - stack_size] */
 	EMIT4(0x48, 0x8D, 0x7D, -stack_size);
@@ -1402,13 +1407,18 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
 	if (mod_ret)
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
 
-	/* arg1: mov rdi, progs[i] */
-	emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32,
-		       (u32) (long) p);
-	/* arg2: mov rsi, rbx <- start time in nsec */
-	emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6);
-	if (emit_call(&prog, __bpf_prog_exit, prog))
-		return -EINVAL;
+	if (p->aux->sleepable) {
+		if (emit_call(&prog, __bpf_prog_exit_sleepable, prog))
+			return -EINVAL;
+	} else {
+		/* arg1: mov rdi, progs[i] */
+		emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32,
+			       (u32) (long) p);
+		/* arg2: mov rsi, rbx <- start time in nsec */
+		emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6);
+		if (emit_call(&prog, __bpf_prog_exit, prog))
+			return -EINVAL;
+	}
 
 	*pprog = prog;
 	return 0;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3d2ade703a35..e2b1581b2195 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -495,6 +495,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
 /* these two functions are called from generated trampoline */
 u64 notrace __bpf_prog_enter(void);
 void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start);
+void notrace __bpf_prog_enter_sleepable(void);
+void notrace __bpf_prog_exit_sleepable(void);
 
 struct bpf_ksym {
 	unsigned long		 start;
@@ -687,6 +689,7 @@ struct bpf_prog_aux {
 	bool offload_requested;
 	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
 	bool func_proto_unreliable;
+	bool sleepable;
 	enum bpf_tramp_prog_type trampoline_prog_type;
 	struct bpf_trampoline *trampoline;
 	struct hlist_node tramp_hlist;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0cb8ec948816..73f9e3f84b77 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -332,6 +332,14 @@ enum bpf_link_type {
 /* The verifier internal test flag. Behavior is undefined */
 #define BPF_F_TEST_STATE_FREQ	(1U << 3)
 
+/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
+ * restrict map and helper usage for such programs. Sleepable BPF programs can
+ * only be attached to hooks where kernel execution context allows sleeping.
+ * Such programs are allowed to use helpers that may sleep like
+ * bpf_copy_from_user().
+ */
+#define BPF_F_SLEEPABLE		(1U << 4)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * two extensions:
  *
diff --git a/init/Kconfig b/init/Kconfig
index a46aa8f3174d..62687583f822 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1663,6 +1663,7 @@ config BPF_SYSCALL
 	bool "Enable bpf() system call"
 	select BPF
 	select IRQ_WORK
+	select TASKS_TRACE_RCU
 	default n
 	help
 	  Enable the bpf() system call that allows to manipulate eBPF
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index c66e8273fccd..b07abcd58785 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -10,6 +10,7 @@
 #include <linux/filter.h>
 #include <linux/perf_event.h>
 #include <uapi/linux/btf.h>
+#include <linux/rcupdate_trace.h>
 
 #include "map_in_map.h"
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d4378d7d442b..65a7919e189d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -9,6 +9,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
 #include <uapi/linux/btf.h>
+#include <linux/rcupdate_trace.h>
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
@@ -577,8 +578,7 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
 	struct htab_elem *l;
 	u32 hash, key_size;
 
-	/* Must be called with rcu_read_lock. */
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
 
 	key_size = map->key_size;
 
@@ -935,7 +935,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		/* unknown flags */
 		return -EINVAL;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
 
 	key_size = map->key_size;
 
@@ -1026,7 +1026,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 		/* unknown flags */
 		return -EINVAL;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
 
 	key_size = map->key_size;
 
@@ -1214,7 +1214,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	u32 hash, key_size;
 	int ret = -ENOENT;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
 
 	key_size = map->key_size;
 
@@ -1246,7 +1246,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 	u32 hash, key_size;
 	int ret = -ENOENT;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
 
 	key_size = map->key_size;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8da159936bab..782b2b029539 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -29,6 +29,7 @@
 #include <linux/bpf_lsm.h>
 #include <linux/poll.h>
 #include <linux/bpf-netns.h>
+#include <linux/rcupdate_trace.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -1728,10 +1729,14 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
 	btf_put(prog->aux->btf);
 	bpf_prog_free_linfo(prog);
 
-	if (deferred)
-		call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
-	else
+	if (deferred) {
+		if (prog->aux->sleepable)
+			call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
+		else
+			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+	} else {
 		__bpf_prog_put_rcu(&prog->aux->rcu);
+	}
 }
 
 static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
@@ -2096,6 +2101,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT |
 				 BPF_F_ANY_ALIGNMENT |
 				 BPF_F_TEST_STATE_FREQ |
+				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32))
 		return -EINVAL;
 
@@ -2151,6 +2157,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	}
 
 	prog->aux->offload_requested = !!attr->prog_ifindex;
+	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 9be85aa4ec5f..c2b76545153c 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -7,6 +7,8 @@
 #include <linux/rbtree_latch.h>
 #include <linux/perf_event.h>
 #include <linux/btf.h>
+#include <linux/rcupdate_trace.h>
+#include <linux/rcupdate_wait.h>
 
 /* dummy _ops. The verifier will operate on target program's ops. */
 const struct bpf_verifier_ops bpf_extension_verifier_ops = {
@@ -210,9 +212,12 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
 	 * updates to trampoline would change the code from underneath the
 	 * preempted task. Hence wait for tasks to voluntarily schedule or go
 	 * to userspace.
+	 * The same trampoline can hold both sleepable and non-sleepable progs.
+	 * synchronize_rcu_tasks_trace() is needed to make sure all sleepable
+	 * programs finish executing.
+	 * Wait for these two grace periods together.
 	 */
-
-	synchronize_rcu_tasks();
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
 
 	err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
 					  &tr->func.model, flags, tprogs,
@@ -344,7 +349,14 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
 	if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT])))
 		goto out;
 	bpf_image_ksym_del(&tr->ksym);
-	/* wait for tasks to get out of trampoline before freeing it */
+	/* This code will be executed when all bpf progs (both sleepable and
+	 * non-sleepable) went through
+	 * bpf_prog_put()->call_rcu[_tasks_trace]()->bpf_prog_free_deferred().
+	 * Hence no need for another synchronize_rcu_tasks_trace() here,
+	 * but synchronize_rcu_tasks() is still needed, since trampoline
+	 * may not have had any sleepable programs and we need to wait
+	 * for tasks to get out of trampoline code before freeing it.
+	 */
 	synchronize_rcu_tasks();
 	bpf_jit_free_exec(tr->image);
 	hlist_del(&tr->hlist);
@@ -394,6 +406,16 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
 	rcu_read_unlock();
 }
 
+void notrace __bpf_prog_enter_sleepable(void)
+{
+	rcu_read_lock_trace();
+}
+
+void notrace __bpf_prog_exit_sleepable(void)
+{
+	rcu_read_unlock_trace();
+}
+
 int __weak
 arch_prepare_bpf_trampoline(void *image, void *image_end,
 			    const struct btf_func_model *m, u32 flags,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7de98906ddf4..05aa990ba9a4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9112,6 +9112,23 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
+	if (prog->aux->sleepable)
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_HASH:
+		case BPF_MAP_TYPE_LRU_HASH:
+		case BPF_MAP_TYPE_ARRAY:
+			if (!is_preallocated_map(map)) {
+				verbose(env,
+					"Sleepable programs can only use preallocated hash maps\n");
+				return -EINVAL;
+			}
+			break;
+		default:
+			verbose(env,
+				"Sleepable programs can only use array and hash maps\n");
+			return -EINVAL;
+		}
+
 	return 0;
 }
 
@@ -10722,6 +10739,22 @@ static int check_attach_modify_return(struct bpf_prog *prog, unsigned long addr)
 	return -EINVAL;
 }
 
+/* list of non-sleepable kernel functions that are otherwise
+ * available to attach by bpf_lsm or fmod_ret progs.
+ */
+static int check_sleepable_blacklist(unsigned long addr)
+{
+#ifdef CONFIG_BPF_LSM
+	if (addr == (long)bpf_lsm_task_free)
+		return -EINVAL;
+#endif
+#ifdef CONFIG_SECURITY
+	if (addr == (long)security_task_free)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
 static int check_attach_btf_id(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog;
@@ -10739,6 +10772,12 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	long addr;
 	u64 key;
 
+	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
+	    prog->type != BPF_PROG_TYPE_LSM) {
+		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
+		return -EINVAL;
+	}
+
 	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
 		return check_struct_ops_btf_id(env);
 
@@ -10952,8 +10991,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 			if (ret)
 				verbose(env, "%s() is not modifiable\n",
 					prog->aux->attach_func_name);
+		} else if (prog->aux->sleepable) {
+			switch (prog->type) {
+			case BPF_PROG_TYPE_TRACING:
+				/* fentry/fexit progs can be sleepable only if they are
+				 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
+				 */
+				ret = check_attach_modify_return(prog, addr);
+				if (!ret)
+					ret = check_sleepable_blacklist(addr);
+				break;
+			case BPF_PROG_TYPE_LSM:
+				/* LSM progs check that they are attached to bpf_lsm_*() funcs
+				 * which are sleepable too.
+				 */
+				ret = check_sleepable_blacklist(addr);
+				break;
+			default:
+				break;
+			}
+			if (ret)
+				verbose(env, "%s is not sleepable\n",
+					prog->aux->attach_func_name);
 		}
-
 		if (ret)
 			goto out;
 		tr->func.addr = (void *)addr;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0cb8ec948816..73f9e3f84b77 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -332,6 +332,14 @@ enum bpf_link_type {
 /* The verifier internal test flag. Behavior is undefined */
 #define BPF_F_TEST_STATE_FREQ	(1U << 3)
 
+/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
+ * restrict map and helper usage for such programs. Sleepable BPF programs can
+ * only be attached to hooks where kernel execution context allows sleeping.
+ * Such programs are allowed to use helpers that may sleep like
+ * bpf_copy_from_user().
+ */
+#define BPF_F_SLEEPABLE		(1U << 4)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * two extensions:
  *
-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 bpf-next 3/5] bpf: Add bpf_copy_from_user() helper.
  2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
  2020-06-30  4:33 ` [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu Alexei Starovoitov
  2020-06-30  4:33 ` [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs Alexei Starovoitov
@ 2020-06-30  4:33 ` Alexei Starovoitov
  2020-06-30 11:41   ` KP Singh
  2020-06-30  4:33 ` [PATCH v5 bpf-next 4/5] libbpf: support sleepable progs Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30  4:33 UTC (permalink / raw)
  To: davem; +Cc: daniel, paulmck, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Sleepable BPF programs can now use copy_from_user() to access user memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 11 ++++++++++-
 kernel/bpf/helpers.c           | 22 ++++++++++++++++++++++
 kernel/trace/bpf_trace.c       |  2 ++
 tools/include/uapi/linux/bpf.h | 11 ++++++++++-
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e2b1581b2195..c9f27d5fdb7c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1657,6 +1657,7 @@ extern const struct bpf_func_proto bpf_skc_to_tcp_sock_proto;
 extern const struct bpf_func_proto bpf_skc_to_tcp_timewait_sock_proto;
 extern const struct bpf_func_proto bpf_skc_to_tcp_request_sock_proto;
 extern const struct bpf_func_proto bpf_skc_to_udp6_sock_proto;
+extern const struct bpf_func_proto bpf_copy_from_user_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 73f9e3f84b77..6b347454dedc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3293,6 +3293,13 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
+ * 	Description
+ * 		Read *size* bytes from user space address *user_ptr* and store
+ * 		the data in *dst*. This is a wrapper of copy_from_user().
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3435,7 +3442,9 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(copy_from_user),		\
+	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index be43ab3e619f..5cc7425ee476 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -601,6 +601,28 @@ const struct bpf_func_proto bpf_event_output_data_proto =  {
 	.arg5_type      = ARG_CONST_SIZE_OR_ZERO,
 };
 
+BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
+	   const void __user *, user_ptr)
+{
+	int ret = copy_from_user(dst, user_ptr, size);
+
+	if (unlikely(ret)) {
+		memset(dst, 0, size);
+		ret = -EFAULT;
+	}
+
+	return ret;
+}
+
+const struct bpf_func_proto bpf_copy_from_user_proto = {
+	.func		= bpf_copy_from_user,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 5d59dda5f661..96121fa7f7e6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1137,6 +1137,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_ringbuf_query_proto;
 	case BPF_FUNC_jiffies64:
 		return &bpf_jiffies64_proto;
+	case BPF_FUNC_copy_from_user:
+		return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
 	default:
 		return NULL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 73f9e3f84b77..6b347454dedc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3293,6 +3293,13 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
+ * 	Description
+ * 		Read *size* bytes from user space address *user_ptr* and store
+ * 		the data in *dst*. This is a wrapper of copy_from_user().
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3435,7 +3442,9 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(copy_from_user),		\
+	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 bpf-next 4/5] libbpf: support sleepable progs
  2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2020-06-30  4:33 ` [PATCH v5 bpf-next 3/5] bpf: Add bpf_copy_from_user() helper Alexei Starovoitov
@ 2020-06-30  4:33 ` Alexei Starovoitov
  2020-06-30  4:33 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Add sleepable tests Alexei Starovoitov
  2020-06-30  5:14 ` [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Andrii Nakryiko
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30  4:33 UTC (permalink / raw)
  To: davem; +Cc: daniel, paulmck, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Pass request to load program as sleepable via ".s" suffix in the section name.
If it happens in the future that all map types and helpers are allowed with
BPF_F_SLEEPABLE flag "fmod_ret/" and "lsm/" can be aliased to "fmod_ret.s/" and
"lsm.s/" to make all lsm and fmod_ret programs sleepable by default. The fentry
and fexit programs would always need to have sleepable vs non-sleepable
distinction, since not all fentry/fexit progs will be attached to sleepable
kernel functions.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@google.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/libbpf.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4ea7f4f1a691..c2f054fde30f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -209,6 +209,7 @@ struct bpf_sec_def {
 	bool is_exp_attach_type_optional;
 	bool is_attachable;
 	bool is_attach_btf;
+	bool is_sleepable;
 	attach_fn_t attach_fn;
 };
 
@@ -5626,6 +5627,8 @@ __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz,
 			/* couldn't guess, but user might manually specify */
 			continue;
 
+		if (prog->sec_def->is_sleepable)
+			prog->prog_flags |= BPF_F_SLEEPABLE;
 		bpf_program__set_type(prog, prog->sec_def->prog_type);
 		bpf_program__set_expected_attach_type(prog,
 				prog->sec_def->expected_attach_type);
@@ -6893,6 +6896,21 @@ static const struct bpf_sec_def section_defs[] = {
 		.expected_attach_type = BPF_TRACE_FEXIT,
 		.is_attach_btf = true,
 		.attach_fn = attach_trace),
+	SEC_DEF("fentry.s/", TRACING,
+		.expected_attach_type = BPF_TRACE_FENTRY,
+		.is_attach_btf = true,
+		.is_sleepable = true,
+		.attach_fn = attach_trace),
+	SEC_DEF("fmod_ret.s/", TRACING,
+		.expected_attach_type = BPF_MODIFY_RETURN,
+		.is_attach_btf = true,
+		.is_sleepable = true,
+		.attach_fn = attach_trace),
+	SEC_DEF("fexit.s/", TRACING,
+		.expected_attach_type = BPF_TRACE_FEXIT,
+		.is_attach_btf = true,
+		.is_sleepable = true,
+		.attach_fn = attach_trace),
 	SEC_DEF("freplace/", EXT,
 		.is_attach_btf = true,
 		.attach_fn = attach_trace),
@@ -6900,6 +6918,11 @@ static const struct bpf_sec_def section_defs[] = {
 		.is_attach_btf = true,
 		.expected_attach_type = BPF_LSM_MAC,
 		.attach_fn = attach_lsm),
+	SEC_DEF("lsm.s/", LSM,
+		.is_attach_btf = true,
+		.is_sleepable = true,
+		.expected_attach_type = BPF_LSM_MAC,
+		.attach_fn = attach_lsm),
 	SEC_DEF("iter/", TRACING,
 		.expected_attach_type = BPF_TRACE_ITER,
 		.is_attach_btf = true,
@@ -7614,7 +7637,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
 
 		prog->prog_ifindex = attr->ifindex;
 		prog->log_level = attr->log_level;
-		prog->prog_flags = attr->prog_flags;
+		prog->prog_flags |= attr->prog_flags;
 		if (!first_prog)
 			first_prog = prog;
 	}
-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 bpf-next 5/5] selftests/bpf: Add sleepable tests
  2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2020-06-30  4:33 ` [PATCH v5 bpf-next 4/5] libbpf: support sleepable progs Alexei Starovoitov
@ 2020-06-30  4:33 ` Alexei Starovoitov
  2020-06-30  5:14 ` [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Andrii Nakryiko
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30  4:33 UTC (permalink / raw)
  To: davem; +Cc: daniel, paulmck, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Modify few tests to sanity test sleepable bpf functionality.

Running 'bench trig-fentry-sleep' vs 'bench trig-fentry' and 'perf report':
sleepable with SRCU:
   3.86%  bench     [k] __srcu_read_unlock
   3.22%  bench     [k] __srcu_read_lock
   0.92%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
   0.50%  bench     [k] bpf_trampoline_10297
   0.26%  bench     [k] __bpf_prog_exit_sleepable
   0.21%  bench     [k] __bpf_prog_enter_sleepable

sleepable with RCU_TRACE:
   0.79%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
   0.72%  bench     [k] bpf_trampoline_10381
   0.31%  bench     [k] __bpf_prog_exit_sleepable
   0.29%  bench     [k] __bpf_prog_enter_sleepable

non-sleepable with RCU:
   0.88%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry
   0.84%  bench     [k] bpf_trampoline_10297
   0.13%  bench     [k] __bpf_prog_enter
   0.12%  bench     [k] __bpf_prog_exit

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@google.com>
---
 tools/testing/selftests/bpf/bench.c           |  2 +
 .../selftests/bpf/benchs/bench_trigger.c      | 17 +++++
 .../selftests/bpf/prog_tests/test_lsm.c       |  9 +++
 tools/testing/selftests/bpf/progs/lsm.c       | 66 ++++++++++++++++++-
 .../selftests/bpf/progs/trigger_bench.c       |  7 ++
 5 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 944ad4721c83..1a427685a8a8 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -317,6 +317,7 @@ extern const struct bench bench_trig_tp;
 extern const struct bench bench_trig_rawtp;
 extern const struct bench bench_trig_kprobe;
 extern const struct bench bench_trig_fentry;
+extern const struct bench bench_trig_fentry_sleep;
 extern const struct bench bench_trig_fmodret;
 extern const struct bench bench_rb_libbpf;
 extern const struct bench bench_rb_custom;
@@ -338,6 +339,7 @@ static const struct bench *benchs[] = {
 	&bench_trig_rawtp,
 	&bench_trig_kprobe,
 	&bench_trig_fentry,
+	&bench_trig_fentry_sleep,
 	&bench_trig_fmodret,
 	&bench_rb_libbpf,
 	&bench_rb_custom,
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 49c22832f216..2a0b6c9885a4 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -90,6 +90,12 @@ static void trigger_fentry_setup()
 	attach_bpf(ctx.skel->progs.bench_trigger_fentry);
 }
 
+static void trigger_fentry_sleep_setup()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.bench_trigger_fentry_sleep);
+}
+
 static void trigger_fmodret_setup()
 {
 	setup_ctx();
@@ -155,6 +161,17 @@ const struct bench bench_trig_fentry = {
 	.report_final = hits_drops_report_final,
 };
 
+const struct bench bench_trig_fentry_sleep = {
+	.name = "trig-fentry-sleep",
+	.validate = trigger_validate,
+	.setup = trigger_fentry_sleep_setup,
+	.producer_thread = trigger_producer,
+	.consumer_thread = trigger_consumer,
+	.measure = trigger_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
 const struct bench bench_trig_fmodret = {
 	.name = "trig-fmodret",
 	.validate = trigger_validate,
diff --git a/tools/testing/selftests/bpf/prog_tests/test_lsm.c b/tools/testing/selftests/bpf/prog_tests/test_lsm.c
index b17eb2045c1d..6ab29226c99b 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_lsm.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_lsm.c
@@ -10,6 +10,7 @@
 #include <unistd.h>
 #include <malloc.h>
 #include <stdlib.h>
+#include <unistd.h>
 
 #include "lsm.skel.h"
 
@@ -55,6 +56,7 @@ void test_test_lsm(void)
 {
 	struct lsm *skel = NULL;
 	int err, duration = 0;
+	int buf = 1234;
 
 	skel = lsm__open_and_load();
 	if (CHECK(!skel, "skel_load", "lsm skeleton failed\n"))
@@ -81,6 +83,13 @@ void test_test_lsm(void)
 	CHECK(skel->bss->mprotect_count != 1, "mprotect_count",
 	      "mprotect_count = %d\n", skel->bss->mprotect_count);
 
+	syscall(__NR_setdomainname, &buf, -2L);
+	syscall(__NR_setdomainname, 0, -3L);
+	syscall(__NR_setdomainname, ~0L, -4L);
+
+	CHECK(skel->bss->copy_test != 3, "copy_test",
+	      "copy_test = %d\n", skel->bss->copy_test);
+
 close_prog:
 	lsm__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
index b4598d4bc4f7..49fa6ca99755 100644
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -9,16 +9,41 @@
 #include <bpf/bpf_tracing.h>
 #include  <errno.h>
 
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} hash SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} lru_hash SEC(".maps");
+
 char _license[] SEC("license") = "GPL";
 
 int monitored_pid = 0;
 int mprotect_count = 0;
 int bprm_count = 0;
 
-SEC("lsm/file_mprotect")
+SEC("lsm.s/file_mprotect")
 int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	     unsigned long reqprot, unsigned long prot, int ret)
 {
+	char args[64];
+	__u32 key = 0;
+	__u64 *value;
+
 	if (ret != 0)
 		return ret;
 
@@ -28,6 +53,18 @@ int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
 		    vma->vm_end >= vma->vm_mm->start_stack);
 
+	bpf_copy_from_user(args, sizeof(args), (void *)vma->vm_mm->arg_start);
+
+	value = bpf_map_lookup_elem(&array, &key);
+	if (value)
+		*value = 0;
+	value = bpf_map_lookup_elem(&hash, &key);
+	if (value)
+		*value = 0;
+	value = bpf_map_lookup_elem(&lru_hash, &key);
+	if (value)
+		*value = 0;
+
 	if (is_stack && monitored_pid == pid) {
 		mprotect_count++;
 		ret = -EPERM;
@@ -36,7 +73,7 @@ int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	return ret;
 }
 
-SEC("lsm/bprm_committed_creds")
+SEC("lsm.s/bprm_committed_creds")
 int BPF_PROG(test_void_hook, struct linux_binprm *bprm)
 {
 	__u32 pid = bpf_get_current_pid_tgid() >> 32;
@@ -46,3 +83,28 @@ int BPF_PROG(test_void_hook, struct linux_binprm *bprm)
 
 	return 0;
 }
+SEC("lsm/task_free") /* lsm/ is ok, lsm.s/ fails */
+int BPF_PROG(test_task_free, struct task_struct *task)
+{
+	return 0;
+}
+
+int copy_test = 0;
+
+SEC("fentry.s/__x64_sys_setdomainname")
+int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs)
+{
+	void *ptr = (void *)PT_REGS_PARM1(regs);
+	int len = PT_REGS_PARM2(regs);
+	int buf = 0;
+	long ret;
+
+	ret = bpf_copy_from_user(&buf, sizeof(buf), ptr);
+	if (len == -2 && ret == 0 && buf == 1234)
+		copy_test++;
+	if (len == -3 && ret == -EFAULT)
+		copy_test++;
+	if (len == -4 && ret == -EFAULT)
+		copy_test++;
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
index 8b36b6640e7e..9a4d09590b3d 100644
--- a/tools/testing/selftests/bpf/progs/trigger_bench.c
+++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
@@ -39,6 +39,13 @@ int bench_trigger_fentry(void *ctx)
 	return 0;
 }
 
+SEC("fentry.s/__x64_sys_getpgid")
+int bench_trigger_fentry_sleep(void *ctx)
+{
+	__sync_add_and_fetch(&hits, 1);
+	return 0;
+}
+
 SEC("fmod_ret/__x64_sys_getpgid")
 int bench_trigger_fmodret(void *ctx)
 {
-- 
2.23.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs
  2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2020-06-30  4:33 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Add sleepable tests Alexei Starovoitov
@ 2020-06-30  5:14 ` Andrii Nakryiko
  5 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2020-06-30  5:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Paul E . McKenney, Networking,
	bpf, Kernel Team

On Mon, Jun 29, 2020 at 9:34 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> v4->v5:
> - addressed Andrii's feedback.
>
> v3->v4:
> - fixed synchronize_rcu_tasks_trace() usage and accelerated with synchronize_rcu_mult().
> - removed redundant synchronize_rcu(). Otherwise it won't be clear why
>   synchronize_rcu_tasks_trace() is not needed in free_map callbacks.
> - improved test coverage.
>
> v2->v3:
> - switched to rcu_trace
> - added bpf_copy_from_user
>
> Here is 'perf report' differences:
> sleepable with SRCU:
>    3.86%  bench     [k] __srcu_read_unlock
>    3.22%  bench     [k] __srcu_read_lock
>    0.92%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
>    0.50%  bench     [k] bpf_trampoline_10297
>    0.26%  bench     [k] __bpf_prog_exit_sleepable
>    0.21%  bench     [k] __bpf_prog_enter_sleepable
>
> sleepable with RCU_TRACE:
>    0.79%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
>    0.72%  bench     [k] bpf_trampoline_10381
>    0.31%  bench     [k] __bpf_prog_exit_sleepable
>    0.29%  bench     [k] __bpf_prog_enter_sleepable
>
> non-sleepable with RCU:
>    0.88%  bench     [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry
>    0.84%  bench     [k] bpf_trampoline_10297
>    0.13%  bench     [k] __bpf_prog_enter
>    0.12%  bench     [k] __bpf_prog_exit
>
> Happy to confirm that rcu_trace overhead is negligible.
>
> v1->v2:
> - split fmod_ret fix into separate patch
> - added blacklist
>
> v1:
> This patch set introduces the minimal viable support for sleepable bpf programs.
> In this patch only fentry/fexit/fmod_ret and lsm progs can be sleepable.
> Only array and pre-allocated hash and lru maps allowed.
>
> Alexei Starovoitov (5):
>   bpf: Remove redundant synchronize_rcu.
>   bpf: Introduce sleepable BPF programs
>   bpf: Add bpf_copy_from_user() helper.
>   libbpf: support sleepable progs
>   selftests/bpf: Add sleepable tests
>
>  arch/x86/net/bpf_jit_comp.c                   | 32 +++++----
>  include/linux/bpf.h                           |  4 ++
>  include/uapi/linux/bpf.h                      | 19 +++++-
>  init/Kconfig                                  |  1 +
>  kernel/bpf/arraymap.c                         | 10 +--
>  kernel/bpf/hashtab.c                          | 20 +++---
>  kernel/bpf/helpers.c                          | 22 +++++++
>  kernel/bpf/lpm_trie.c                         |  5 --
>  kernel/bpf/queue_stack_maps.c                 |  7 --
>  kernel/bpf/reuseport_array.c                  |  2 -
>  kernel/bpf/ringbuf.c                          |  7 --
>  kernel/bpf/stackmap.c                         |  3 -
>  kernel/bpf/syscall.c                          | 13 +++-
>  kernel/bpf/trampoline.c                       | 28 +++++++-
>  kernel/bpf/verifier.c                         | 62 ++++++++++++++++-
>  kernel/trace/bpf_trace.c                      |  2 +
>  tools/include/uapi/linux/bpf.h                | 19 +++++-
>  tools/lib/bpf/libbpf.c                        | 25 ++++++-
>  tools/testing/selftests/bpf/bench.c           |  2 +
>  .../selftests/bpf/benchs/bench_trigger.c      | 17 +++++
>  .../selftests/bpf/prog_tests/test_lsm.c       |  9 +++
>  tools/testing/selftests/bpf/progs/lsm.c       | 66 ++++++++++++++++++-
>  .../selftests/bpf/progs/trigger_bench.c       |  7 ++
>  23 files changed, 315 insertions(+), 67 deletions(-)
>
> --
> 2.23.0
>

For the series:

Acked-by: Andrii Nakryiko <andriin@fb.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 3/5] bpf: Add bpf_copy_from_user() helper.
  2020-06-30  4:33 ` [PATCH v5 bpf-next 3/5] bpf: Add bpf_copy_from_user() helper Alexei Starovoitov
@ 2020-06-30 11:41   ` KP Singh
  0 siblings, 0 replies; 20+ messages in thread
From: KP Singh @ 2020-06-30 11:41 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, daniel, paulmck, netdev, bpf, kernel-team

On 29-Jun 21:33, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Sleepable BPF programs can now use copy_from_user() to access user memory.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: Andrii Nakryiko <andriin@fb.com>

Really looking forward!

Acked-by: KP Singh <kpsingh@google.com>

> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 11 ++++++++++-
>  kernel/bpf/helpers.c           | 22 ++++++++++++++++++++++
>  kernel/trace/bpf_trace.c       |  2 ++
>  tools/include/uapi/linux/bpf.h | 11 ++++++++++-
>  5 files changed, 45 insertions(+), 2 deletions(-)
> 

[...]

> +	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu.
  2020-06-30  4:33 ` [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu Alexei Starovoitov
@ 2020-06-30 20:55   ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2020-06-30 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, daniel, netdev, bpf, kernel-team

On Mon, Jun 29, 2020 at 09:33:39PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> bpf_free_used_maps() or close(map_fd) will trigger map_free callback.
> bpf_free_used_maps() is called after bpf prog is no longer executing:
> bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps.
> Hence there is no need to call synchronize_rcu() to protect map elements.
> 
> Note that hash_of_maps and array_of_maps update/delete inner maps via
> sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu().
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: Andrii Nakryiko <andriin@fb.com>

From an RCU perspective:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/bpf/arraymap.c         | 9 ---------
>  kernel/bpf/hashtab.c          | 8 +++-----
>  kernel/bpf/lpm_trie.c         | 5 -----
>  kernel/bpf/queue_stack_maps.c | 7 -------
>  kernel/bpf/reuseport_array.c  | 2 --
>  kernel/bpf/ringbuf.c          | 7 -------
>  kernel/bpf/stackmap.c         | 3 ---
>  7 files changed, 3 insertions(+), 38 deletions(-)
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index ec5cd11032aa..c66e8273fccd 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -386,13 +386,6 @@ static void array_map_free(struct bpf_map *map)
>  {
>  	struct bpf_array *array = container_of(map, struct bpf_array, map);
>  
> -	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> -	 * so the programs (can be more than one that used this map) were
> -	 * disconnected from events. Wait for outstanding programs to complete
> -	 * and free the array
> -	 */
> -	synchronize_rcu();
> -
>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
>  		bpf_array_free_percpu(array);
>  
> @@ -546,8 +539,6 @@ static void fd_array_map_free(struct bpf_map *map)
>  	struct bpf_array *array = container_of(map, struct bpf_array, map);
>  	int i;
>  
> -	synchronize_rcu();
> -
>  	/* make sure it's empty */
>  	for (i = 0; i < array->map.max_entries; i++)
>  		BUG_ON(array->ptrs[i] != NULL);
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index acd06081d81d..d4378d7d442b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1290,12 +1290,10 @@ static void htab_map_free(struct bpf_map *map)
>  {
>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>  
> -	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> -	 * so the programs (can be more than one that used this map) were
> -	 * disconnected from events. Wait for outstanding critical sections in
> -	 * these programs to complete
> +	/* bpf_free_used_maps() or close(map_fd) will trigger this map_free callback.
> +	 * bpf_free_used_maps() is called after bpf prog is no longer executing.
> +	 * There is no need to synchronize_rcu() here to protect map elements.
>  	 */
> -	synchronize_rcu();
>  
>  	/* some of free_htab_elem() callbacks for elements of this map may
>  	 * not have executed. Wait for them.
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 1abd4e3f906d..44474bf3ab7a 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -589,11 +589,6 @@ static void trie_free(struct bpf_map *map)
>  	struct lpm_trie_node __rcu **slot;
>  	struct lpm_trie_node *node;
>  
> -	/* Wait for outstanding programs to complete
> -	 * update/lookup/delete/get_next_key and free the trie.
> -	 */
> -	synchronize_rcu();
> -
>  	/* Always start at the root and walk down to a node that has no
>  	 * children. Then free that node, nullify its reference in the parent
>  	 * and start over.
> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
> index 80c66a6d7c54..44184f82916a 100644
> --- a/kernel/bpf/queue_stack_maps.c
> +++ b/kernel/bpf/queue_stack_maps.c
> @@ -101,13 +101,6 @@ static void queue_stack_map_free(struct bpf_map *map)
>  {
>  	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>  
> -	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> -	 * so the programs (can be more than one that used this map) were
> -	 * disconnected from events. Wait for outstanding critical sections in
> -	 * these programs to complete
> -	 */
> -	synchronize_rcu();
> -
>  	bpf_map_area_free(qs);
>  }
>  
> diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
> index a09922f656e4..3625c4fcc65c 100644
> --- a/kernel/bpf/reuseport_array.c
> +++ b/kernel/bpf/reuseport_array.c
> @@ -96,8 +96,6 @@ static void reuseport_array_free(struct bpf_map *map)
>  	struct sock *sk;
>  	u32 i;
>  
> -	synchronize_rcu();
> -
>  	/*
>  	 * ops->map_*_elem() will not be able to access this
>  	 * array now. Hence, this function only races with
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> index dbf37aff4827..13a8d3967e07 100644
> --- a/kernel/bpf/ringbuf.c
> +++ b/kernel/bpf/ringbuf.c
> @@ -215,13 +215,6 @@ static void ringbuf_map_free(struct bpf_map *map)
>  {
>  	struct bpf_ringbuf_map *rb_map;
>  
> -	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> -	 * so the programs (can be more than one that used this map) were
> -	 * disconnected from events. Wait for outstanding critical sections in
> -	 * these programs to complete
> -	 */
> -	synchronize_rcu();
> -
>  	rb_map = container_of(map, struct bpf_ringbuf_map, map);
>  	bpf_ringbuf_free(rb_map->rb);
>  	kfree(rb_map);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 27dc9b1b08a5..071f98d0f7c6 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -604,9 +604,6 @@ static void stack_map_free(struct bpf_map *map)
>  {
>  	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
>  
> -	/* wait for bpf programs to complete before freeing stack map */
> -	synchronize_rcu();
> -
>  	bpf_map_area_free(smap->elems);
>  	pcpu_freelist_destroy(&smap->freelist);
>  	bpf_map_area_free(smap);
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-06-30  4:33 ` [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs Alexei Starovoitov
@ 2020-06-30 22:04   ` Paul E. McKenney
  2020-06-30 23:26   ` Daniel Borkmann
  1 sibling, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2020-06-30 22:04 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, daniel, netdev, bpf, kernel-team

On Mon, Jun 29, 2020 at 09:33:40PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce sleepable BPF programs that can request such property for themselves
> via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
> to use helpers like bpf_copy_from_user() that might sleep. At present only
> fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
> when they are attached to kernel functions that are known to allow sleeping.
> 
> The non-sleepable programs are relying on implicit rcu_read_lock() and
> migrate_disable() to protect life time of programs, maps that they use and
> per-cpu kernel structures used to pass info between bpf programs and the
> kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
> migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
> should not be enclosed in migrate_disable() as well. Therefore
> rcu_read_lock_trace is used to protect the life time of sleepable progs.
> 
> There are many networking and tracing program types. In many cases the
> 'struct bpf_prog *' pointer itself is rcu protected within some other kernel
> data structure and the kernel code is using rcu_dereference() to load that
> program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
> Instead sleepable bpf programs are allowed with bpf trampoline only. The
> program pointers are hard-coded into generated assembly of bpf trampoline and
> synchronize_rcu_tasks_trace() is used to protect the life time of the program.
> The same trampoline can hold both sleepable and non-sleepable progs.
> 
> When rcu_read_lock_trace is held it means that some sleepable bpf program is running
> from bpf trampoline. Those programs can use bpf arrays and preallocated hash/lru
> maps. These map types are waiting on programs to complete via
> synchronize_rcu_tasks_trace();
> 
> Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
> synchronize_rcu_tasks() to wait for sleepable progs to finish and for
> trampoline assembly to finish.
> 
> This is the first step of introducing sleepable progs.
> 
> After that dynamically allocated hash maps can be allowed. All map elements
> would have to be rcu_trace protected instead of normal rcu.
> per-cpu maps will be allowed. Either via the following pattern:
> void *elem = bpf_map_lookup_elem(map, key);
> if (elem) {
>    // access elem
>    bpf_map_release_elem(map, elem);
> }
> where modified lookup() helper will do migrate_disable() and
> new bpf_map_release_elem() will do corresponding migrate_enable().
> Or explicit bpf_migrate_disable/enable() helpers will be introduced.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: Andrii Nakryiko <andriin@fb.com>

From an RCU viewpoint:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

I was going to suggest that BPF maps could just use RCU throughout,
but it appears that this would require an asynchronous wait for both
an RCU and an RCU tasks trace grace period.  Such a thing is possible,
but the price is a reference count, multiple rcu_head structures, and
with each rcu_head structure, a pointer to the reference count.  Which
just might not be worth it.

> ---
>  arch/x86/net/bpf_jit_comp.c    | 32 ++++++++++++------
>  include/linux/bpf.h            |  3 ++
>  include/uapi/linux/bpf.h       |  8 +++++
>  init/Kconfig                   |  1 +
>  kernel/bpf/arraymap.c          |  1 +
>  kernel/bpf/hashtab.c           | 12 +++----
>  kernel/bpf/syscall.c           | 13 +++++--
>  kernel/bpf/trampoline.c        | 28 +++++++++++++--
>  kernel/bpf/verifier.c          | 62 +++++++++++++++++++++++++++++++++-
>  tools/include/uapi/linux/bpf.h |  8 +++++
>  10 files changed, 144 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 42b6709e6dc7..7d9ea7b41c71 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1379,10 +1379,15 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
>  	u8 *prog = *pprog;
>  	int cnt = 0;
>  
> -	if (emit_call(&prog, __bpf_prog_enter, prog))
> -		return -EINVAL;
> -	/* remember prog start time returned by __bpf_prog_enter */
> -	emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0);
> +	if (p->aux->sleepable) {
> +		if (emit_call(&prog, __bpf_prog_enter_sleepable, prog))
> +			return -EINVAL;
> +	} else {
> +		if (emit_call(&prog, __bpf_prog_enter, prog))
> +			return -EINVAL;
> +		/* remember prog start time returned by __bpf_prog_enter */
> +		emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0);
> +	}
>  
>  	/* arg1: lea rdi, [rbp - stack_size] */
>  	EMIT4(0x48, 0x8D, 0x7D, -stack_size);
> @@ -1402,13 +1407,18 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
>  	if (mod_ret)
>  		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
>  
> -	/* arg1: mov rdi, progs[i] */
> -	emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32,
> -		       (u32) (long) p);
> -	/* arg2: mov rsi, rbx <- start time in nsec */
> -	emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6);
> -	if (emit_call(&prog, __bpf_prog_exit, prog))
> -		return -EINVAL;
> +	if (p->aux->sleepable) {
> +		if (emit_call(&prog, __bpf_prog_exit_sleepable, prog))
> +			return -EINVAL;
> +	} else {
> +		/* arg1: mov rdi, progs[i] */
> +		emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32,
> +			       (u32) (long) p);
> +		/* arg2: mov rsi, rbx <- start time in nsec */
> +		emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6);
> +		if (emit_call(&prog, __bpf_prog_exit, prog))
> +			return -EINVAL;
> +	}
>  
>  	*pprog = prog;
>  	return 0;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3d2ade703a35..e2b1581b2195 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -495,6 +495,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
>  /* these two functions are called from generated trampoline */
>  u64 notrace __bpf_prog_enter(void);
>  void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start);
> +void notrace __bpf_prog_enter_sleepable(void);
> +void notrace __bpf_prog_exit_sleepable(void);
>  
>  struct bpf_ksym {
>  	unsigned long		 start;
> @@ -687,6 +689,7 @@ struct bpf_prog_aux {
>  	bool offload_requested;
>  	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
>  	bool func_proto_unreliable;
> +	bool sleepable;
>  	enum bpf_tramp_prog_type trampoline_prog_type;
>  	struct bpf_trampoline *trampoline;
>  	struct hlist_node tramp_hlist;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0cb8ec948816..73f9e3f84b77 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -332,6 +332,14 @@ enum bpf_link_type {
>  /* The verifier internal test flag. Behavior is undefined */
>  #define BPF_F_TEST_STATE_FREQ	(1U << 3)
>  
> +/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
> + * restrict map and helper usage for such programs. Sleepable BPF programs can
> + * only be attached to hooks where kernel execution context allows sleeping.
> + * Such programs are allowed to use helpers that may sleep like
> + * bpf_copy_from_user().
> + */
> +#define BPF_F_SLEEPABLE		(1U << 4)
> +
>  /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
>   * two extensions:
>   *
> diff --git a/init/Kconfig b/init/Kconfig
> index a46aa8f3174d..62687583f822 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1663,6 +1663,7 @@ config BPF_SYSCALL
>  	bool "Enable bpf() system call"
>  	select BPF
>  	select IRQ_WORK
> +	select TASKS_TRACE_RCU
>  	default n
>  	help
>  	  Enable the bpf() system call that allows to manipulate eBPF
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index c66e8273fccd..b07abcd58785 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -10,6 +10,7 @@
>  #include <linux/filter.h>
>  #include <linux/perf_event.h>
>  #include <uapi/linux/btf.h>
> +#include <linux/rcupdate_trace.h>
>  
>  #include "map_in_map.h"
>  
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index d4378d7d442b..65a7919e189d 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -9,6 +9,7 @@
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
>  #include <uapi/linux/btf.h>
> +#include <linux/rcupdate_trace.h>
>  #include "percpu_freelist.h"
>  #include "bpf_lru_list.h"
>  #include "map_in_map.h"
> @@ -577,8 +578,7 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>  	struct htab_elem *l;
>  	u32 hash, key_size;
>  
> -	/* Must be called with rcu_read_lock. */
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>  
>  	key_size = map->key_size;
>  
> @@ -935,7 +935,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1026,7 +1026,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1214,7 +1214,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
>  	u32 hash, key_size;
>  	int ret = -ENOENT;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1246,7 +1246,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
>  	u32 hash, key_size;
>  	int ret = -ENOENT;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>  
>  	key_size = map->key_size;
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8da159936bab..782b2b029539 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -29,6 +29,7 @@
>  #include <linux/bpf_lsm.h>
>  #include <linux/poll.h>
>  #include <linux/bpf-netns.h>
> +#include <linux/rcupdate_trace.h>
>  
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>  			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
> @@ -1728,10 +1729,14 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>  	btf_put(prog->aux->btf);
>  	bpf_prog_free_linfo(prog);
>  
> -	if (deferred)
> -		call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> -	else
> +	if (deferred) {
> +		if (prog->aux->sleepable)
> +			call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
> +		else
> +			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> +	} else {
>  		__bpf_prog_put_rcu(&prog->aux->rcu);
> +	}
>  }
>  
>  static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
> @@ -2096,6 +2101,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>  	if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT |
>  				 BPF_F_ANY_ALIGNMENT |
>  				 BPF_F_TEST_STATE_FREQ |
> +				 BPF_F_SLEEPABLE |
>  				 BPF_F_TEST_RND_HI32))
>  		return -EINVAL;
>  
> @@ -2151,6 +2157,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
>  	}
>  
>  	prog->aux->offload_requested = !!attr->prog_ifindex;
> +	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
>  
>  	err = security_bpf_prog_alloc(prog->aux);
>  	if (err)
> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> index 9be85aa4ec5f..c2b76545153c 100644
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -7,6 +7,8 @@
>  #include <linux/rbtree_latch.h>
>  #include <linux/perf_event.h>
>  #include <linux/btf.h>
> +#include <linux/rcupdate_trace.h>
> +#include <linux/rcupdate_wait.h>
>  
>  /* dummy _ops. The verifier will operate on target program's ops. */
>  const struct bpf_verifier_ops bpf_extension_verifier_ops = {
> @@ -210,9 +212,12 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
>  	 * updates to trampoline would change the code from underneath the
>  	 * preempted task. Hence wait for tasks to voluntarily schedule or go
>  	 * to userspace.
> +	 * The same trampoline can hold both sleepable and non-sleepable progs.
> +	 * synchronize_rcu_tasks_trace() is needed to make sure all sleepable
> +	 * programs finish executing.
> +	 * Wait for these two grace periods together.
>  	 */
> -
> -	synchronize_rcu_tasks();
> +	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
>  
>  	err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
>  					  &tr->func.model, flags, tprogs,
> @@ -344,7 +349,14 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
>  	if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT])))
>  		goto out;
>  	bpf_image_ksym_del(&tr->ksym);
> -	/* wait for tasks to get out of trampoline before freeing it */
> +	/* This code will be executed when all bpf progs (both sleepable and
> +	 * non-sleepable) went through
> +	 * bpf_prog_put()->call_rcu[_tasks_trace]()->bpf_prog_free_deferred().
> +	 * Hence no need for another synchronize_rcu_tasks_trace() here,
> +	 * but synchronize_rcu_tasks() is still needed, since trampoline
> +	 * may not have had any sleepable programs and we need to wait
> +	 * for tasks to get out of trampoline code before freeing it.
> +	 */
>  	synchronize_rcu_tasks();
>  	bpf_jit_free_exec(tr->image);
>  	hlist_del(&tr->hlist);
> @@ -394,6 +406,16 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
>  	rcu_read_unlock();
>  }
>  
> +void notrace __bpf_prog_enter_sleepable(void)
> +{
> +	rcu_read_lock_trace();
> +}
> +
> +void notrace __bpf_prog_exit_sleepable(void)
> +{
> +	rcu_read_unlock_trace();
> +}
> +
>  int __weak
>  arch_prepare_bpf_trampoline(void *image, void *image_end,
>  			    const struct btf_func_model *m, u32 flags,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7de98906ddf4..05aa990ba9a4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -9112,6 +9112,23 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		return -EINVAL;
>  	}
>  
> +	if (prog->aux->sleepable)
> +		switch (map->map_type) {
> +		case BPF_MAP_TYPE_HASH:
> +		case BPF_MAP_TYPE_LRU_HASH:
> +		case BPF_MAP_TYPE_ARRAY:
> +			if (!is_preallocated_map(map)) {
> +				verbose(env,
> +					"Sleepable programs can only use preallocated hash maps\n");
> +				return -EINVAL;
> +			}
> +			break;
> +		default:
> +			verbose(env,
> +				"Sleepable programs can only use array and hash maps\n");
> +			return -EINVAL;
> +		}
> +
>  	return 0;
>  }
>  
> @@ -10722,6 +10739,22 @@ static int check_attach_modify_return(struct bpf_prog *prog, unsigned long addr)
>  	return -EINVAL;
>  }
>  
> +/* list of non-sleepable kernel functions that are otherwise
> + * available to attach by bpf_lsm or fmod_ret progs.
> + */
> +static int check_sleepable_blacklist(unsigned long addr)
> +{
> +#ifdef CONFIG_BPF_LSM
> +	if (addr == (long)bpf_lsm_task_free)
> +		return -EINVAL;
> +#endif
> +#ifdef CONFIG_SECURITY
> +	if (addr == (long)security_task_free)
> +		return -EINVAL;
> +#endif
> +	return 0;
> +}
> +
>  static int check_attach_btf_id(struct bpf_verifier_env *env)
>  {
>  	struct bpf_prog *prog = env->prog;
> @@ -10739,6 +10772,12 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>  	long addr;
>  	u64 key;
>  
> +	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> +	    prog->type != BPF_PROG_TYPE_LSM) {
> +		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> +		return -EINVAL;
> +	}
> +
>  	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
>  		return check_struct_ops_btf_id(env);
>  
> @@ -10952,8 +10991,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>  			if (ret)
>  				verbose(env, "%s() is not modifiable\n",
>  					prog->aux->attach_func_name);
> +		} else if (prog->aux->sleepable) {
> +			switch (prog->type) {
> +			case BPF_PROG_TYPE_TRACING:
> +				/* fentry/fexit progs can be sleepable only if they are
> +				 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
> +				 */
> +				ret = check_attach_modify_return(prog, addr);
> +				if (!ret)
> +					ret = check_sleepable_blacklist(addr);
> +				break;
> +			case BPF_PROG_TYPE_LSM:
> +				/* LSM progs check that they are attached to bpf_lsm_*() funcs
> +				 * which are sleepable too.
> +				 */
> +				ret = check_sleepable_blacklist(addr);
> +				break;
> +			default:
> +				break;
> +			}
> +			if (ret)
> +				verbose(env, "%s is not sleepable\n",
> +					prog->aux->attach_func_name);
>  		}
> -
>  		if (ret)
>  			goto out;
>  		tr->func.addr = (void *)addr;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 0cb8ec948816..73f9e3f84b77 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -332,6 +332,14 @@ enum bpf_link_type {
>  /* The verifier internal test flag. Behavior is undefined */
>  #define BPF_F_TEST_STATE_FREQ	(1U << 3)
>  
> +/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
> + * restrict map and helper usage for such programs. Sleepable BPF programs can
> + * only be attached to hooks where kernel execution context allows sleeping.
> + * Such programs are allowed to use helpers that may sleep like
> + * bpf_copy_from_user().
> + */
> +#define BPF_F_SLEEPABLE		(1U << 4)
> +
>  /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
>   * two extensions:
>   *
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-06-30  4:33 ` [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs Alexei Starovoitov
  2020-06-30 22:04   ` Paul E. McKenney
@ 2020-06-30 23:26   ` Daniel Borkmann
  2020-06-30 23:41     ` Alexei Starovoitov
  1 sibling, 1 reply; 20+ messages in thread
From: Daniel Borkmann @ 2020-06-30 23:26 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: paulmck, netdev, bpf, kernel-team

On 6/30/20 6:33 AM, Alexei Starovoitov wrote:
[...]
>   
> +/* list of non-sleepable kernel functions that are otherwise
> + * available to attach by bpf_lsm or fmod_ret progs.
> + */
> +static int check_sleepable_blacklist(unsigned long addr)
> +{
> +#ifdef CONFIG_BPF_LSM
> +	if (addr == (long)bpf_lsm_task_free)
> +		return -EINVAL;
> +#endif
> +#ifdef CONFIG_SECURITY
> +	if (addr == (long)security_task_free)
> +		return -EINVAL;
> +#endif
> +	return 0;
> +}

Would be nice to have some sort of generic function annotation to describe
that code cannot sleep inside of it, and then filter based on that. Anyway,
is above from manual code inspection?

What about others like security_sock_rcv_skb() for example which could be
bh_lock_sock()'ed (or, generally hooks running in softirq context)?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-06-30 23:26   ` Daniel Borkmann
@ 2020-06-30 23:41     ` Alexei Starovoitov
  2020-07-01  9:15       ` KP Singh
  2020-07-01  9:17       ` Daniel Borkmann
  0 siblings, 2 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2020-06-30 23:41 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, paulmck, netdev, bpf, kernel-team

On Wed, Jul 01, 2020 at 01:26:44AM +0200, Daniel Borkmann wrote:
> On 6/30/20 6:33 AM, Alexei Starovoitov wrote:
> [...]
> > +/* list of non-sleepable kernel functions that are otherwise
> > + * available to attach by bpf_lsm or fmod_ret progs.
> > + */
> > +static int check_sleepable_blacklist(unsigned long addr)
> > +{
> > +#ifdef CONFIG_BPF_LSM
> > +	if (addr == (long)bpf_lsm_task_free)
> > +		return -EINVAL;
> > +#endif
> > +#ifdef CONFIG_SECURITY
> > +	if (addr == (long)security_task_free)
> > +		return -EINVAL;
> > +#endif
> > +	return 0;
> > +}
> 
> Would be nice to have some sort of generic function annotation to describe
> that code cannot sleep inside of it, and then filter based on that. Anyway,
> is above from manual code inspection?

yep. all manual. I don't think there is a way to automate it.
At least I cannot think of one.

> What about others like security_sock_rcv_skb() for example which could be
> bh_lock_sock()'ed (or, generally hooks running in softirq context)?

ahh. it's in running in bh at that point? then it should be added to blacklist.

The rough idea I had is to try all lsm_* and security_* hooks with all
debug kernel flags and see which ones will complain. Then add them to blacklist.
Unfortunately I'm completely swamped and cannot promise to do that
in the coming months.
So either we wait for somebody to do due diligence or land it knowing
that blacklist is incomplete and fix it up one by one.
I think the folks who're waiting on sleepable work would prefer the latter.
I'm fine whichever way.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-06-30 23:41     ` Alexei Starovoitov
@ 2020-07-01  9:15       ` KP Singh
  2020-07-01  9:34         ` Daniel Borkmann
  2020-07-01  9:17       ` Daniel Borkmann
  1 sibling, 1 reply; 20+ messages in thread
From: KP Singh @ 2020-07-01  9:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, David S. Miller, paulmck, Networking, bpf, Kernel Team

On Wed, Jul 1, 2020 at 1:41 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 01, 2020 at 01:26:44AM +0200, Daniel Borkmann wrote:
> > On 6/30/20 6:33 AM, Alexei Starovoitov wrote:
> > [...]
> > > +/* list of non-sleepable kernel functions that are otherwise
> > > + * available to attach by bpf_lsm or fmod_ret progs.
> > > + */
> > > +static int check_sleepable_blacklist(unsigned long addr)
> > > +{
> > > +#ifdef CONFIG_BPF_LSM
> > > +   if (addr == (long)bpf_lsm_task_free)
> > > +           return -EINVAL;
> > > +#endif
> > > +#ifdef CONFIG_SECURITY
> > > +   if (addr == (long)security_task_free)
> > > +           return -EINVAL;
> > > +#endif
> > > +   return 0;
> > > +}
> >
> > Would be nice to have some sort of generic function annotation to describe
> > that code cannot sleep inside of it, and then filter based on that. Anyway,
> > is above from manual code inspection?
>
> yep. all manual. I don't think there is a way to automate it.
> At least I cannot think of one.
>
> > What about others like security_sock_rcv_skb() for example which could be
> > bh_lock_sock()'ed (or, generally hooks running in softirq context)?
>
> ahh. it's in running in bh at that point? then it should be added to blacklist.
>
> The rough idea I had is to try all lsm_* and security_* hooks with all
> debug kernel flags and see which ones will complain. Then add them to blacklist.
> Unfortunately I'm completely swamped and cannot promise to do that
> in the coming months.
> So either we wait for somebody to do due diligence or land it knowing
> that blacklist is incomplete and fix it up one by one.
> I think the folks who're waiting on sleepable work would prefer the latter.
> I'm fine whichever way.

Chiming in since I belong to the folks who are waiting on sleepable BPF patches:

1. Let's obviously add security_sock_rcv_skb to the list.
2. I can help in combing through the LSM hooks (at least the comments)
     to look for any other obvious candidates.
3. I think it's okay (for us) for this list to be a WIP and build on it with
    proper warnings (in the changelog / comments).
4. To make it easier for figuring out which hooks cannot sleep,
     It would be nice if we could:

    * Have a helper say, bool bpf_cant_sleep(), available when
       DEBUG_ATOMIC_SLEEP is enabled.
    * Attach LSM programs to all hooks which call this helper and gather data.
    * Let this run on dev machines, run workloads which use the LSM hooks .

4. Finally, once we do the hard work. We can also think of augmenting the
    LSM_HOOK macro to have structured access to whether a hook is sleepable
    or not (instead of relying on comments).

- KP

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-06-30 23:41     ` Alexei Starovoitov
  2020-07-01  9:15       ` KP Singh
@ 2020-07-01  9:17       ` Daniel Borkmann
  2020-07-01 15:14         ` Alexei Starovoitov
  1 sibling, 1 reply; 20+ messages in thread
From: Daniel Borkmann @ 2020-07-01  9:17 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, paulmck, netdev, bpf, kernel-team

On 7/1/20 1:41 AM, Alexei Starovoitov wrote:
> On Wed, Jul 01, 2020 at 01:26:44AM +0200, Daniel Borkmann wrote:
>> On 6/30/20 6:33 AM, Alexei Starovoitov wrote:
>> [...]
>>> +/* list of non-sleepable kernel functions that are otherwise
>>> + * available to attach by bpf_lsm or fmod_ret progs.
>>> + */
>>> +static int check_sleepable_blacklist(unsigned long addr)
>>> +{
>>> +#ifdef CONFIG_BPF_LSM
>>> +	if (addr == (long)bpf_lsm_task_free)
>>> +		return -EINVAL;
>>> +#endif
>>> +#ifdef CONFIG_SECURITY
>>> +	if (addr == (long)security_task_free)
>>> +		return -EINVAL;
>>> +#endif
>>> +	return 0;
>>> +}
>>
>> Would be nice to have some sort of generic function annotation to describe
>> that code cannot sleep inside of it, and then filter based on that. Anyway,
>> is above from manual code inspection?
> 
> yep. all manual. I don't think there is a way to automate it.
> At least I cannot think of one.

Automation might be hard, but maybe semi-automate: we have a cant_migrate()
assertion in __BPF_PROG_RUN() which asserts on cant_sleep() PREEMPT_RT kernels
at least. We originally just has the cant_sleep() there before 37e1d9202225
("bpf: Replace cant_sleep() with cant_migrate()"). So perhaps one way to catch
bugs for sleepable progs is to add a __might_sleep() into __bpf_prog_enter_sleepable()
in order to trigger the assertion generally for DEBUG_ATOMIC_SLEEP configured
kernels when we're in non-sleepable sections? Still not perfect since the code
needs to be exercised first but better than nothing at all.

>> What about others like security_sock_rcv_skb() for example which could be
>> bh_lock_sock()'ed (or, generally hooks running in softirq context)?
> 
> ahh. it's in running in bh at that point? then it should be added to blacklist.

Yep.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-07-01  9:15       ` KP Singh
@ 2020-07-01  9:34         ` Daniel Borkmann
  2020-07-01 15:21           ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Borkmann @ 2020-07-01  9:34 UTC (permalink / raw)
  To: KP Singh, Alexei Starovoitov
  Cc: David S. Miller, paulmck, Networking, bpf, Kernel Team

On 7/1/20 11:15 AM, KP Singh wrote:
> On Wed, Jul 1, 2020 at 1:41 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Jul 01, 2020 at 01:26:44AM +0200, Daniel Borkmann wrote:
>>> On 6/30/20 6:33 AM, Alexei Starovoitov wrote:
>>> [...]
>>>> +/* list of non-sleepable kernel functions that are otherwise
>>>> + * available to attach by bpf_lsm or fmod_ret progs.
>>>> + */
>>>> +static int check_sleepable_blacklist(unsigned long addr)
>>>> +{
>>>> +#ifdef CONFIG_BPF_LSM
>>>> +   if (addr == (long)bpf_lsm_task_free)
>>>> +           return -EINVAL;
>>>> +#endif
>>>> +#ifdef CONFIG_SECURITY
>>>> +   if (addr == (long)security_task_free)
>>>> +           return -EINVAL;
>>>> +#endif
>>>> +   return 0;
>>>> +}
>>>
>>> Would be nice to have some sort of generic function annotation to describe
>>> that code cannot sleep inside of it, and then filter based on that. Anyway,
>>> is above from manual code inspection?
>>
>> yep. all manual. I don't think there is a way to automate it.
>> At least I cannot think of one.
>>
>>> What about others like security_sock_rcv_skb() for example which could be
>>> bh_lock_sock()'ed (or, generally hooks running in softirq context)?
>>
>> ahh. it's in running in bh at that point? then it should be added to blacklist.
>>
>> The rough idea I had is to try all lsm_* and security_* hooks with all
>> debug kernel flags and see which ones will complain. Then add them to blacklist.
>> Unfortunately I'm completely swamped and cannot promise to do that
>> in the coming months.
>> So either we wait for somebody to do due diligence or land it knowing
>> that blacklist is incomplete and fix it up one by one.
>> I think the folks who're waiting on sleepable work would prefer the latter.
>> I'm fine whichever way.
> 
> Chiming in since I belong to the folks who are waiting on sleepable BPF patches:
> 
> 1. Let's obviously add security_sock_rcv_skb to the list.
> 2. I can help in combing through the LSM hooks (at least the comments)
>       to look for any other obvious candidates.
> 3. I think it's okay (for us) for this list to be a WIP and build on it with
>      proper warnings (in the changelog / comments).
> 4. To make it easier for figuring out which hooks cannot sleep,
>       It would be nice if we could:
> 
>      * Have a helper say, bool bpf_cant_sleep(), available when
>         DEBUG_ATOMIC_SLEEP is enabled.
>      * Attach LSM programs to all hooks which call this helper and gather data.
>      * Let this run on dev machines, run workloads which use the LSM hooks .
> 
> 4. Finally, once we do the hard work. We can also think of augmenting the
>      LSM_HOOK macro to have structured access to whether a hook is sleepable
>      or not (instead of relying on comments).

+1, I think augmenting mid-term would be the best given check_sleepable_blacklist()
is rather a very fragile workaround^hack and it's also a generic lsm/sec hooks issue
(at least for BPF_PROG_TYPE_LSM type & for the sake of documenting it for other LSMs).
Perhaps there are function attributes that could be used and later retrieved via BTF?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-07-01  9:17       ` Daniel Borkmann
@ 2020-07-01 15:14         ` Alexei Starovoitov
  2020-07-01 19:03           ` KP Singh
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2020-07-01 15:14 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Paul E. McKenney, Network Development, bpf, Kernel Team

On Wed, Jul 1, 2020 at 2:17 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> ("bpf: Replace cant_sleep() with cant_migrate()"). So perhaps one way to catch
> bugs for sleepable progs is to add a __might_sleep() into __bpf_prog_enter_sleepable()

that's a good idea.

> in order to trigger the assertion generally for DEBUG_ATOMIC_SLEEP configured
> kernels when we're in non-sleepable sections? Still not perfect since the code
> needs to be exercised first but better than nothing at all.
>
> >> What about others like security_sock_rcv_skb() for example which could be
> >> bh_lock_sock()'ed (or, generally hooks running in softirq context)?
> >
> > ahh. it's in running in bh at that point? then it should be added to blacklist.
>
> Yep.

I'm assuming KP will take care of it soon.
If not I'll come back to this set some time in August.

In the meantime I've pushed patch 1 that removes redundant sync_rcu to bpf-next,
since it's independent and it benefits from being in the tree as much
as possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-07-01  9:34         ` Daniel Borkmann
@ 2020-07-01 15:21           ` Alexei Starovoitov
  2020-07-10 17:00             ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2020-07-01 15:21 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: KP Singh, David S. Miller, Paul E. McKenney, Networking, bpf,
	Kernel Team

On Wed, Jul 1, 2020 at 2:34 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> +1, I think augmenting mid-term would be the best given check_sleepable_blacklist()
> is rather a very fragile workaround^hack and it's also a generic lsm/sec hooks issue

I tried to make that crystal clear back in march during bpf virtual conference.
imo whitelist is just as fragile as blacklist. Underlying
implementation can change.

> (at least for BPF_PROG_TYPE_LSM type & for the sake of documenting it for other LSMs).
> Perhaps there are function attributes that could be used and later retrieved via BTF?

Even if we convince gcc folks to add another function attribute it
won't appear in dwarf.
So we won't be able to convert it to BTF in pahole.
Looking at how we failed to extend address_space() attribute to
support existing __rcu
and __user annotations I don't have high hopes of achieving annotations
via compiler (either gcc or clang). So plan B is to engage with sparse folks and
make it emit BTF with __rcu, __user and other annotations.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-07-01 15:14         ` Alexei Starovoitov
@ 2020-07-01 19:03           ` KP Singh
  0 siblings, 0 replies; 20+ messages in thread
From: KP Singh @ 2020-07-01 19:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, David S. Miller, Paul E. McKenney,
	Network Development, bpf, Kernel Team

On Wed, Jul 1, 2020 at 5:14 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 1, 2020 at 2:17 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > ("bpf: Replace cant_sleep() with cant_migrate()"). So perhaps one way to catch
> > bugs for sleepable progs is to add a __might_sleep() into __bpf_prog_enter_sleepable()
>
> that's a good idea.
>
> > in order to trigger the assertion generally for DEBUG_ATOMIC_SLEEP configured
> > kernels when we're in non-sleepable sections? Still not perfect since the code
> > needs to be exercised first but better than nothing at all.
> >
> > >> What about others like security_sock_rcv_skb() for example which could be
> > >> bh_lock_sock()'ed (or, generally hooks running in softirq context)?
> > >
> > > ahh. it's in running in bh at that point? then it should be added to blacklist.
> >
> > Yep.
>
> I'm assuming KP will take care of it soon.

I found one other hook, file_send_sigiotask, which mentions
"Note that this hook is sometimes called from interrupt." So I think
we should add it to the list as well.

Given some more due diligence done here
and Daniel's proposal of adding __might_sleep() to
the __bpf_prog_enter_sleepable() we should be able to
iterate on finding other non-sleepable hooks (if they exist)
and eventually augmenting the LSM_HOOK macro for a
more structured way to store this information.

- KP

> If not I'll come back to this set some time in August.
>
> In the meantime I've pushed patch 1 that removes redundant sync_rcu to bpf-next,
> since it's independent and it benefits from being in the tree as much
> as possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-07-01 15:21           ` Alexei Starovoitov
@ 2020-07-10 17:00             ` Arnaldo Carvalho de Melo
  2020-07-10 22:59               ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Arnaldo Carvalho de Melo @ 2020-07-10 17:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, KP Singh, David S. Miller, Paul E. McKenney,
	Networking, bpf, Kernel Team

Em Wed, Jul 01, 2020 at 08:21:13AM -0700, Alexei Starovoitov escreveu:
> On Wed, Jul 1, 2020 at 2:34 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > +1, I think augmenting mid-term would be the best given check_sleepable_blacklist()
> > is rather a very fragile workaround^hack and it's also a generic lsm/sec hooks issue
> 
> I tried to make that crystal clear back in march during bpf virtual conference.
> imo whitelist is just as fragile as blacklist. Underlying
> implementation can change.
> 
> > (at least for BPF_PROG_TYPE_LSM type & for the sake of documenting it for other LSMs).
> > Perhaps there are function attributes that could be used and later retrieved via BTF?
> 
> Even if we convince gcc folks to add another function attribute it
> won't appear in dwarf.

Warning, hack ahead!

Perhaps we could do that with some sort of convention, i.e. define some
type and make a function returning that type to have the desired
attribute?

I.e.

typedef __attribute__foo__int int;

__attribute__foo__int function_bla(...)
{
}

?

> So we won't be able to convert it to BTF in pahole.
> Looking at how we failed to extend address_space() attribute to
> support existing __rcu
> and __user annotations I don't have high hopes of achieving annotations
> via compiler (either gcc or clang). So plan B is to engage with sparse folks and
> make it emit BTF with __rcu, __user and other annotations.

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs
  2020-07-10 17:00             ` Arnaldo Carvalho de Melo
@ 2020-07-10 22:59               ` Alexei Starovoitov
  0 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2020-07-10 22:59 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Daniel Borkmann, KP Singh, David S. Miller, Paul E. McKenney,
	Networking, bpf, Kernel Team

On Fri, Jul 10, 2020 at 10:00 AM Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> Em Wed, Jul 01, 2020 at 08:21:13AM -0700, Alexei Starovoitov escreveu:
> > On Wed, Jul 1, 2020 at 2:34 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >
> > > +1, I think augmenting mid-term would be the best given check_sleepable_blacklist()
> > > is rather a very fragile workaround^hack and it's also a generic lsm/sec hooks issue
> >
> > I tried to make that crystal clear back in march during bpf virtual conference.
> > imo whitelist is just as fragile as blacklist. Underlying
> > implementation can change.
> >
> > > (at least for BPF_PROG_TYPE_LSM type & for the sake of documenting it for other LSMs).
> > > Perhaps there are function attributes that could be used and later retrieved via BTF?
> >
> > Even if we convince gcc folks to add another function attribute it
> > won't appear in dwarf.
>
> Warning, hack ahead!
>
> Perhaps we could do that with some sort of convention, i.e. define some
> type and make a function returning that type to have the desired
> attribute?
>
> I.e.
>
> typedef __attribute__foo__int int;
>
> __attribute__foo__int function_bla(...)
> {
> }
>
> ?

What about lsm hooks returning void ?
I guess for lsm we can hack something like that,
but for __rcu and __user that won't really work.
The kernel changes will be too massive.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-30  4:33 [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Alexei Starovoitov
2020-06-30  4:33 ` [PATCH v5 bpf-next 1/5] bpf: Remove redundant synchronize_rcu Alexei Starovoitov
2020-06-30 20:55   ` Paul E. McKenney
2020-06-30  4:33 ` [PATCH v5 bpf-next 2/5] bpf: Introduce sleepable BPF programs Alexei Starovoitov
2020-06-30 22:04   ` Paul E. McKenney
2020-06-30 23:26   ` Daniel Borkmann
2020-06-30 23:41     ` Alexei Starovoitov
2020-07-01  9:15       ` KP Singh
2020-07-01  9:34         ` Daniel Borkmann
2020-07-01 15:21           ` Alexei Starovoitov
2020-07-10 17:00             ` Arnaldo Carvalho de Melo
2020-07-10 22:59               ` Alexei Starovoitov
2020-07-01  9:17       ` Daniel Borkmann
2020-07-01 15:14         ` Alexei Starovoitov
2020-07-01 19:03           ` KP Singh
2020-06-30  4:33 ` [PATCH v5 bpf-next 3/5] bpf: Add bpf_copy_from_user() helper Alexei Starovoitov
2020-06-30 11:41   ` KP Singh
2020-06-30  4:33 ` [PATCH v5 bpf-next 4/5] libbpf: support sleepable progs Alexei Starovoitov
2020-06-30  4:33 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Add sleepable tests Alexei Starovoitov
2020-06-30  5:14 ` [PATCH v5 bpf-next 0/5] bpf: Introduce minimal support for sleepable progs Andrii Nakryiko

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git