bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers.
@ 2021-06-24  2:25 Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The first request to support timers in bpf was made in 2013 before sys_bpf syscall
was added. That use case was periodic sampling. It was address with attaching
bpf programs to perf_events. Then during XDP development the timers were requested
to do garbage collection and health checks. They were worked around by implementing
timers in user space and triggering progs with BPF_PROG_RUN command.
The user space timers and perf_event+bpf timers are not armed by the bpf program.
They're done asynchronously vs program execution. The XDP program cannot send a
packet and arm the timer at the same time. The tracing prog cannot record an
event and arm the timer right away. This large class of use cases remained
unaddressed. The jiffy based and hrtimer based timers are essential part of the
kernel development and with this patch set the hrtimer based timers will be
available to bpf programs.

TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
to make sure bpf progs cannot crash the kernel.

v2->v3:
The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
called to make sure callback code doesn't disappear when timer is active and
drop refcnt when timer cb is done. That led to a ton of race conditions between
callback running and concurrent bpf_timer_init/start/cancel on another cpu,
and concurrent bpf_map_update/delete_elem, and map destroy.

Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
timers that bpf prog armed in a link list and canceled them when prog refcnt
went to zero. The race conditions disappeared, but timers in map-in-map could
not be supported cleanly, since timers in inner maps have inner map's life time
and don't match prog's life time.

This v3 approach makes timers to be owned by maps. It allows timers in inner
maps to be supported from the start. This apporach relies on "user refcnt"
scheme used in prog_array that stores bpf programs for bpf_tail_call. The
bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
callback does decrement the refcnt. The ops->map_release_uref is
responsible for cancelling the timers and dropping prog refcnt when user space
reference to a map is dropped. That addressed all the races and simplified
locking.

Andrii presented a use case where specifying callback_fn in bpf_timer_init()
is inconvenient vs specifying in bpf_timer_start(). The bpf_timer_init()
typically is called outside for timer callback, while bpf_timer_start() most
likely will be called from the callback. 
timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion and as
infinite loop to the verifier. The verifier had to be made smarter to recognize
such async callbacks. Patches 4,5,6 addressed that.

Patch 1 implements bpf timer helpers and locking.
Patch 2 implements map side of bpf timer support.
Patch 3 adds support for BTF in inner maps.
Patch 4 teaches check_cfg() pass to understand async callbacks.
Patch 5 teaches do_check() pass to understand async callbacks.
Patch 6 teaches check_max_stack_depth() pass to understand async callbacks.
Patches 7 and 8 are the tests.

v1->v2:
- Addressed great feedback from Andrii and Toke.
- Fixed race between parallel bpf_timer_*() ops.
- Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
- Disallowed mmap and global timers.
- Allow spin_lock and bpf_timer in an element.
- Fixed memory leaks due to map destruction and LRU eviction.
- A ton more tests.

Alexei Starovoitov (8):
  bpf: Introduce bpf timers.
  bpf: Add map side support for bpf timers.
  bpf: Remember BTF of inner maps.
  bpf: Relax verifier recursion check.
  bpf: Implement verifier support for validation of async callbacks.
  bpf: Teach stack depth check about async callbacks.
  selftests/bpf: Add bpf_timer test.
  selftests/bpf: Add a test with bpf_timer in inner map.

 include/linux/bpf.h                           |  47 ++-
 include/linux/bpf_verifier.h                  |  10 +-
 include/linux/btf.h                           |   1 +
 include/uapi/linux/bpf.h                      |  55 ++++
 kernel/bpf/arraymap.c                         |  22 ++
 kernel/bpf/btf.c                              |  77 ++++-
 kernel/bpf/hashtab.c                          |  96 +++++-
 kernel/bpf/helpers.c                          | 279 ++++++++++++++++
 kernel/bpf/local_storage.c                    |   4 +-
 kernel/bpf/map_in_map.c                       |   7 +
 kernel/bpf/syscall.c                          |  21 +-
 kernel/bpf/verifier.c                         | 311 +++++++++++++++++-
 kernel/trace/bpf_trace.c                      |   2 +-
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  55 ++++
 .../testing/selftests/bpf/prog_tests/timer.c  |  55 ++++
 .../selftests/bpf/prog_tests/timer_mim.c      |  59 ++++
 tools/testing/selftests/bpf/progs/timer.c     | 293 +++++++++++++++++
 tools/testing/selftests/bpf/progs/timer_mim.c |  81 +++++
 19 files changed, 1425 insertions(+), 52 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_mim.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_mim.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-25  6:25   ` Yonghong Song
                     ` (2 more replies)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 2/8] bpf: Add map side support for " Alexei Starovoitov
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field and helpers to operate on it:

// Initialize the timer.
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, int flags);

// Arm the timer to call callback_fn static function and set its
// expiration 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsec);

// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);

Here is how BPF program might look like:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, CLOCK_REALTIME);
        bpf_timer_start(&val->timer, timer_cb, 1000 /* call timer_cb2 in 1 usec */);
    }
}

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patches add necessary safety checks.

Only programs with CAP_BPF are allowed to use bpf_timer.

The amount of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper is receiving hidden 'map' argument and
bpf_timer_start() is receiving hidden 'prog' argument supplied by the verifier.
The prog pointer is needed to do refcnting of bpf program to make sure that
program doesn't get freed while the timer is armed. This apporach relies on
"user refcnt" scheme used in prog_array that stores bpf programs for
bpf_tail_call. The bpf_timer_start() will increment the prog refcnt which is
paired with bpf_timer_cancel() that will drop the prog refcnt. The
ops->map_release_uref is responsible for cancelling the timers and dropping
prog refcnt when user space reference to a map reaches zero.
This uref approach is done to make sure that Ctrl-C of user space process will
not leave timers running forever unless the user space explicitly pinned a map
that contained timers in bpffs.

The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if given map element had it allocated.
"bpftool map update" command can be used to cancel timers.

The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
'__u64 :64' has 1 byte alignment of 8 byte padding.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h            |   3 +
 include/uapi/linux/bpf.h       |  55 +++++++
 kernel/bpf/helpers.c           | 281 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 138 ++++++++++++++++
 kernel/trace/bpf_trace.c       |   2 +-
 scripts/bpf_doc.py             |   2 +
 tools/include/uapi/linux/bpf.h |  55 +++++++
 7 files changed, 535 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f309fc1509f2..72da9d4d070c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -168,6 +168,7 @@ struct bpf_map {
 	u32 max_entries;
 	u32 map_flags;
 	int spin_lock_off; /* >=0 valid offset, <0 error */
+	int timer_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
@@ -221,6 +222,7 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 }
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
+void bpf_timer_cancel_and_free(void *timer);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
@@ -314,6 +316,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
 	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
+	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
 	__BPF_ARG_TYPE_MAX,
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bf9252c7381e..5e0a2a40507e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4780,6 +4780,53 @@ union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, u64 flags)
+ *	Description
+ *		Initialize the timer.
+ *		First 4 bits of *flags* specify clockid.
+ *		Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
+ *		All other bits of *flags* are reserved.
+ *	Return
+ *		0 on success.
+ *		**-EBUSY** if *timer* is already initialized.
+ *		**-EINVAL** if invalid *flags* are passed.
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs)
+ *	Description
+ *		Configure the timer to call *callback_fn* static function and
+ *		set its expiration N nanoseconds from the current time.
+ *		The *callback_fn* will be invoked in soft irq context on some cpu
+ *		and will not repeat unless another bpf_timer_start() is made.
+ *		In such case the next invocation can migrate to a different cpu.
+ *		Since struct bpf_timer is a field inside map element the map
+ *		owns the timer. The bpf_timer_start() will increment refcnt
+ *		of BPF program to make sure that callback_fn code stays valid.
+ *		When user space reference to a map reaches zero all timers
+ *		in a map are cancelled and corresponding program's refcnts are
+ *		decremented. This is done to make sure that Ctrl-C of a user
+ *		process doesn't leave any timers running. If map is pinned in
+ *		bpffs the callback_fn can re-arm itself indefinitely.
+ *		bpf_map_update/delete_elem() helpers and user space sys_bpf commands
+ *		cancel and free the timer in the given map element.
+ *		The map can contain timers that invoke callback_fn-s from different
+ *		programs. The same callback_fn can serve different timers from
+ *		different maps if key/value layout matches across maps.
+ *		Every bpf_timer_start() can have different callback_fn.
+ *
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Cancel the timer and wait for callback_fn to finish if it was running.
+ *	Return
+ *		0 if the timer was not active.
+ *		1 if the timer was active.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
+ *		which would have led to a deadlock otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4951,6 +4998,9 @@ union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6077,6 +6127,11 @@ struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+	__u64 :64;
+} __attribute__((aligned(8)));
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a2f1f15ce432..584a37a1b974 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -989,6 +989,281 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+/* BPF map elements can contain 'struct bpf_timer'.
+ * Such map owns all of its BPF timers.
+ * 'struct bpf_timer' is allocated as part of map element allocation
+ * and it's zero initialized.
+ * That space is used to keep 'struct bpf_timer_kern'.
+ * bpf_timer_init() allocates 'struct bpf_hrtimer', inits hrtimer, and
+ * remembers 'struct bpf_map *' pointer it's part of.
+ * bpf_timer_start() arms the timer and increments struct bpf_prog refcnt.
+ * If user space reference to a map goes to zero at this point
+ * ops->map_release_uref callback is responsible for cancelling the timers,
+ * freeing their memory, and decrementing prog's refcnts.
+ * bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
+ *
+ * The timer callback bpf_timer_cb() is doing refcnt++ and -- around
+ * bpf subprogram invocation to make that bpf_map_delete_elem() done
+ * explicitly or implicitly in case of LRU maps will not free bpf_prog
+ * while the callback is running.
+ *
+ * Inner maps can contain bpf timers as well. ops->map_release_uref is
+ * freeing the timers when inner map is replaced or deleted by user space.
+ */
+struct bpf_hrtimer {
+	struct hrtimer timer;
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *callback_fn;
+	void *value;
+};
+
+/* the actual struct hidden inside uapi struct bpf_timer */
+struct bpf_timer_kern {
+	struct bpf_hrtimer *timer;
+	/* bpf_spin_lock is used here instead of spinlock_t to make
+	 * sure that it always fits into space resereved by struct bpf_timer
+	 * regardless of LOCKDEP and spinlock debug flags.
+	 */
+	struct bpf_spin_lock lock;
+} __attribute__((aligned(8)));
+
+static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
+
+static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
+{
+	struct bpf_hrtimer *t = container_of(hrtimer, struct bpf_hrtimer, timer);
+	struct bpf_map *map = t->map;
+	void *value = t->value;
+	struct bpf_timer_kern *timer = value + map->timer_off;
+	struct bpf_prog *prog;
+	void *callback_fn;
+	void *key;
+	u32 idx;
+	int ret;
+
+	____bpf_spin_lock(&timer->lock);
+	/* callback_fn and prog need to match. They're updated together
+	 * and have to be read under lock.
+	 */
+	prog = t->prog;
+	callback_fn = t->callback_fn;
+
+	/* wrap bpf subprog invocation with prog->refcnt++ and -- to make
+	 * sure that refcnt doesn't become zero when subprog is executing.
+	 * Do it under lock to make sure that bpf_timer_start doesn't drop
+	 * prev prog refcnt to zero before timer_cb has a chance to bump it.
+	 */
+	bpf_prog_inc(prog);
+	____bpf_spin_unlock(&timer->lock);
+
+	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
+	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
+	 * Remember the timer this callback is servicing to prevent
+	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
+	 */
+	this_cpu_write(hrtimer_running, t);
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		/* compute the key */
+		idx = ((char *)value - array->value) / array->elem_size;
+		key = &idx;
+	} else { /* hash or lru */
+		key = value - round_up(map->key_size, 8);
+	}
+
+	ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
+					 (u64)(long)key,
+					 (u64)(long)value, 0, 0);
+	WARN_ON(ret != 0); /* Next patch moves this check into the verifier */
+	bpf_prog_put(prog);
+
+	this_cpu_write(hrtimer_running, NULL);
+	return HRTIMER_NORESTART;
+}
+
+BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, u64, flags,
+	   struct bpf_map *, map)
+{
+	clockid_t clockid = flags & (MAX_CLOCKS - 1);
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	BUILD_BUG_ON(MAX_CLOCKS != 16);
+	BUILD_BUG_ON(sizeof(struct bpf_timer_kern) > sizeof(struct bpf_timer));
+	BUILD_BUG_ON(__alignof__(struct bpf_timer_kern) != __alignof__(struct bpf_timer));
+
+	if (flags >= MAX_CLOCKS ||
+	    /* similar to timerfd except _ALARM variants are not supported */
+	    (clockid != CLOCK_MONOTONIC &&
+	     clockid != CLOCK_REALTIME &&
+	     clockid != CLOCK_BOOTTIME))
+		return -EINVAL;
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (t) {
+		ret = -EBUSY;
+		goto out;
+	}
+	/* allocate hrtimer via map_kmalloc to use memcg accounting */
+	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
+	if (!t) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	t->value = (void *)timer - map->timer_off;
+	t->map = map;
+	t->prog = NULL;
+	t->callback_fn = NULL;
+	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+	t->timer.function = bpf_timer_cb;
+	timer->timer = t;
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_init_proto = {
+	.func		= bpf_timer_init,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_4(bpf_timer_start, struct bpf_timer_kern *, timer, void *, callback_fn,
+	   u64, nsecs, struct bpf_prog *, prog)
+{
+	struct bpf_hrtimer *t;
+	struct bpf_prog *prev;
+	int ret = 0;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t) {
+		ret = -EINVAL;
+		goto out;
+	}
+	prev = t->prog;
+	if (prev != prog) {
+		if (prev)
+			/* Drop pref prog refcnt when swapping with new prog */
+			bpf_prog_put(prev);
+		/* Dump prog refcnt once.
+		 * Every bpf_timer_start() can pick different callback_fn-s
+		 * within the same prog.
+		 */
+		bpf_prog_inc(prog);
+		t->prog = prog;
+	}
+	t->callback_fn = callback_fn;
+	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_start_proto = {
+	.func		= bpf_timer_start,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static void drop_prog_refcnt(struct bpf_hrtimer *t)
+{
+	struct bpf_prog *prog = t->prog;
+
+	if (prog) {
+		/* If timer was armed with bpf_timer_start()
+		 * drop prog refcnt.
+		 */
+		bpf_prog_put(prog);
+		t->prog = NULL;
+		t->callback_fn = NULL;
+	}
+}
+
+BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
+{
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (this_cpu_read(hrtimer_running) == t) {
+		/* If bpf callback_fn is trying to bpf_timer_cancel()
+		 * its own timer the hrtimer_cancel() will deadlock
+		 * since it waits for callback_fn to finish
+		 */
+		ret = -EDEADLK;
+		goto out;
+	}
+	/* Cancel the timer and wait for associated callback to finish
+	 * if it was running.
+	 */
+	ret = hrtimer_cancel(&t->timer);
+	drop_prog_refcnt(t);
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_cancel_proto = {
+	.func		= bpf_timer_cancel,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+};
+
+/* This function is called by map_delete/update_elem for individual element.
+ * By ops->map_release_uref when the user space reference to a map reaches zero
+ * and by ops->map_free when the kernel reference reaches zero.
+ */
+void bpf_timer_cancel_and_free(void *val)
+{
+	struct bpf_timer_kern *timer = val;
+	struct bpf_hrtimer *t;
+
+	/* Performance optimization: read timer->timer without lock first. */
+	if (!READ_ONCE(timer->timer))
+		return;
+
+	____bpf_spin_lock(&timer->lock);
+	/* re-read it under lock */
+	t = timer->timer;
+	if (!t)
+		goto out;
+	/* Cancel the timer and wait for callback to complete if it was running.
+	 * Check that bpf_map_delete/update_elem() wasn't called from timer callback_fn.
+	 * In such case don't call hrtimer_cancel() (since it will deadlock)
+	 * and don't call hrtimer_try_to_cancel() (since it will just return -1).
+	 * Instead free the timer and set timer->timer = NULL.
+	 * The subsequent bpf_timer_start/cancel() helpers won't be able to use it,
+	 * since it won't be initialized.
+	 * In preallocated maps it's safe to do timer->timer = NULL.
+	 * The memory could be reused for another map element while current
+	 * callback_fn can do bpf_timer_init() on it.
+	 * In non-preallocated maps bpf_timer_cancel_and_free and
+	 * timer->timer = NULL will happen after callback_fn completes, since
+	 * program execution is an RCU critical section.
+	 */
+	if (this_cpu_read(hrtimer_running) != t)
+		hrtimer_cancel(&t->timer);
+	drop_prog_refcnt(t);
+	kfree(t);
+	timer->timer = NULL;
+out:
+	____bpf_spin_unlock(&timer->lock);
+}
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
@@ -1055,6 +1330,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_per_cpu_ptr_proto;
 	case BPF_FUNC_this_cpu_ptr:
 		return &bpf_this_cpu_ptr_proto;
+	case BPF_FUNC_timer_init:
+		return &bpf_timer_init_proto;
+	case BPF_FUNC_timer_start:
+		return &bpf_timer_start_proto;
+	case BPF_FUNC_timer_cancel:
+		return &bpf_timer_cancel_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b7d51fc937c7..fa15bd30e331 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4656,6 +4656,38 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_timer_func(struct bpf_verifier_env *env, int regno,
+			      struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (val) {
+		/* This restriction will be removed in the next patch */
+		verbose(env, "bpf_timer field can only be first in the map value element\n");
+		return -EINVAL;
+	}
+	if (meta->map_ptr) {
+		verbose(env, "verifier bug. Two map pointers in a timer helper\n");
+		return -EFAULT;
+	}
+	meta->map_ptr = map;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -4788,6 +4820,7 @@ static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
 static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -4819,6 +4852,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
 	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
+	[ARG_PTR_TO_TIMER]		= &timer_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -5000,6 +5034,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "verifier internal error\n");
 			return -EFAULT;
 		}
+	} else if (arg_type == ARG_PTR_TO_TIMER) {
+		if (process_timer_func(env, regno, meta))
+			return -EACCES;
 	} else if (arg_type == ARG_PTR_TO_FUNC) {
 		meta->subprogno = reg->subprogno;
 	} else if (arg_type_is_mem_ptr(arg_type)) {
@@ -5742,6 +5779,34 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_timer_start_callback_state(struct bpf_verifier_env *env,
+					  struct bpf_func_state *caller,
+					  struct bpf_func_state *callee,
+					  int insn_idx)
+{
+	struct bpf_map *map_ptr = caller->regs[BPF_REG_1].map_ptr;
+
+	/* bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
@@ -5837,6 +5902,8 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	    func_id != BPF_FUNC_map_pop_elem &&
 	    func_id != BPF_FUNC_map_peek_elem &&
 	    func_id != BPF_FUNC_for_each_map_elem &&
+	    func_id != BPF_FUNC_timer_init &&
+	    func_id != BPF_FUNC_timer_start &&
 	    func_id != BPF_FUNC_redirect_map)
 		return 0;
 
@@ -6069,6 +6136,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_timer_start) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_timer_start_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_snprintf) {
 		err = check_bpf_snprintf_call(env, regs);
 		if (err < 0)
@@ -12533,6 +12607,70 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			continue;
 		}
 
+		if (insn->imm == BPF_FUNC_timer_init) {
+			aux = &env->insn_aux_data[i + delta];
+			if (bpf_map_ptr_poisoned(aux)) {
+				verbose(env, "bpf_timer_init abusing map_ptr\n");
+				return -EINVAL;
+			}
+			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
+			{
+				struct bpf_insn ld_addrs[2] = {
+					BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
+				};
+
+				insn_buf[0] = ld_addrs[0];
+				insn_buf[1] = ld_addrs[1];
+			}
+			insn_buf[2] = *insn;
+			cnt = 3;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
+
+		if (insn->imm == BPF_FUNC_timer_start) {
+			/* There is no need to do:
+			 *     aux = &env->insn_aux_data[i + delta];
+			 *     if (bpf_map_ptr_poisoned(aux)) return -EINVAL;
+			 * for bpf_timer_start(). If the same callback_fn is shared
+			 * by different timers in different maps the poisoned check
+			 * will return false positive.
+			 *
+			 * The verifier will process callback_fn as many times as necessary
+			 * with different maps and the register states prepared by
+			 * set_timer_start_callback_state will be accurate.
+			 *
+			 * There is no need for bpf_timer_start() to check in the
+			 * run-time that bpf_hrtimer->map stored during bpf_timer_init()
+			 * is the same map as in bpf_timer_start()
+			 * because it's the same map element value.
+			 */
+			struct bpf_insn ld_addrs[2] = {
+				BPF_LD_IMM64(BPF_REG_4, (long)prog),
+			};
+
+			insn_buf[0] = ld_addrs[0];
+			insn_buf[1] = ld_addrs[1];
+			insn_buf[2] = *insn;
+			cnt = 3;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
+
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
 		 * only.
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7a52bc172841..80f6e6dafd5e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1057,7 +1057,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_snprintf:
 		return &bpf_snprintf_proto;
 	default:
-		return NULL;
+		return bpf_base_func_proto(func_id);
 	}
 }
 
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 2d94025b38e9..00ac7b79cddb 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -547,6 +547,7 @@ COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     ]
     known_types = {
             '...',
@@ -594,6 +595,7 @@ COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bf9252c7381e..5e0a2a40507e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4780,6 +4780,53 @@ union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, u64 flags)
+ *	Description
+ *		Initialize the timer.
+ *		First 4 bits of *flags* specify clockid.
+ *		Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
+ *		All other bits of *flags* are reserved.
+ *	Return
+ *		0 on success.
+ *		**-EBUSY** if *timer* is already initialized.
+ *		**-EINVAL** if invalid *flags* are passed.
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs)
+ *	Description
+ *		Configure the timer to call *callback_fn* static function and
+ *		set its expiration N nanoseconds from the current time.
+ *		The *callback_fn* will be invoked in soft irq context on some cpu
+ *		and will not repeat unless another bpf_timer_start() is made.
+ *		In such case the next invocation can migrate to a different cpu.
+ *		Since struct bpf_timer is a field inside map element the map
+ *		owns the timer. The bpf_timer_start() will increment refcnt
+ *		of BPF program to make sure that callback_fn code stays valid.
+ *		When user space reference to a map reaches zero all timers
+ *		in a map are cancelled and corresponding program's refcnts are
+ *		decremented. This is done to make sure that Ctrl-C of a user
+ *		process doesn't leave any timers running. If map is pinned in
+ *		bpffs the callback_fn can re-arm itself indefinitely.
+ *		bpf_map_update/delete_elem() helpers and user space sys_bpf commands
+ *		cancel and free the timer in the given map element.
+ *		The map can contain timers that invoke callback_fn-s from different
+ *		programs. The same callback_fn can serve different timers from
+ *		different maps if key/value layout matches across maps.
+ *		Every bpf_timer_start() can have different callback_fn.
+ *
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Cancel the timer and wait for callback_fn to finish if it was running.
+ *	Return
+ *		0 if the timer was not active.
+ *		1 if the timer was active.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
+ *		which would have led to a deadlock otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4951,6 +4998,9 @@ union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6077,6 +6127,11 @@ struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+	__u64 :64;
+} __attribute__((aligned(8)));
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 2/8] bpf: Add map side support for bpf timers.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-25 19:46   ` Yonghong Song
  2021-06-24  2:25 ` [PATCH v3 bpf-next 3/8] bpf: Remember BTF of inner maps Alexei Starovoitov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Restrict bpf timers to array, hash (both preallocated and kmalloced), and
lru map types. The per-cpu maps with timers don't make sense, since 'struct
bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
the number of timers depends on number of possible cpus and timers would not be
accessible from all cpus. lpm map support can be added in the future.
The timers in inner maps are supported.

The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
bpf_timer in a given map element.

Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
that map element indeed contains 'struct bpf_timer'.

Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
map element data is reused in preallocated htab and lru maps.

Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
map element. There could be one of each, but not more than one. Due to 'one
bpf_timer in one element' restriction do not support timers in global data,
since global data is a map of single element, but from bpf program side it's
seen as many global variables and restriction of single global timer would be
odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
timers, since user space could have corrupted mmap element and crashed the
kernel. The maps with timers cannot be readonly. Due to these restrictions
search for bpf_timer in datasec BTF in case it was placed in the global data to
report clear error.

The previous patch allowed 'struct bpf_timer' as a first field in a map
element only. Relax this restriction.

Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
the timer when lru map deletes an element as a part of it eviction algorithm.

Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
The timer operation are done through helpers only.
This is similar to 'struct bpf_spin_lock'.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        | 44 ++++++++++++-----
 include/linux/btf.h        |  1 +
 kernel/bpf/arraymap.c      | 22 +++++++++
 kernel/bpf/btf.c           | 77 ++++++++++++++++++++++++------
 kernel/bpf/hashtab.c       | 96 +++++++++++++++++++++++++++++++++-----
 kernel/bpf/local_storage.c |  4 +-
 kernel/bpf/map_in_map.c    |  1 +
 kernel/bpf/syscall.c       | 21 +++++++--
 kernel/bpf/verifier.c      | 30 ++++++++++--
 9 files changed, 251 insertions(+), 45 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 72da9d4d070c..90e6c6d1404c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -198,24 +198,46 @@ static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 	return map->spin_lock_off >= 0;
 }
 
-static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
+static inline bool map_value_has_timer(const struct bpf_map *map)
 {
-	if (likely(!map_value_has_spin_lock(map)))
-		return;
-	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
-		(struct bpf_spin_lock){};
+	return map->timer_off >= 0;
 }
 
-/* copy everything but bpf_spin_lock */
+static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
+{
+	if (unlikely(map_value_has_spin_lock(map)))
+		*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
+			(struct bpf_spin_lock){};
+	if (unlikely(map_value_has_timer(map)))
+		*(struct bpf_timer *)(dst + map->timer_off) =
+			(struct bpf_timer){};
+}
+
+/* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
+	u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;
+
 	if (unlikely(map_value_has_spin_lock(map))) {
-		u32 off = map->spin_lock_off;
+		s_off = map->spin_lock_off;
+		s_sz = sizeof(struct bpf_spin_lock);
+	} else if (unlikely(map_value_has_timer(map))) {
+		t_off = map->timer_off;
+		t_sz = sizeof(struct bpf_timer);
+	}
 
-		memcpy(dst, src, off);
-		memcpy(dst + off + sizeof(struct bpf_spin_lock),
-		       src + off + sizeof(struct bpf_spin_lock),
-		       map->value_size - off - sizeof(struct bpf_spin_lock));
+	if (unlikely(s_sz || t_sz)) {
+		if (s_off < t_off || !s_sz) {
+			swap(s_off, t_off);
+			swap(s_sz, t_sz);
+		}
+		memcpy(dst, src, t_off);
+		memcpy(dst + t_off + t_sz,
+		       src + t_off + t_sz,
+		       s_off - t_off - t_sz);
+		memcpy(dst + s_off + s_sz,
+		       src + s_off + s_sz,
+		       map->value_size - s_off - s_sz);
 	} else {
 		memcpy(dst, src, map->value_size);
 	}
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 94a0c976c90f..214fde93214b 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -99,6 +99,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
+int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 3c4105603f9d..7f07bb7adf63 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -287,6 +287,12 @@ static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key
 	return 0;
 }
 
+static void check_and_free_timer_in_array(struct bpf_array *arr, void *val)
+{
+	if (unlikely(map_value_has_timer(&arr->map)))
+		bpf_timer_cancel_and_free(val + arr->map.timer_off);
+}
+
 /* Called from syscall or from eBPF program */
 static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 				 u64 map_flags)
@@ -321,6 +327,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 			copy_map_value_locked(map, val, value, false);
 		else
 			copy_map_value(map, val, value);
+		check_and_free_timer_in_array(array, val);
 	}
 	return 0;
 }
@@ -374,6 +381,19 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
 	return (void *)round_down((unsigned long)array, PAGE_SIZE);
 }
 
+static void array_map_free_timers(struct bpf_map *map)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	int i;
+
+	if (likely(!map_value_has_timer(map)))
+		return;
+
+	for (i = 0; i < array->map.max_entries; i++)
+		bpf_timer_cancel_and_free(array->value + array->elem_size * i +
+					  map->timer_off);
+}
+
 /* Called when map->refcnt goes to zero, either from workqueue or from syscall */
 static void array_map_free(struct bpf_map *map)
 {
@@ -382,6 +402,7 @@ static void array_map_free(struct bpf_map *map)
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
+	array_map_free_timers(map);
 	if (array->map.map_flags & BPF_F_MMAPABLE)
 		bpf_map_area_free(array_map_vmalloc_addr(array));
 	else
@@ -668,6 +689,7 @@ const struct bpf_map_ops array_map_ops = {
 	.map_alloc = array_map_alloc,
 	.map_free = array_map_free,
 	.map_get_next_key = array_map_get_next_key,
+	.map_release_uref = array_map_free_timers,
 	.map_lookup_elem = array_map_lookup_elem,
 	.map_update_elem = array_map_update_elem,
 	.map_delete_elem = array_map_delete_elem,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index cb4b72997d9b..7780131f710e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3046,43 +3046,92 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
-/* find 'struct bpf_spin_lock' in map value.
- * return >= 0 offset if found
- * and < 0 in case of error
- */
-int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
+				 const char *name, int sz, int align)
 {
 	const struct btf_member *member;
 	u32 i, off = -ENOENT;
 
-	if (!__btf_type_is_struct(t))
-		return -EINVAL;
-
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
 								    member->type);
 		if (!__btf_type_is_struct(member_type))
 			continue;
-		if (member_type->size != sizeof(struct bpf_spin_lock))
+		if (member_type->size != sz)
 			continue;
-		if (strcmp(__btf_name_by_offset(btf, member_type->name_off),
-			   "bpf_spin_lock"))
+		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
 			continue;
 		if (off != -ENOENT)
-			/* only one 'struct bpf_spin_lock' is allowed */
+			/* only one such field is allowed */
 			return -E2BIG;
 		off = btf_member_bit_offset(t, member);
 		if (off % 8)
 			/* valid C code cannot generate such BTF */
 			return -EINVAL;
 		off /= 8;
-		if (off % __alignof__(struct bpf_spin_lock))
-			/* valid struct bpf_spin_lock will be 4 byte aligned */
+		if (off % align)
+			return -EINVAL;
+	}
+	return off;
+}
+
+static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
+				const char *name, int sz, int align)
+{
+	const struct btf_var_secinfo *vsi;
+	u32 i, off = -ENOENT;
+
+	for_each_vsi(i, t, vsi) {
+		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
+		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
+
+		if (!__btf_type_is_struct(var_type))
+			continue;
+		if (var_type->size != sz)
+			continue;
+		if (vsi->size != sz)
+			continue;
+		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
+			continue;
+		if (off != -ENOENT)
+			/* only one such field is allowed */
+			return -E2BIG;
+		off = vsi->offset;
+		if (off % align)
 			return -EINVAL;
 	}
 	return off;
 }
 
+static int btf_find_field(const struct btf *btf, const struct btf_type *t,
+			  const char *name, int sz, int align)
+{
+
+	if (__btf_type_is_struct(t))
+		return btf_find_struct_field(btf, t, name, sz, align);
+	else if (btf_type_is_datasec(t))
+		return btf_find_datasec_var(btf, t, name, sz, align);
+	return -EINVAL;
+}
+
+/* find 'struct bpf_spin_lock' in map value.
+ * return >= 0 offset if found
+ * and < 0 in case of error
+ */
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+{
+	return btf_find_field(btf, t, "bpf_spin_lock",
+			      sizeof(struct bpf_spin_lock),
+			      __alignof__(struct bpf_spin_lock));
+}
+
+int btf_find_timer(const struct btf *btf, const struct btf_type *t)
+{
+	return btf_find_field(btf, t, "bpf_timer",
+			      sizeof(struct bpf_timer),
+			      __alignof__(struct bpf_timer));
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6f6681b07364..e85a5839ffe8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -228,6 +228,28 @@ static struct htab_elem *get_htab_elem(struct bpf_htab *htab, int i)
 	return (struct htab_elem *) (htab->elems + i * (u64)htab->elem_size);
 }
 
+static void htab_free_prealloced_timers(struct bpf_htab *htab)
+{
+	u32 num_entries = htab->map.max_entries;
+	int i;
+
+	if (likely(!map_value_has_timer(&htab->map)))
+		return;
+	if (!htab_is_percpu(htab) && !htab_is_lru(htab))
+		/* htab has extra_elems */
+		num_entries += num_possible_cpus();
+
+	for (i = 0; i < num_entries; i++) {
+		struct htab_elem *elem;
+
+		elem = get_htab_elem(htab, i);
+		bpf_timer_cancel_and_free(elem->key +
+					  round_up(htab->map.key_size, 8) +
+					  htab->map.timer_off);
+		cond_resched();
+	}
+}
+
 static void htab_free_elems(struct bpf_htab *htab)
 {
 	int i;
@@ -244,6 +266,7 @@ static void htab_free_elems(struct bpf_htab *htab)
 		cond_resched();
 	}
 free_elems:
+	htab_free_prealloced_timers(htab);
 	bpf_map_area_free(htab->elems);
 }
 
@@ -265,8 +288,11 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
 	struct htab_elem *l;
 
 	if (node) {
+		u32 key_size = htab->map.key_size;
 		l = container_of(node, struct htab_elem, lru_node);
-		memcpy(l->key, key, htab->map.key_size);
+		memcpy(l->key, key, key_size);
+		check_and_init_map_value(&htab->map,
+					 l->key + round_up(key_size, 8));
 		return l;
 	}
 
@@ -694,6 +720,14 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
 	return insn - insn_buf;
 }
 
+static void check_and_free_timer(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	if (unlikely(map_value_has_timer(&htab->map)))
+		bpf_timer_cancel_and_free(elem->key +
+					  round_up(htab->map.key_size, 8) +
+					  htab->map.timer_off);
+}
+
 /* It is called from the bpf_lru_list when the LRU needs to delete
  * older elements from the htab.
  */
@@ -718,6 +752,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		if (l == tgt_l) {
 			hlist_nulls_del_rcu(&l->hash_node);
+			check_and_free_timer(htab, l);
 			break;
 		}
 
@@ -789,6 +824,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
+	check_and_free_timer(htab, l);
 	kfree(l);
 }
 
@@ -816,6 +852,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	htab_put_fd_value(htab, l);
 
 	if (htab_is_prealloc(htab)) {
+		check_and_free_timer(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		atomic_dec(&htab->count);
@@ -919,8 +956,8 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
 		}
-		check_and_init_map_lock(&htab->map,
-					l_new->key + round_up(key_size, 8));
+		check_and_init_map_value(&htab->map,
+					 l_new->key + round_up(key_size, 8));
 	}
 
 	memcpy(l_new->key, key, key_size);
@@ -1060,6 +1097,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		hlist_nulls_del_rcu(&l_old->hash_node);
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
+		else
+			check_and_free_timer(htab, l_old);
 	}
 	ret = 0;
 err:
@@ -1067,6 +1106,12 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	return ret;
 }
 
+static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	check_and_free_timer(htab, elem);
+	bpf_lru_push_free(&htab->lru, &elem->lru_node);
+}
+
 static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 				    u64 map_flags)
 {
@@ -1099,7 +1144,8 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	l_new = prealloc_lru_pop(htab, key, hash);
 	if (!l_new)
 		return -ENOMEM;
-	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
+	copy_map_value(&htab->map,
+		       l_new->key + round_up(map->key_size, 8), value);
 
 	ret = htab_lock_bucket(htab, b, hash, &flags);
 	if (ret)
@@ -1125,9 +1171,9 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	htab_unlock_bucket(htab, b, hash, flags);
 
 	if (ret)
-		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
+		htab_lru_push_free(htab, l_new);
 	else if (l_old)
-		bpf_lru_push_free(&htab->lru, &l_old->lru_node);
+		htab_lru_push_free(htab, l_old);
 
 	return ret;
 }
@@ -1332,7 +1378,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	htab_unlock_bucket(htab, b, hash, flags);
 	if (l)
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 	return ret;
 }
 
@@ -1352,6 +1398,32 @@ static void delete_all_elements(struct bpf_htab *htab)
 	}
 }
 
+static void htab_free_malloced_timers(struct bpf_htab *htab)
+{
+	int i;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_nulls_head *head = select_bucket(htab, i);
+		struct hlist_nulls_node *n;
+		struct htab_elem *l;
+
+		hlist_nulls_for_each_entry(l, n, head, hash_node)
+			check_and_free_timer(htab, l);
+	}
+}
+
+static void htab_map_free_timers(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	if (likely(!map_value_has_timer(&htab->map)))
+		return;
+	if (!htab_is_prealloc(htab))
+		htab_free_malloced_timers(htab);
+	else
+		htab_free_prealloced_timers(htab);
+}
+
 /* Called when map->refcnt goes to zero, either from workqueue or from syscall */
 static void htab_map_free(struct bpf_map *map)
 {
@@ -1449,7 +1521,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 			else
 				copy_map_value(map, value, l->key +
 					       roundup_key_size);
-			check_and_init_map_lock(map, value);
+			check_and_init_map_value(map, value);
 		}
 
 		hlist_nulls_del_rcu(&l->hash_node);
@@ -1460,7 +1532,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 	htab_unlock_bucket(htab, b, hash, bflags);
 
 	if (is_lru_map && l)
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 
 	return ret;
 }
@@ -1638,7 +1710,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 						      true);
 			else
 				copy_map_value(map, dst_val, value);
-			check_and_init_map_lock(map, dst_val);
+			check_and_init_map_value(map, dst_val);
 		}
 		if (do_delete) {
 			hlist_nulls_del_rcu(&l->hash_node);
@@ -1665,7 +1737,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	while (node_to_free) {
 		l = node_to_free;
 		node_to_free = node_to_free->batch_flink;
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 	}
 
 next_batch:
@@ -2027,6 +2099,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
+	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem,
 	.map_update_elem = htab_map_update_elem,
@@ -2048,6 +2121,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
+	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem,
 	.map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys,
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index bd11db9774c3..95d70a08325d 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -173,7 +173,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key,
 		return -ENOMEM;
 
 	memcpy(&new->data[0], value, map->value_size);
-	check_and_init_map_lock(map, new->data);
+	check_and_init_map_value(map, new->data);
 
 	new = xchg(&storage->buf, new);
 	kfree_rcu(new, rcu);
@@ -509,7 +509,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
 						    map->numa_node);
 		if (!storage->buf)
 			goto enomem;
-		check_and_init_map_lock(map, storage->buf->data);
+		check_and_init_map_value(map, storage->buf->data);
 	} else {
 		storage->percpu_buf = bpf_map_alloc_percpu(map, size, 8, gfp);
 		if (!storage->percpu_buf)
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 39ab0b68cade..2f961b941159 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -50,6 +50,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->map_flags = inner_map->map_flags;
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
+	inner_map_meta->timer_off = inner_map->timer_off;
 
 	/* Misc members not needed in bpf_map_meta_equal() check. */
 	inner_map_meta->ops = inner_map->ops;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e343f158e556..29c1d2f8d94c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -260,8 +260,8 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 				copy_map_value_locked(map, value, ptr, true);
 			else
 				copy_map_value(map, value, ptr);
-			/* mask lock, since value wasn't zero inited */
-			check_and_init_map_lock(map, value);
+			/* mask lock and timer, since value wasn't zero inited */
+			check_and_init_map_value(map, value);
 		}
 		rcu_read_unlock();
 	}
@@ -623,7 +623,8 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	struct bpf_map *map = filp->private_data;
 	int err;
 
-	if (!map->ops->map_mmap || map_value_has_spin_lock(map))
+	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
+	    map_value_has_timer(map))
 		return -ENOTSUPP;
 
 	if (!(vma->vm_flags & VM_SHARED))
@@ -793,6 +794,16 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		}
 	}
 
+	map->timer_off = btf_find_timer(btf, value_type);
+	if (map_value_has_timer(map)) {
+		if (map->map_flags & BPF_F_RDONLY_PROG)
+			return -EACCES;
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY)
+			return -EOPNOTSUPP;
+	}
+
 	if (map->ops->map_check_btf)
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
 
@@ -844,6 +855,7 @@ static int map_create(union bpf_attr *attr)
 	mutex_init(&map->freeze_mutex);
 
 	map->spin_lock_off = -EINVAL;
+	map->timer_off = -EINVAL;
 	if (attr->btf_key_type_id || attr->btf_value_type_id ||
 	    /* Even the map's value is a kernel's struct,
 	     * the bpf_prog.o must have BTF to begin with
@@ -1591,7 +1603,8 @@ static int map_freeze(const union bpf_attr *attr)
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
-	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
+	    map_value_has_timer(map)) {
 		fdput(f);
 		return -ENOTSUPP;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index fa15bd30e331..7868481af61c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 	}
+	if (map_value_has_timer(map)) {
+		u32 t = map->timer_off;
+
+		if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
+		     t < reg->umax_value + off + size) {
+			verbose(env, "bpf_timer cannot be accessed directly by load/store\n");
+			return -EACCES;
+		}
+	}
 	return err;
 }
 
@@ -4675,9 +4684,24 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 			map->name);
 		return -EINVAL;
 	}
-	if (val) {
-		/* This restriction will be removed in the next patch */
-		verbose(env, "bpf_timer field can only be first in the map value element\n");
+	if (!map_value_has_timer(map)) {
+		if (map->timer_off == -E2BIG)
+			verbose(env,
+				"map '%s' has more than one 'struct bpf_timer'\n",
+				map->name);
+		else if (map->timer_off == -ENOENT)
+			verbose(env,
+				"map '%s' doesn't have 'struct bpf_timer'\n",
+				map->name);
+		else
+			verbose(env,
+				"map '%s' is not a struct type or bpf_timer is mangled\n",
+				map->name);
+		return -EINVAL;
+	}
+	if (map->timer_off != val + reg->off) {
+		verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
+			val + reg->off, map->timer_off);
 		return -EINVAL;
 	}
 	if (meta->map_ptr) {
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 3/8] bpf: Remember BTF of inner maps.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 2/8] bpf: Add map side support for " Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-29  1:45   ` Yonghong Song
  2021-06-24  2:25 ` [PATCH v3 bpf-next 4/8] bpf: Relax verifier recursion check Alexei Starovoitov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

BTF is required for 'struct bpf_timer' to be recognized inside map value.
The bpf timers are supported inside inner maps.
Remember 'struct btf *' in inner_map_meta to make it available
to the verifier in the sequence:

struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
if (inner_map)
    timer = bpf_map_lookup_elem(&inner_map, ...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/map_in_map.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 2f961b941159..d737cff90922 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -3,6 +3,7 @@
  */
 #include <linux/slab.h>
 #include <linux/bpf.h>
+#include <linux/btf.h>
 
 #include "map_in_map.h"
 
@@ -51,6 +52,10 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
 	inner_map_meta->timer_off = inner_map->timer_off;
+	if (inner_map->btf) {
+		btf_get(inner_map->btf);
+		inner_map_meta->btf = inner_map->btf;
+	}
 
 	/* Misc members not needed in bpf_map_meta_equal() check. */
 	inner_map_meta->ops = inner_map->ops;
@@ -66,6 +71,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 
 void bpf_map_meta_free(struct bpf_map *map_meta)
 {
+	btf_put(map_meta->btf);
 	kfree(map_meta);
 }
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 4/8] bpf: Relax verifier recursion check.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 3/8] bpf: Remember BTF of inner maps Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 5/8] bpf: Implement verifier support for validation of async callbacks Alexei Starovoitov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

In the following bpf subprogram:
static int timer_cb(void *map, void *key, void *value)
{
    bpf_timer_start(.., timer_cb, ..);
}

the 'timer_cb' is a pointer to a function.
ld_imm64 insn is used to carry this pointer.
bpf_pseudo_func() returns true for such ld_imm64 insn.

Unlike bpf_for_each_map_elem() the bpf_timer_start() callback is asynchronous.
Relax control flow check to allow such "recursion" that is seen as an infinite
loop by check_cfg(). The distinction between bpf_for_each_map_elem() the
bpf_timer_start() is done in the follow up patch.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7868481af61c..c88caec4ad28 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9387,8 +9387,12 @@ static int visit_func_call_insn(int t, int insn_cnt,
 		init_explored_state(env, t + 1);
 	if (visit_callee) {
 		init_explored_state(env, t);
-		ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
-				env, false);
+		ret = push_insn(t, t + insns[t].imm + 1, BRANCH, env,
+				/* It's ok to allow recursion from CFG point of
+				 * view. __check_func_call() will do the actual
+				 * check.
+				 */
+				bpf_pseudo_func(insns + t));
 	}
 	return ret;
 }
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 5/8] bpf: Implement verifier support for validation of async callbacks.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 4/8] bpf: Relax verifier recursion check Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 6/8] bpf: Teach stack depth check about " Alexei Starovoitov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_for_each_map_elem() and bpf_timer_start() helpers are relying on
PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
and pass them into the helpers as function callbacks.
In case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and new frame in __check_func_call().
bpf_timer_start() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as special kind
of jump by pushing verifier state into stack and let pop_stack() process it.

Special care needs to be taken during state pruning.
The call insn doing bpf_timer_start has to be a prune_point. Otherwise
short timer callbacks might not have prune points in front of
bpf_timer_start() which means is_state_visited() will be called after
this call insn is processed in __check_func_call(). Which means that
another async_cb state will be pushed to be walked later and the verifier
will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch the
infinite loop detection will trigger false positive. To recognize
this case mark such states as in_async_callback_fn.
To distinguish infinite loop in async callback vs the same callback called
with different arguments for different map and timer add async_entry_cnt
to bpf_func_state.

Enforce return zero from async callbacks.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_verifier.h |   9 ++-
 kernel/bpf/helpers.c         |   8 +--
 kernel/bpf/verifier.c        | 123 ++++++++++++++++++++++++++++++++++-
 3 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index e774ecc1cd1f..ce30c4ceaa6d 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -201,12 +201,19 @@ struct bpf_func_state {
 	 * zero == main subprog
 	 */
 	u32 subprogno;
+	/* Every bpf_timer_start will increment async_entry_cnt.
+	 * It's used to distinguish:
+	 * void foo(void) { for(;;); }
+	 * void foo(void) { bpf_timer_start(,foo,); }
+	 */
+	u32 async_entry_cnt;
+	bool in_callback_fn;
+	bool in_async_callback_fn;
 
 	/* The following fields should be last. See copy_func_state() */
 	int acquired_refs;
 	struct bpf_reference_state *refs;
 	int allocated_stack;
-	bool in_callback_fn;
 	struct bpf_stack_state *stack;
 };
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 584a37a1b974..cd5b22ab579c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1040,7 +1040,6 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 	void *callback_fn;
 	void *key;
 	u32 idx;
-	int ret;
 
 	____bpf_spin_lock(&timer->lock);
 	/* callback_fn and prog need to match. They're updated together
@@ -1073,10 +1072,9 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 		key = value - round_up(map->key_size, 8);
 	}
 
-	ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
-					 (u64)(long)key,
-					 (u64)(long)value, 0, 0);
-	WARN_ON(ret != 0); /* Next patch moves this check into the verifier */
+	BPF_CAST_CALL(callback_fn)((u64)(long)map, (u64)(long)key,
+				   (u64)(long)value, 0, 0);
+	/* The verifier checked that return value is zero. */
 	bpf_prog_put(prog);
 
 	this_cpu_write(hrtimer_running, NULL);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c88caec4ad28..503338093184 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -734,6 +734,10 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 			if (state->refs[i].id)
 				verbose(env, ",%d", state->refs[i].id);
 	}
+	if (state->in_callback_fn)
+		verbose(env, " cb");
+	if (state->in_async_callback_fn)
+		verbose(env, " async_cb");
 	verbose(env, "\n");
 }
 
@@ -1522,6 +1526,54 @@ static void init_func_state(struct bpf_verifier_env *env,
 	init_reg_state(env, state);
 }
 
+/* Similar to push_stack(), but for async callbacks */
+static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
+						int insn_idx, int prev_insn_idx,
+						int subprog)
+{
+	struct bpf_verifier_stack_elem *elem;
+	struct bpf_func_state *frame;
+
+	elem = kzalloc(sizeof(struct bpf_verifier_stack_elem), GFP_KERNEL);
+	if (!elem)
+		goto err;
+
+	elem->insn_idx = insn_idx;
+	elem->prev_insn_idx = prev_insn_idx;
+	elem->next = env->head;
+	elem->log_pos = env->log.len_used;
+	env->head = elem;
+	env->stack_size++;
+	if (env->stack_size > BPF_COMPLEXITY_LIMIT_JMP_SEQ) {
+		verbose(env,
+			"The sequence of %d jumps is too complex for async cb.\n",
+			env->stack_size);
+		goto err;
+	}
+	/* Unlike push_stack() do not copy_verifier_state().
+	 * The caller state doesn't matter.
+	 * This is async callback. It starts in a fresh stack.
+	 * Initialize it similar to do_check_common().
+	 */
+	elem->st.branches = 1;
+	frame = kzalloc(sizeof(*frame), GFP_KERNEL);
+	if (!frame)
+		goto err;
+	init_func_state(env, frame,
+			BPF_MAIN_FUNC /* callsite */,
+			0 /* frameno within this callchain */,
+			subprog /* subprog number within this prog */);
+	elem->st.frame[0] = frame;
+	return &elem->st;
+err:
+	free_verifier_state(env->cur_state, true);
+	env->cur_state = NULL;
+	/* pop all elements and return */
+	while (!pop_stack(env, NULL, NULL, false));
+	return NULL;
+}
+
+
 enum reg_arg_type {
 	SRC_OP,		/* register is used as source operand */
 	DST_OP,		/* register is used as destination operand */
@@ -5676,6 +5728,30 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		}
 	}
 
+	if (insn->code == (BPF_JMP | BPF_CALL) &&
+	    insn->imm == BPF_FUNC_timer_start) {
+		struct bpf_verifier_state *async_cb;
+
+		/* there is no real recursion here. timer callbacks are async */
+		async_cb = push_async_cb(env, env->subprog_info[subprog].start,
+					 *insn_idx, subprog);
+		if (!async_cb)
+			return -EFAULT;
+		callee = async_cb->frame[0];
+		callee->async_entry_cnt = caller->async_entry_cnt + 1;
+
+		/* Convert bpf_timer_start() args into timer callback args */
+		err = set_callee_state_cb(env, caller, callee, *insn_idx);
+		if (err)
+			return err;
+
+		clear_caller_saved_regs(env, caller->regs);
+		mark_reg_unknown(env, caller->regs, BPF_REG_0);
+		caller->regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG;
+		/* continue with next insn after call */
+		return 0;
+	}
+
 	callee = kzalloc(sizeof(*callee), GFP_KERNEL);
 	if (!callee)
 		return -ENOMEM;
@@ -5828,6 +5904,7 @@ static int set_timer_start_callback_state(struct bpf_verifier_env *env,
 	/* unused */
 	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
 	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_async_callback_fn = true;
 	return 0;
 }
 
@@ -9148,7 +9225,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	struct tnum range = tnum_range(0, 1);
 	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
 	int err;
-	const bool is_subprog = env->cur_state->frame[0]->subprogno;
+	struct bpf_func_state *frame = env->cur_state->frame[0];
+	const bool is_subprog = frame->subprogno;
 
 	/* LSM and struct_ops func-ptr's return type could be "void" */
 	if (!is_subprog &&
@@ -9173,6 +9251,22 @@ static int check_return_code(struct bpf_verifier_env *env)
 	}
 
 	reg = cur_regs(env) + BPF_REG_0;
+
+	if (frame->in_async_callback_fn) {
+		/* enforce return zero from async callbacks like timer */
+		if (reg->type != SCALAR_VALUE) {
+			verbose(env, "In async callback the register R0 is not a known value (%s)\n",
+				reg_type_str[reg->type]);
+			return -EINVAL;
+		}
+
+		if (!tnum_in(tnum_const(0), reg->var_off)) {
+			verbose_invalid_scalar(env, reg, &range, "async callback", "R0");
+			return -EINVAL;
+		}
+		return 0;
+	}
+
 	if (is_subprog) {
 		if (reg->type != SCALAR_VALUE) {
 			verbose(env, "At subprogram exit the register R0 is not a scalar value (%s)\n",
@@ -9420,6 +9514,13 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
 		return DONE_EXPLORING;
 
 	case BPF_CALL:
+		if (insns[t].imm == BPF_FUNC_timer_start)
+			/* Mark this call insn to trigger is_state_visited() check
+			 * before call itself is processed by __check_func_call().
+			 * Otherwise new async state will be pushed for further
+			 * exploration.
+			 */
+			init_explored_state(env, t);
 		return visit_func_call_insn(t, insn_cnt, insns, env,
 					    insns[t].src_reg == BPF_PSEUDO_CALL);
 
@@ -10427,9 +10528,25 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 		states_cnt++;
 		if (sl->state.insn_idx != insn_idx)
 			goto next;
+
 		if (sl->state.branches) {
-			if (states_maybe_looping(&sl->state, cur) &&
-			    states_equal(env, &sl->state, cur)) {
+			struct bpf_func_state *frame = sl->state.frame[sl->state.curframe];
+
+			if (frame->in_async_callback_fn &&
+			    frame->async_entry_cnt != cur->frame[cur->curframe]->async_entry_cnt) {
+				/* Different async_entry_cnt means that the verifier is
+				 * processing another entry into async callback.
+				 * Seeing the same state is not an indication of infinite
+				 * loop or infinite recursion.
+				 * But finding the same state doesn't mean that it's safe
+				 * to stop processing the current state. The previous state
+				 * hasn't yet reached bpf_exit, since state.branches > 0.
+				 * Checking in_async_callback_fn alone is not enough either.
+				 * Since the verifier still needs to catch infinite loops
+				 * inside async callbacks.
+				 */
+			} else if (states_maybe_looping(&sl->state, cur) &&
+				   states_equal(env, &sl->state, cur)) {
 				verbose_linfo(env, insn_idx, "; ");
 				verbose(env, "infinite loop detected at insn %d\n", insn_idx);
 				return -EINVAL;
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 6/8] bpf: Teach stack depth check about async callbacks.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 5/8] bpf: Implement verifier support for validation of async callbacks Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 7/8] selftests/bpf: Add bpf_timer test Alexei Starovoitov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Teach max stack depth checking algorithm about async callbacks
that don't increase bpf program stack size.
Also add sanity check that bpf_tail_call didn't sneak into async cb.
It's impossible, since PTR_TO_CTX is not available in async cb,
hence the program cannot contain bpf_tail_call(ctx,...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/verifier.c        | 18 +++++++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ce30c4ceaa6d..5001cdcc3677 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -399,6 +399,7 @@ struct bpf_subprog_info {
 	bool has_tail_call;
 	bool tail_call_reachable;
 	bool has_ld_abs;
+	bool is_async_cb;
 };
 
 /* single container for all structs
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 503338093184..fe5ba8a46ce9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3704,6 +3704,8 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 continue_func:
 	subprog_end = subprog[idx + 1].start;
 	for (; i < subprog_end; i++) {
+		int next_insn;
+
 		if (!bpf_pseudo_call(insn + i) && !bpf_pseudo_func(insn + i))
 			continue;
 		/* remember insn and function to return to */
@@ -3711,13 +3713,22 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 		ret_prog[frame] = idx;
 
 		/* find the callee */
-		i = i + insn[i].imm + 1;
-		idx = find_subprog(env, i);
+		next_insn = i + insn[i].imm + 1;
+		idx = find_subprog(env, next_insn);
 		if (idx < 0) {
 			WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
-				  i);
+				  next_insn);
 			return -EFAULT;
 		}
+		if (subprog[idx].is_async_cb) {
+			if (subprog[idx].has_tail_call) {
+				verbose(env, "verifier bug. subprog has tail_call and async cb\n");
+				return -EFAULT;
+			}
+			 /* async callbacks don't increase bpf prog stack size */
+			continue;
+		}
+		i = next_insn;
 
 		if (subprog[idx].has_tail_call)
 			tail_call_reachable = true;
@@ -5733,6 +5744,7 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		struct bpf_verifier_state *async_cb;
 
 		/* there is no real recursion here. timer callbacks are async */
+		env->subprog_info[subprog].is_async_cb = true;
 		async_cb = push_async_cb(env, env->subprog_info[subprog].start,
 					 *insn_idx, subprog);
 		if (!async_cb)
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 7/8] selftests/bpf: Add bpf_timer test.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 6/8] bpf: Teach stack depth check about " Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-24  2:25 ` [PATCH v3 bpf-next 8/8] selftests/bpf: Add a test with bpf_timer in inner map Alexei Starovoitov
  2021-06-24 11:27 ` [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Toke Høiland-Jørgensen
  8 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add bpf_timer test that creates timers in preallocated and
non-preallocated hash, in array and in lru maps.
Let array timer expire once and then re-arm it for 35 seconds.
Arm lru timer into the same callback.
Then arm and re-arm hash timers 10 times each.
At the last invocation of prealloc hash timer cancel the array timer.
Force timer free via LRU eviction and direct bpf_map_delete_elem.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/timer.c  |  55 ++++
 tools/testing/selftests/bpf/progs/timer.c     | 293 ++++++++++++++++++
 2 files changed, 348 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
new file mode 100644
index 000000000000..25f40e1b9967
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "timer.skel.h"
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	__u32 duration = 0, retval;
+
+	err = timer__attach(timer_skel);
+	if (!ASSERT_OK(err, "timer_attach"))
+		return err;
+
+	ASSERT_EQ(timer_skel->data->callback_check, 52, "callback_check1");
+	ASSERT_EQ(timer_skel->data->callback2_check, 52, "callback2_check1");
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, NULL, &retval, &duration);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(retval, 0, "test_run");
+	timer__detach(timer_skel);
+
+	usleep(50); /* 10 usecs should be enough, but give it extra */
+	/* check that timer_cb1() was executed 10+10 times */
+	ASSERT_EQ(timer_skel->data->callback_check, 42, "callback_check2");
+	ASSERT_EQ(timer_skel->data->callback2_check, 42, "callback2_check2");
+
+	/* check that timer_cb2() was executed twice */
+	ASSERT_EQ(timer_skel->bss->bss_data, 10, "bss_data");
+
+	/* check that there were no errors in timer execution */
+	ASSERT_EQ(timer_skel->bss->err, 0, "err");
+
+	/* check that code paths completed */
+	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
+
+	return 0;
+}
+
+void test_timer(void)
+{
+	struct timer *timer_skel = NULL;
+	int err;
+
+	timer_skel = timer__open_and_load();
+	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
+		goto cleanup;
+
+	err = timer(timer_skel);
+	ASSERT_OK(err, "timer");
+cleanup:
+	timer__destroy(timer_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
new file mode 100644
index 000000000000..5d96a12c91a2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <linux/bpf.h>
+#include <time.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+struct hmap_elem {
+	int counter;
+	struct bpf_timer timer;
+	struct bpf_spin_lock lock; /* unused */
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, struct hmap_elem);
+} hmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, struct hmap_elem);
+} hmap_malloc SEC(".maps");
+
+struct elem {
+	struct bpf_timer t;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, struct elem);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__uint(max_entries, 4);
+	__type(key, int);
+	__type(value, struct elem);
+} lru SEC(".maps");
+
+__u64 bss_data;
+__u64 err;
+__u64 ok;
+__u64 callback_check = 52;
+__u64 callback2_check = 52;
+
+#define ARRAY 1
+#define HTAB 2
+#define HTAB_MALLOC 3
+#define LRU 4
+
+/* callback for array and lru timers */
+static int timer_cb1(void *map, int *key, struct bpf_timer *timer)
+{
+	/* increment bss variable twice.
+	 * Once via array timer callback and once via lru timer callback
+	 */
+	bss_data += 5;
+
+	/* *key == 0 - the callback was called for array timer.
+	 * *key == 4 - the callback was called from lru timer.
+	 */
+	if (*key == ARRAY) {
+		struct bpf_timer *lru_timer;
+		int lru_key = LRU;
+
+		/* rearm array timer to be called again in ~35 seconds */
+		if (bpf_timer_start(timer, timer_cb1, 1ull << 35) != 0)
+			err |= 1;
+
+		lru_timer = bpf_map_lookup_elem(&lru, &lru_key);
+		if (!lru_timer)
+			return 0;
+		if (bpf_timer_start(lru_timer, timer_cb1, 0) != 0)
+			err |= 2;
+	} else if (*key == LRU) {
+		int lru_key, i;
+
+		for (i = LRU + 1;
+		     i <= 100  /* for current LRU eviction algorithm this number
+				* should be larger than ~ lru->max_entries * 2
+				*/;
+		     i++) {
+			struct elem init = {};
+
+			/* lru_key cannot be used as loop induction variable
+			 * otherwise the loop will be unbounded.
+			 */
+			lru_key = i;
+
+			/* add more elements into lru map to push out current
+			 * element and force deletion of this timer
+			 */
+			bpf_map_update_elem(map, &lru_key, &init, 0);
+			/* look it up to bump it into active list */
+			bpf_map_lookup_elem(map, &lru_key);
+
+			/* keep adding until *key changes underneath,
+			 * which means that key/timer memory was reused
+			 */
+			if (*key != LRU)
+				break;
+		}
+
+		/* check that the timer was removed */
+		if (bpf_timer_cancel(timer) != -EINVAL)
+			err |= 4;
+		ok |= 1;
+	}
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(test1, int a)
+{
+	struct bpf_timer *arr_timer, *lru_timer;
+	struct elem init = {};
+	int lru_key = LRU;
+	int array_key = ARRAY;
+
+	arr_timer = bpf_map_lookup_elem(&array, &array_key);
+	if (!arr_timer)
+		return 0;
+	bpf_timer_init(arr_timer, CLOCK_MONOTONIC);
+
+	bpf_map_update_elem(&lru, &lru_key, &init, 0);
+	lru_timer = bpf_map_lookup_elem(&lru, &lru_key);
+	if (!lru_timer)
+		return 0;
+	bpf_timer_init(lru_timer, CLOCK_MONOTONIC);
+
+	bpf_timer_start(arr_timer, timer_cb1, 0 /* call timer_cb1 asap */);
+
+	/* init more timers to check that array destruction
+	 * doesn't leak timer memory.
+	 */
+	array_key = 0;
+	arr_timer = bpf_map_lookup_elem(&array, &array_key);
+	if (!arr_timer)
+		return 0;
+	bpf_timer_init(arr_timer, CLOCK_MONOTONIC);
+	return 0;
+}
+
+/* callback for prealloc and non-prealloca hashtab timers */
+static int timer_cb2(void *map, int *key, struct hmap_elem *val)
+{
+	if (*key == HTAB)
+		callback_check--;
+	else
+		callback2_check--;
+	if (val->counter > 0 && --val->counter) {
+		/* re-arm the timer again to execute after 1 usec */
+		bpf_timer_start(&val->timer, timer_cb2, 1000);
+	} else if (*key == HTAB) {
+		struct bpf_timer *arr_timer;
+		int array_key = ARRAY;
+
+		/* cancel arr_timer otherwise bpf_fentry_test1 prog
+		 * will stay alive forever.
+		 */
+		arr_timer = bpf_map_lookup_elem(&array, &array_key);
+		if (!arr_timer)
+			return 0;
+		if (bpf_timer_cancel(arr_timer) != 1)
+			/* bpf_timer_cancel should return 1 to indicate
+			 * that arr_timer was active at this time
+			 */
+			err |= 8;
+
+		/* try to cancel ourself. It shouldn't deadlock. */
+		if (bpf_timer_cancel(&val->timer) != -EDEADLK)
+			err |= 16;
+
+		/* delete this key and this timer anyway.
+		 * It shouldn't deadlock either.
+		 */
+		bpf_map_delete_elem(map, key);
+
+		/* in preallocated hashmap both 'key' and 'val' could have been
+		 * reused to store another map element (like in LRU above),
+		 * but in controlled test environment the below test works.
+		 * It's not a use-after-free. The memory is owned by the map.
+		 */
+		if (bpf_timer_start(&val->timer, timer_cb2, 1000) != -EINVAL)
+			err |= 32;
+		ok |= 2;
+	} else {
+		if (*key != HTAB_MALLOC)
+			err |= 64;
+
+		/* try to cancel ourself. It shouldn't deadlock. */
+		if (bpf_timer_cancel(&val->timer) != -EDEADLK)
+			err |= 128;
+
+		/* delete this key and this timer anyway.
+		 * It shouldn't deadlock either.
+		 */
+		bpf_map_delete_elem(map, key);
+
+		/* in non-preallocated hashmap both 'key' and 'val' are RCU
+		 * protected and still valid though this element was deleted
+		 * from the map. Arm this timer for ~35 seconds. When callback
+		 * finishes the call_rcu will invoke:
+		 * htab_elem_free_rcu
+		 *   check_and_free_timer
+		 *     bpf_timer_cancel_and_free
+		 * to cancel this 35 second sleep and delete the timer for real.
+		 */
+		if (bpf_timer_start(&val->timer, timer_cb2, 1ull << 35) != 0)
+			err |= 256;
+		ok |= 4;
+	}
+	return 0;
+}
+
+int bpf_timer_test(void)
+{
+	struct hmap_elem *val;
+	int key = HTAB, key_malloc = HTAB_MALLOC;
+
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val) {
+		if (bpf_timer_init(&val->timer, CLOCK_BOOTTIME) != 0)
+			err |= 512;
+		bpf_timer_start(&val->timer, timer_cb2, 1000);
+	}
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val) {
+		if (bpf_timer_init(&val->timer, CLOCK_BOOTTIME) != 0)
+			err |= 1024;
+		bpf_timer_start(&val->timer, timer_cb2, 1000);
+	}
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test2")
+int BPF_PROG(test2, int a, int b)
+{
+	struct hmap_elem init = {}, *val;
+	int key = HTAB, key_malloc = HTAB_MALLOC;
+
+	init.counter = 10; /* number of times to trigger timer_cb2 */
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val)
+		bpf_timer_init(&val->timer, CLOCK_BOOTTIME);
+	/* update the same key to free the timer */
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val)
+		bpf_timer_init(&val->timer, CLOCK_BOOTTIME);
+	/* update the same key to free the timer */
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+
+	/* init more timers to check that htab operations
+	 * don't leak timer memory.
+	 */
+	key = 0;
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val)
+		bpf_timer_init(&val->timer, CLOCK_BOOTTIME);
+	bpf_map_delete_elem(&hmap, &key);
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val)
+		bpf_timer_init(&val->timer, CLOCK_BOOTTIME);
+
+	/* and with non-prealloc htab */
+	key_malloc = 0;
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val)
+		bpf_timer_init(&val->timer, CLOCK_BOOTTIME);
+	bpf_map_delete_elem(&hmap_malloc, &key_malloc);
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val)
+		bpf_timer_init(&val->timer, CLOCK_BOOTTIME);
+
+	return bpf_timer_test();
+}
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 bpf-next 8/8] selftests/bpf: Add a test with bpf_timer in inner map.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 7/8] selftests/bpf: Add bpf_timer test Alexei Starovoitov
@ 2021-06-24  2:25 ` Alexei Starovoitov
  2021-06-24 11:27 ` [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Toke Høiland-Jørgensen
  8 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-24  2:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Check that map-in-map supports bpf timers.

Check that indirect "recursion" of timer callbacks works:
timer_cb1() { bpf_timer_start(timer_cb2); }
timer_cb2() { bpf_timer_start(timer_cb1); }

Check that
  bpf_map_release
    htab_free_prealloced_timers
      bpf_timer_cancel_and_free
        hrtimer_cancel
works while timer cb is running.
"while true; do ./test_progs -t timer_mim; done"
is a great stress test. It caught missing timer cancel in htab->extra_elems.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../selftests/bpf/prog_tests/timer_mim.c      | 59 ++++++++++++++
 tools/testing/selftests/bpf/progs/timer_mim.c | 81 +++++++++++++++++++
 2 files changed, 140 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_mim.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_mim.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer_mim.c b/tools/testing/selftests/bpf/prog_tests/timer_mim.c
new file mode 100644
index 000000000000..d54b16a3d9ea
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer_mim.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "timer_mim.skel.h"
+
+static int timer_mim(struct timer_mim *timer_skel)
+{
+	__u32 duration = 0, retval;
+	__u64 cnt1, cnt2;
+	int err, prog_fd, key1 = 1;
+
+	err = timer_mim__attach(timer_skel);
+	if (!ASSERT_OK(err, "timer_attach"))
+		return err;
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, NULL, &retval, &duration);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(retval, 0, "test_run");
+	timer_mim__detach(timer_skel);
+
+	/* check that timer_cb[12] are incrementing 'cnt' */
+	cnt1 = READ_ONCE(timer_skel->bss->cnt);
+	usleep(2);
+	cnt2 = READ_ONCE(timer_skel->bss->cnt);
+	ASSERT_GT(cnt2, cnt1, "cnt");
+
+	ASSERT_EQ(timer_skel->bss->err, 0, "err");
+	/* check that code paths completed */
+	ASSERT_EQ(timer_skel->bss->ok, 1 | 2, "ok");
+
+	close(bpf_map__fd(timer_skel->maps.inner_map));
+	err = bpf_map_delete_elem(bpf_map__fd(timer_skel->maps.outer_arr), &key1);
+	ASSERT_EQ(err, 0, "delete inner map");
+
+	/* check that timer_cb[12] are no longer running */
+	cnt1 = READ_ONCE(timer_skel->bss->cnt);
+	usleep(2);
+	cnt2 = READ_ONCE(timer_skel->bss->cnt);
+	ASSERT_EQ(cnt2, cnt1, "cnt");
+
+	return 0;
+}
+
+void test_timer_mim(void)
+{
+	struct timer_mim *timer_skel = NULL;
+	int err;
+
+	timer_skel = timer_mim__open_and_load();
+	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
+		goto cleanup;
+
+	err = timer_mim(timer_skel);
+	ASSERT_OK(err, "timer_mim");
+cleanup:
+	timer_mim__destroy(timer_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_mim.c b/tools/testing/selftests/bpf/progs/timer_mim.c
new file mode 100644
index 000000000000..4d1d785d8d26
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_mim.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <linux/bpf.h>
+#include <time.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+struct hmap_elem {
+	int pad; /* unused */
+	struct bpf_timer timer;
+};
+
+struct inner_map {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1024);
+	__type(key, int);
+	__type(value, struct hmap_elem);
+} inner_map SEC(".maps");
+
+#define ARRAY_KEY 1
+#define HASH_KEY 1234
+
+struct outer_arr {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(int));
+	__array(values, struct inner_map);
+} outer_arr SEC(".maps") = {
+	.values = { [ARRAY_KEY] = &inner_map },
+};
+
+__u64 err;
+__u64 ok;
+__u64 cnt;
+
+static int timer_cb1(void *map, int *key, struct hmap_elem *val);
+
+static int timer_cb2(void *map, int *key, struct hmap_elem *val)
+{
+	cnt++;
+	if (bpf_timer_start(&val->timer, timer_cb1, 1000) != 0)
+		err |= 1;
+	ok |= 1;
+	return 0;
+}
+
+/* callback for inner hash map */
+static int timer_cb1(void *map, int *key, struct hmap_elem *val)
+{
+	cnt++;
+	if (bpf_timer_start(&val->timer, timer_cb2, 1000) != 0)
+		err |= 2;
+	ok |= 2;
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(test1, int a)
+{
+	struct hmap_elem init = {};
+	struct bpf_map *inner_map;
+	struct hmap_elem *val;
+	int array_key = ARRAY_KEY;
+	int hash_key = HASH_KEY;
+
+	inner_map = bpf_map_lookup_elem(&outer_arr, &array_key);
+	if (!inner_map)
+		return 0;
+
+	bpf_map_update_elem(inner_map, &hash_key, &init, 0);
+	val = bpf_map_lookup_elem(inner_map, &hash_key);
+	if (!val)
+		return 0;
+
+	bpf_timer_init(&val->timer, CLOCK_MONOTONIC);
+	bpf_timer_start(&val->timer, timer_cb1, 0);
+	return 0;
+}
-- 
2.30.2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers.
  2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2021-06-24  2:25 ` [PATCH v3 bpf-next 8/8] selftests/bpf: Add a test with bpf_timer in inner map Alexei Starovoitov
@ 2021-06-24 11:27 ` Toke Høiland-Jørgensen
  8 siblings, 0 replies; 28+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-06-24 11:27 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> From: Alexei Starovoitov <ast@kernel.org>
>
> The first request to support timers in bpf was made in 2013 before sys_bpf syscall
> was added. That use case was periodic sampling. It was address with attaching
> bpf programs to perf_events. Then during XDP development the timers were requested
> to do garbage collection and health checks. They were worked around by implementing
> timers in user space and triggering progs with BPF_PROG_RUN command.
> The user space timers and perf_event+bpf timers are not armed by the bpf program.
> They're done asynchronously vs program execution. The XDP program cannot send a
> packet and arm the timer at the same time. The tracing prog cannot record an
> event and arm the timer right away. This large class of use cases remained
> unaddressed. The jiffy based and hrtimer based timers are essential part of the
> kernel development and with this patch set the hrtimer based timers will be
> available to bpf programs.
>
> TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
> to make sure bpf progs cannot crash the kernel.
>
> v2->v3:
> The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
> called to make sure callback code doesn't disappear when timer is active and
> drop refcnt when timer cb is done. That led to a ton of race conditions between
> callback running and concurrent bpf_timer_init/start/cancel on another cpu,
> and concurrent bpf_map_update/delete_elem, and map destroy.
>
> Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
> timers that bpf prog armed in a link list and canceled them when prog refcnt
> went to zero. The race conditions disappeared, but timers in map-in-map could
> not be supported cleanly, since timers in inner maps have inner map's life time
> and don't match prog's life time.
>
> This v3 approach makes timers to be owned by maps. It allows timers in inner
> maps to be supported from the start. This apporach relies on "user refcnt"
> scheme used in prog_array that stores bpf programs for bpf_tail_call. The
> bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
> callback does decrement the refcnt. The ops->map_release_uref is
> responsible for cancelling the timers and dropping prog refcnt when user space
> reference to a map is dropped. That addressed all the races and simplified
> locking.

Great to see this! I missed v2, but the "owned by map + uref" approach
makes sense.

For the series:

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
@ 2021-06-25  6:25   ` Yonghong Song
  2021-06-25 14:57     ` Alexei Starovoitov
  2021-06-25 16:54   ` Yonghong Song
  2021-07-01  5:40   ` Alexei Starovoitov
  2 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-06-25  6:25 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as a regular field and helpers to operate on it:
> 
> // Initialize the timer.
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, int flags);
> 
> // Arm the timer to call callback_fn static function and set its
> // expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsec);
> 
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
> 
> Here is how BPF program might look like:
> struct map_elem {
>      int counter;
>      struct bpf_timer timer;
> };
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_HASH);
>      __uint(max_entries, 1000);
>      __type(key, int);
>      __type(value, struct map_elem);
> } hmap SEC(".maps");
> 
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
> 
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>      struct map_elem *val;
>      int key = 0;
> 
>      val = bpf_map_lookup_elem(&hmap, &key);
>      if (val) {
>          bpf_timer_init(&val->timer, CLOCK_REALTIME);
>          bpf_timer_start(&val->timer, timer_cb, 1000 /* call timer_cb2 in 1 usec */);
>      }
> }
> 
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patches add necessary safety checks.
> 
> Only programs with CAP_BPF are allowed to use bpf_timer.
> 
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
> 
> The bpf_timer_init() helper is receiving hidden 'map' argument and
> bpf_timer_start() is receiving hidden 'prog' argument supplied by the verifier.
> The prog pointer is needed to do refcnting of bpf program to make sure that
> program doesn't get freed while the timer is armed. This apporach relies on

apporach -> approach

> "user refcnt" scheme used in prog_array that stores bpf programs for
> bpf_tail_call. The bpf_timer_start() will increment the prog refcnt which is
> paired with bpf_timer_cancel() that will drop the prog refcnt. The
> ops->map_release_uref is responsible for cancelling the timers and dropping
> prog refcnt when user space reference to a map reaches zero.
> This uref approach is done to make sure that Ctrl-C of user space process will
> not leave timers running forever unless the user space explicitly pinned a map
> that contained timers in bpffs.
> 
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
> 
> The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
> '__u64 :64' has 1 byte alignment of 8 byte padding.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   3 +
>   include/uapi/linux/bpf.h       |  55 +++++++
>   kernel/bpf/helpers.c           | 281 +++++++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          | 138 ++++++++++++++++
>   kernel/trace/bpf_trace.c       |   2 +-
>   scripts/bpf_doc.py             |   2 +
>   tools/include/uapi/linux/bpf.h |  55 +++++++
>   7 files changed, 535 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f309fc1509f2..72da9d4d070c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -168,6 +168,7 @@ struct bpf_map {
>   	u32 max_entries;
>   	u32 map_flags;
>   	int spin_lock_off; /* >=0 valid offset, <0 error */
> +	int timer_off; /* >=0 valid offset, <0 error */
>   	u32 id;
>   	int numa_node;
>   	u32 btf_key_type_id;
> @@ -221,6 +222,7 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>   }
>   void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   			   bool lock_src);
> +void bpf_timer_cancel_and_free(void *timer);
>   int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
>   
>   struct bpf_offload_dev;
> @@ -314,6 +316,7 @@ enum bpf_arg_type {
>   	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
>   	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
>   	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
> +	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
>   	__BPF_ARG_TYPE_MAX,
>   };
>   
[...]
> +
> +static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> +
> +static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
> +{
> +	struct bpf_hrtimer *t = container_of(hrtimer, struct bpf_hrtimer, timer);
> +	struct bpf_map *map = t->map;
> +	void *value = t->value;
> +	struct bpf_timer_kern *timer = value + map->timer_off;
> +	struct bpf_prog *prog;
> +	void *callback_fn;
> +	void *key;
> +	u32 idx;
> +	int ret;
> +
> +	____bpf_spin_lock(&timer->lock);

I think we may still have some issues.
Case 1:
   1. one bpf program is running in process context,
      bpf_timer_start() is called and timer->lock is taken
   2. timer softirq is triggered and this callback is called

Case 2:
   1. this callback is called, timer->lock is taken
   2. a nmi happens and some bpf program is called (kprobe, tracepoint,
      fentry/fexit or perf_event, etc.) and that program calls
      bpf_timer_start()

So we could have deadlock in both above cases?

> +	/* callback_fn and prog need to match. They're updated together
> +	 * and have to be read under lock.
> +	 */
> +	prog = t->prog;
> +	callback_fn = t->callback_fn;
> +
> +	/* wrap bpf subprog invocation with prog->refcnt++ and -- to make
> +	 * sure that refcnt doesn't become zero when subprog is executing.
> +	 * Do it under lock to make sure that bpf_timer_start doesn't drop
> +	 * prev prog refcnt to zero before timer_cb has a chance to bump it.
> +	 */
> +	bpf_prog_inc(prog);
> +	____bpf_spin_unlock(&timer->lock);
> +
> +	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
> +	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
> +	 * Remember the timer this callback is servicing to prevent
> +	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +	 */
> +	this_cpu_write(hrtimer_running, t);

This is not protected by spinlock, in bpf_timer_cancel() and
bpf_timer_cancel_and_free(), we have spinlock protected read, so
there is potential race conditions if callback function and 
helper/bpf_timer_cancel_and_free run in different context?

> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +		/* compute the key */
> +		idx = ((char *)value - array->value) / array->elem_size;
> +		key = &idx;
> +	} else { /* hash or lru */
> +		key = value - round_up(map->key_size, 8);
> +	}
> +
> +	ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
> +					 (u64)(long)key,
> +					 (u64)(long)value, 0, 0);
> +	WARN_ON(ret != 0); /* Next patch moves this check into the verifier */
> +	bpf_prog_put(prog);
> +
> +	this_cpu_write(hrtimer_running, NULL);
> +	return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, u64, flags,
> +	   struct bpf_map *, map)
> +{
> +	clockid_t clockid = flags & (MAX_CLOCKS - 1);
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	BUILD_BUG_ON(MAX_CLOCKS != 16);
> +	BUILD_BUG_ON(sizeof(struct bpf_timer_kern) > sizeof(struct bpf_timer));
> +	BUILD_BUG_ON(__alignof__(struct bpf_timer_kern) != __alignof__(struct bpf_timer));
> +
> +	if (flags >= MAX_CLOCKS ||
> +	    /* similar to timerfd except _ALARM variants are not supported */
> +	    (clockid != CLOCK_MONOTONIC &&
> +	     clockid != CLOCK_REALTIME &&
> +	     clockid != CLOCK_BOOTTIME))
> +		return -EINVAL;
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (t) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> +	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> +	if (!t) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	t->value = (void *)timer - map->timer_off;
> +	t->map = map;
> +	t->prog = NULL;
> +	t->callback_fn = NULL;
> +	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> +	t->timer.function = bpf_timer_cb;
> +	timer->timer = t;
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_init_proto = {
> +	.func		= bpf_timer_init,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_4(bpf_timer_start, struct bpf_timer_kern *, timer, void *, callback_fn,
> +	   u64, nsecs, struct bpf_prog *, prog)
> +{
> +	struct bpf_hrtimer *t;
> +	struct bpf_prog *prev;
> +	int ret = 0;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	prev = t->prog;
> +	if (prev != prog) {
> +		if (prev)
> +			/* Drop pref prog refcnt when swapping with new prog */

pref -> prev

> +			bpf_prog_put(prev);

Maybe we want to put the above two lines with {}?

> +		/* Dump prog refcnt once.
> +		 * Every bpf_timer_start() can pick different callback_fn-s
> +		 * within the same prog.
> +		 */
> +		bpf_prog_inc(prog);
> +		t->prog = prog;
> +	}
> +	t->callback_fn = callback_fn;
> +	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_start_proto = {
> +	.func		= bpf_timer_start,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_PTR_TO_FUNC,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static void drop_prog_refcnt(struct bpf_hrtimer *t)
> +{
> +	struct bpf_prog *prog = t->prog;
> +
> +	if (prog) {
> +		/* If timer was armed with bpf_timer_start()
> +		 * drop prog refcnt.
> +		 */
> +		bpf_prog_put(prog);
> +		t->prog = NULL;
> +		t->callback_fn = NULL;
> +	}
> +}
> +
> +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> +{
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (this_cpu_read(hrtimer_running) == t) {
> +		/* If bpf callback_fn is trying to bpf_timer_cancel()
> +		 * its own timer the hrtimer_cancel() will deadlock
> +		 * since it waits for callback_fn to finish
> +		 */
> +		ret = -EDEADLK;
> +		goto out;
> +	}
> +	/* Cancel the timer and wait for associated callback to finish
> +	 * if it was running.
> +	 */
> +	ret = hrtimer_cancel(&t->timer);
> +	drop_prog_refcnt(t);
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_cancel_proto = {
> +	.func		= bpf_timer_cancel,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +};
> +
> +/* This function is called by map_delete/update_elem for individual element.
> + * By ops->map_release_uref when the user space reference to a map reaches zero
> + * and by ops->map_free when the kernel reference reaches zero.
> + */
> +void bpf_timer_cancel_and_free(void *val)
> +{
> +	struct bpf_timer_kern *timer = val;
> +	struct bpf_hrtimer *t;
> +
> +	/* Performance optimization: read timer->timer without lock first. */
> +	if (!READ_ONCE(timer->timer))
> +		return;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	/* re-read it under lock */
> +	t = timer->timer;
> +	if (!t)
> +		goto out;
> +	/* Cancel the timer and wait for callback to complete if it was running.
> +	 * Check that bpf_map_delete/update_elem() wasn't called from timer callback_fn.
> +	 * In such case don't call hrtimer_cancel() (since it will deadlock)
> +	 * and don't call hrtimer_try_to_cancel() (since it will just return -1).
> +	 * Instead free the timer and set timer->timer = NULL.
> +	 * The subsequent bpf_timer_start/cancel() helpers won't be able to use it,
> +	 * since it won't be initialized.
> +	 * In preallocated maps it's safe to do timer->timer = NULL.
> +	 * The memory could be reused for another map element while current
> +	 * callback_fn can do bpf_timer_init() on it.
> +	 * In non-preallocated maps bpf_timer_cancel_and_free and
> +	 * timer->timer = NULL will happen after callback_fn completes, since
> +	 * program execution is an RCU critical section.
> +	 */
> +	if (this_cpu_read(hrtimer_running) != t)
> +		hrtimer_cancel(&t->timer);

We could still have race conditions here when 
bpf_timer_cancel_and_free() runs in process context and callback in
softirq context. I guess we might be okay.

But if bpf_timer_cancel_and_free() in nmi context, not 100% sure
whether we have issues or not.

> +	drop_prog_refcnt(t);
> +	kfree(t);
> +	timer->timer = NULL;
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +}
> +
>   const struct bpf_func_proto bpf_get_current_task_proto __weak;
>   const struct bpf_func_proto bpf_probe_read_user_proto __weak;
>   const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
> @@ -1055,6 +1330,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-25  6:25   ` Yonghong Song
@ 2021-06-25 14:57     ` Alexei Starovoitov
  2021-06-25 15:54       ` Yonghong Song
  0 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-25 14:57 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov, davem
  Cc: daniel, andrii, netdev, bpf, kernel-team

On 6/24/21 11:25 PM, Yonghong Song wrote:
> 
>> +
>> +    ____bpf_spin_lock(&timer->lock);
> 
> I think we may still have some issues.
> Case 1:
>    1. one bpf program is running in process context,
>       bpf_timer_start() is called and timer->lock is taken
>    2. timer softirq is triggered and this callback is called

___bpf_spin_lock is actually irqsave version of spin_lock.
So this race is not possible.

> Case 2:
>    1. this callback is called, timer->lock is taken
>    2. a nmi happens and some bpf program is called (kprobe, tracepoint,
>       fentry/fexit or perf_event, etc.) and that program calls
>       bpf_timer_start()
> 
> So we could have deadlock in both above cases?

Shouldn't be possible either because bpf timers are not allowed
in nmi-bpf-progs. I'll double check that it's the case.
Pretty much the same restrictions are with bpf_spin_lock.

> 
>> +    /* callback_fn and prog need to match. They're updated together
>> +     * and have to be read under lock.
>> +     */
>> +    prog = t->prog;
>> +    callback_fn = t->callback_fn;
>> +
>> +    /* wrap bpf subprog invocation with prog->refcnt++ and -- to make
>> +     * sure that refcnt doesn't become zero when subprog is executing.
>> +     * Do it under lock to make sure that bpf_timer_start doesn't drop
>> +     * prev prog refcnt to zero before timer_cb has a chance to bump it.
>> +     */
>> +    bpf_prog_inc(prog);
>> +    ____bpf_spin_unlock(&timer->lock);
>> +
>> +    /* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate 
>> and
>> +     * cannot be preempted by another bpf_timer_cb() on the same cpu.
>> +     * Remember the timer this callback is servicing to prevent
>> +     * deadlock if callback_fn() calls bpf_timer_cancel() on the same 
>> timer.
>> +     */
>> +    this_cpu_write(hrtimer_running, t);
> 
> This is not protected by spinlock, in bpf_timer_cancel() and
> bpf_timer_cancel_and_free(), we have spinlock protected read, so
> there is potential race conditions if callback function and 
> helper/bpf_timer_cancel_and_free run in different context?

what kind of race do you see?
This is per-cpu var and bpf_timer_cb is in softirq
while timer_cancel/cancel_and_free are calling it under
spin_lock_irqsave... so they cannot race because softirq
and bpf_timer_cb will run after start/canel/cancel_free
will do unlock_irqrestore.

>> +    prev = t->prog;
>> +    if (prev != prog) {
>> +        if (prev)
>> +            /* Drop pref prog refcnt when swapping with new prog */
> 
> pref -> prev
> 
>> +            bpf_prog_put(prev);
> 
> Maybe we want to put the above two lines with {}?

you mean add {} because there is a comment ?
I don't think the kernel coding style considers comment as a statement.

>> +    if (this_cpu_read(hrtimer_running) != t)
>> +        hrtimer_cancel(&t->timer);
> 
> We could still have race conditions here when 
> bpf_timer_cancel_and_free() runs in process context and callback in
> softirq context. I guess we might be okay.

No, since this check is under spin_lock_irsave.

> But if bpf_timer_cancel_and_free() in nmi context, not 100% sure
> whether we have issues or not.

timers shouldn't be available to nmi-bpf progs.
There will be all sorts of issues.
The underlying hrtimer implementation cannot deal with nmi either.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-25 14:57     ` Alexei Starovoitov
@ 2021-06-25 15:54       ` Yonghong Song
  2021-06-29  1:39         ` Alexei Starovoitov
  0 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-06-25 15:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexei Starovoitov, davem
  Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/25/21 7:57 AM, Alexei Starovoitov wrote:
> On 6/24/21 11:25 PM, Yonghong Song wrote:
>>
>>> +
>>> +    ____bpf_spin_lock(&timer->lock);
>>
>> I think we may still have some issues.
>> Case 1:
>>    1. one bpf program is running in process context,
>>       bpf_timer_start() is called and timer->lock is taken
>>    2. timer softirq is triggered and this callback is called
> 
> ___bpf_spin_lock is actually irqsave version of spin_lock.
> So this race is not possible.

Sorry I missed that ____bpf_spin_lock() has local_irq_save(),
so yes. the above situation cannot happen.

> 
>> Case 2:
>>    1. this callback is called, timer->lock is taken
>>    2. a nmi happens and some bpf program is called (kprobe, tracepoint,
>>       fentry/fexit or perf_event, etc.) and that program calls
>>       bpf_timer_start()
>>
>> So we could have deadlock in both above cases?
> 
> Shouldn't be possible either because bpf timers are not allowed
> in nmi-bpf-progs. I'll double check that it's the case.
> Pretty much the same restrictions are with bpf_spin_lock.

The patch added bpf_base_func_proto() to bpf_tracing_func_proto:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7a52bc172841..80f6e6dafd5e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1057,7 +1057,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, 
const struct bpf_prog *prog)
  	case BPF_FUNC_snprintf:
  		return &bpf_snprintf_proto;
  	default:
-		return NULL;
+		return bpf_base_func_proto(func_id);
  	}
  }

and timer helpers are added to bpf_base_func_proto:
@@ -1055,6 +1330,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
  		return &bpf_per_cpu_ptr_proto;
  	case BPF_FUNC_this_cpu_ptr:
  		return &bpf_this_cpu_ptr_proto;
+	case BPF_FUNC_timer_init:
+		return &bpf_timer_init_proto;
+	case BPF_FUNC_timer_start:
+		return &bpf_timer_start_proto;
+	case BPF_FUNC_timer_cancel:
+		return &bpf_timer_cancel_proto;
  	default:
  		break;
  	}

static const struct bpf_func_proto *
pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
         switch (func_id) {
...
         default:
                 return bpf_tracing_func_proto(func_id, prog);
         }
}

static const struct bpf_func_proto *
kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog 
*prog)
{
...
         default:
                 return bpf_tracing_func_proto(func_id, prog);
         }
}

Also, we have some functions inside ____bpf_spin_lock() e.g., 
bpf_prog_inc(), hrtimer_start(), etc. If we want to be absolutely safe,
we need to mark them not tracable for kprobe/kretprobe/fentry/fexit/...
But I am not sure whether this is really needed or not.

> 
>>
>>> +    /* callback_fn and prog need to match. They're updated together
>>> +     * and have to be read under lock.
>>> +     */
>>> +    prog = t->prog;
>>> +    callback_fn = t->callback_fn;
>>> +
>>> +    /* wrap bpf subprog invocation with prog->refcnt++ and -- to make
>>> +     * sure that refcnt doesn't become zero when subprog is executing.
>>> +     * Do it under lock to make sure that bpf_timer_start doesn't drop
>>> +     * prev prog refcnt to zero before timer_cb has a chance to bump 
>>> it.
>>> +     */
>>> +    bpf_prog_inc(prog);
>>> +    ____bpf_spin_unlock(&timer->lock);
>>> +
>>> +    /* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't 
>>> migrate and
>>> +     * cannot be preempted by another bpf_timer_cb() on the same cpu.
>>> +     * Remember the timer this callback is servicing to prevent
>>> +     * deadlock if callback_fn() calls bpf_timer_cancel() on the 
>>> same timer.
>>> +     */
>>> +    this_cpu_write(hrtimer_running, t);
>>
>> This is not protected by spinlock, in bpf_timer_cancel() and
>> bpf_timer_cancel_and_free(), we have spinlock protected read, so
>> there is potential race conditions if callback function and 
>> helper/bpf_timer_cancel_and_free run in different context?
> 
> what kind of race do you see?
> This is per-cpu var and bpf_timer_cb is in softirq
> while timer_cancel/cancel_and_free are calling it under
> spin_lock_irqsave... so they cannot race because softirq
> and bpf_timer_cb will run after start/canel/cancel_free
> will do unlock_irqrestore.

Again, I missed local_irq_save(). With irqsave, this indeed
won't happen. The same for a few comments below.

> 
>>> +    prev = t->prog;
>>> +    if (prev != prog) {
>>> +        if (prev)
>>> +            /* Drop pref prog refcnt when swapping with new prog */
>>
>> pref -> prev
>>
>>> +            bpf_prog_put(prev);
>>
>> Maybe we want to put the above two lines with {}?
> 
> you mean add {} because there is a comment ?
> I don't think the kernel coding style considers comment as a statement.
> 
>>> +    if (this_cpu_read(hrtimer_running) != t)
>>> +        hrtimer_cancel(&t->timer);
>>
>> We could still have race conditions here when 
>> bpf_timer_cancel_and_free() runs in process context and callback in
>> softirq context. I guess we might be okay.
> 
> No, since this check is under spin_lock_irsave.
> 
>> But if bpf_timer_cancel_and_free() in nmi context, not 100% sure
>> whether we have issues or not.
> 
> timers shouldn't be available to nmi-bpf progs.
> There will be all sorts of issues.
> The underlying hrtimer implementation cannot deal with nmi either.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
  2021-06-25  6:25   ` Yonghong Song
@ 2021-06-25 16:54   ` Yonghong Song
  2021-06-29  1:46     ` Alexei Starovoitov
  2021-07-01  5:40   ` Alexei Starovoitov
  2 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-06-25 16:54 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as a regular field and helpers to operate on it:
> 
> // Initialize the timer.
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, int flags);
> 
> // Arm the timer to call callback_fn static function and set its
> // expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsec);
> 
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
> 
> Here is how BPF program might look like:
> struct map_elem {
>      int counter;
>      struct bpf_timer timer;
> };
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_HASH);
>      __uint(max_entries, 1000);
>      __type(key, int);
>      __type(value, struct map_elem);
> } hmap SEC(".maps");
> 
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
> 
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>      struct map_elem *val;
>      int key = 0;
> 
>      val = bpf_map_lookup_elem(&hmap, &key);
>      if (val) {
>          bpf_timer_init(&val->timer, CLOCK_REALTIME);
>          bpf_timer_start(&val->timer, timer_cb, 1000 /* call timer_cb2 in 1 usec */);
>      }
> }
> 
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patches add necessary safety checks.
> 
> Only programs with CAP_BPF are allowed to use bpf_timer.
> 
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
> 
> The bpf_timer_init() helper is receiving hidden 'map' argument and
> bpf_timer_start() is receiving hidden 'prog' argument supplied by the verifier.
> The prog pointer is needed to do refcnting of bpf program to make sure that
> program doesn't get freed while the timer is armed. This apporach relies on
> "user refcnt" scheme used in prog_array that stores bpf programs for
> bpf_tail_call. The bpf_timer_start() will increment the prog refcnt which is
> paired with bpf_timer_cancel() that will drop the prog refcnt. The
> ops->map_release_uref is responsible for cancelling the timers and dropping
> prog refcnt when user space reference to a map reaches zero.
> This uref approach is done to make sure that Ctrl-C of user space process will
> not leave timers running forever unless the user space explicitly pinned a map
> that contained timers in bpffs.
> 
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
> 
> The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
> '__u64 :64' has 1 byte alignment of 8 byte padding.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   3 +
>   include/uapi/linux/bpf.h       |  55 +++++++
>   kernel/bpf/helpers.c           | 281 +++++++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          | 138 ++++++++++++++++
>   kernel/trace/bpf_trace.c       |   2 +-
>   scripts/bpf_doc.py             |   2 +
>   tools/include/uapi/linux/bpf.h |  55 +++++++
>   7 files changed, 535 insertions(+), 1 deletion(-)
> 
[...]
> @@ -12533,6 +12607,70 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>   			continue;
>   		}
>   
> +		if (insn->imm == BPF_FUNC_timer_init) {
> +			aux = &env->insn_aux_data[i + delta];
> +			if (bpf_map_ptr_poisoned(aux)) {
> +				verbose(env, "bpf_timer_init abusing map_ptr\n");
> +				return -EINVAL;
> +			}
> +			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> +			{
> +				struct bpf_insn ld_addrs[2] = {
> +					BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
> +				};
> +
> +				insn_buf[0] = ld_addrs[0];
> +				insn_buf[1] = ld_addrs[1];
> +			}
> +			insn_buf[2] = *insn;
> +			cnt = 3;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta    += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn      = new_prog->insnsi + i + delta;
> +			goto patch_call_imm;
> +		}
> +
> +		if (insn->imm == BPF_FUNC_timer_start) {
> +			/* There is no need to do:
> +			 *     aux = &env->insn_aux_data[i + delta];
> +			 *     if (bpf_map_ptr_poisoned(aux)) return -EINVAL;
> +			 * for bpf_timer_start(). If the same callback_fn is shared
> +			 * by different timers in different maps the poisoned check
> +			 * will return false positive.
> +			 *
> +			 * The verifier will process callback_fn as many times as necessary
> +			 * with different maps and the register states prepared by
> +			 * set_timer_start_callback_state will be accurate.
> +			 *
> +			 * There is no need for bpf_timer_start() to check in the
> +			 * run-time that bpf_hrtimer->map stored during bpf_timer_init()
> +			 * is the same map as in bpf_timer_start()
> +			 * because it's the same map element value.

I am puzzled by above comments. Maybe you could explain more?
bpf_timer_start() checked whether timer is initialized with timer->timer 
NULL check. It will proceed only if a valid timer has been
initialized. I think the following scenarios are also supported:
   1. map1 is shared by prog1 and prog2
   2. prog1 call bpf_timer_init for all map1 elements
   3. prog2 call bpf_timer_start for some or all map1 elements.
So for prog2 verification, bpf_timer_init() is not even called.

> +			 */
> +			struct bpf_insn ld_addrs[2] = {
> +				BPF_LD_IMM64(BPF_REG_4, (long)prog),
> +			};
> +
> +			insn_buf[0] = ld_addrs[0];
> +			insn_buf[1] = ld_addrs[1];
> +			insn_buf[2] = *insn;
> +			cnt = 3;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta    += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn      = new_prog->insnsi + i + delta;
> +			goto patch_call_imm;
> +		}
> +
>   		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
>   		 * and other inlining handlers are currently limited to 64 bit
>   		 * only.
[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 2/8] bpf: Add map side support for bpf timers.
  2021-06-24  2:25 ` [PATCH v3 bpf-next 2/8] bpf: Add map side support for " Alexei Starovoitov
@ 2021-06-25 19:46   ` Yonghong Song
  2021-06-29  1:49     ` Alexei Starovoitov
  0 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-06-25 19:46 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Restrict bpf timers to array, hash (both preallocated and kmalloced), and
> lru map types. The per-cpu maps with timers don't make sense, since 'struct
> bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
> the number of timers depends on number of possible cpus and timers would not be
> accessible from all cpus. lpm map support can be added in the future.
> The timers in inner maps are supported.
> 
> The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
> bpf_timer in a given map element.
> 
> Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
> that map element indeed contains 'struct bpf_timer'.
> 
> Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
> map element data is reused in preallocated htab and lru maps.
> 
> Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
> map element. There could be one of each, but not more than one. Due to 'one
> bpf_timer in one element' restriction do not support timers in global data,
> since global data is a map of single element, but from bpf program side it's
> seen as many global variables and restriction of single global timer would be
> odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
> timers, since user space could have corrupted mmap element and crashed the
> kernel. The maps with timers cannot be readonly. Due to these restrictions
> search for bpf_timer in datasec BTF in case it was placed in the global data to
> report clear error.
> 
> The previous patch allowed 'struct bpf_timer' as a first field in a map
> element only. Relax this restriction.
> 
> Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
> the timer when lru map deletes an element as a part of it eviction algorithm.
> 
> Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
> The timer operation are done through helpers only.
> This is similar to 'struct bpf_spin_lock'.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

I didn't find major issues. Only one minor comment below. I do a race
during map_update where updated map elements will have timer removed
but at the same time the timer might still be used after update. But
irq spinlock should handle this properly.

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   include/linux/bpf.h        | 44 ++++++++++++-----
>   include/linux/btf.h        |  1 +
>   kernel/bpf/arraymap.c      | 22 +++++++++
>   kernel/bpf/btf.c           | 77 ++++++++++++++++++++++++------
>   kernel/bpf/hashtab.c       | 96 +++++++++++++++++++++++++++++++++-----
>   kernel/bpf/local_storage.c |  4 +-
>   kernel/bpf/map_in_map.c    |  1 +
>   kernel/bpf/syscall.c       | 21 +++++++--
>   kernel/bpf/verifier.c      | 30 ++++++++++--
>   9 files changed, 251 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 72da9d4d070c..90e6c6d1404c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -198,24 +198,46 @@ static inline bool map_value_has_spin_lock(const struct bpf_map *map)
>   	return map->spin_lock_off >= 0;
>   }
>   
> -static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
> +static inline bool map_value_has_timer(const struct bpf_map *map)
>   {
> -	if (likely(!map_value_has_spin_lock(map)))
> -		return;
> -	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
> -		(struct bpf_spin_lock){};
> +	return map->timer_off >= 0;
>   }
>   
> -/* copy everything but bpf_spin_lock */
> +static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> +{
> +	if (unlikely(map_value_has_spin_lock(map)))
> +		*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
> +			(struct bpf_spin_lock){};
> +	if (unlikely(map_value_has_timer(map)))
> +		*(struct bpf_timer *)(dst + map->timer_off) =
> +			(struct bpf_timer){};
> +}
> +
[...]
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6f6681b07364..e85a5839ffe8 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -228,6 +228,28 @@ static struct htab_elem *get_htab_elem(struct bpf_htab *htab, int i)
>   	return (struct htab_elem *) (htab->elems + i * (u64)htab->elem_size);
>   }
>   
> +static void htab_free_prealloced_timers(struct bpf_htab *htab)
> +{
> +	u32 num_entries = htab->map.max_entries;
> +	int i;
> +
> +	if (likely(!map_value_has_timer(&htab->map)))
> +		return;
> +	if (!htab_is_percpu(htab) && !htab_is_lru(htab))

Is !htab_is_percpu(htab) needed? I think we already checked
if map_value has timer it can only be hash/lruhash/array?

> +		/* htab has extra_elems */
> +		num_entries += num_possible_cpus();
> +
> +	for (i = 0; i < num_entries; i++) {
> +		struct htab_elem *elem;
> +
> +		elem = get_htab_elem(htab, i);
> +		bpf_timer_cancel_and_free(elem->key +
> +					  round_up(htab->map.key_size, 8) +
> +					  htab->map.timer_off);
> +		cond_resched();
> +	}
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-25 15:54       ` Yonghong Song
@ 2021-06-29  1:39         ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-29  1:39 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, davem, daniel, andrii, netdev, bpf, kernel-team

On Fri, Jun 25, 2021 at 08:54:55AM -0700, Yonghong Song wrote:
> 
> 
> On 6/25/21 7:57 AM, Alexei Starovoitov wrote:
> > On 6/24/21 11:25 PM, Yonghong Song wrote:
> > > 
> > > > +
> > > > +    ____bpf_spin_lock(&timer->lock);
> > > 
> > > I think we may still have some issues.
> > > Case 1:
> > >    1. one bpf program is running in process context,
> > >       bpf_timer_start() is called and timer->lock is taken
> > >    2. timer softirq is triggered and this callback is called
> > 
> > ___bpf_spin_lock is actually irqsave version of spin_lock.
> > So this race is not possible.
> 
> Sorry I missed that ____bpf_spin_lock() has local_irq_save(),
> so yes. the above situation cannot happen.

Yeah. It was confusing. I'll add a comment.

> > 
> > > Case 2:
> > >    1. this callback is called, timer->lock is taken
> > >    2. a nmi happens and some bpf program is called (kprobe, tracepoint,
> > >       fentry/fexit or perf_event, etc.) and that program calls
> > >       bpf_timer_start()
> > > 
> > > So we could have deadlock in both above cases?
> > 
> > Shouldn't be possible either because bpf timers are not allowed
> > in nmi-bpf-progs. I'll double check that it's the case.
> > Pretty much the same restrictions are with bpf_spin_lock.
> 
> The patch added bpf_base_func_proto() to bpf_tracing_func_proto:
> 
> Also, we have some functions inside ____bpf_spin_lock() e.g.,
> bpf_prog_inc(), hrtimer_start(), etc. If we want to be absolutely safe,
> we need to mark them not tracable for kprobe/kretprobe/fentry/fexit/...
> But I am not sure whether this is really needed or not.

Probably not.
I'll add in_nmi() runtime check to prevent nmi and kprobes.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 3/8] bpf: Remember BTF of inner maps.
  2021-06-24  2:25 ` [PATCH v3 bpf-next 3/8] bpf: Remember BTF of inner maps Alexei Starovoitov
@ 2021-06-29  1:45   ` Yonghong Song
  0 siblings, 0 replies; 28+ messages in thread
From: Yonghong Song @ 2021-06-29  1:45 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> BTF is required for 'struct bpf_timer' to be recognized inside map value.
> The bpf timers are supported inside inner maps.
> Remember 'struct btf *' in inner_map_meta to make it available
> to the verifier in the sequence:
> 
> struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
> if (inner_map)
>      timer = bpf_map_lookup_elem(&inner_map, ...);
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-25 16:54   ` Yonghong Song
@ 2021-06-29  1:46     ` Alexei Starovoitov
  2021-06-29  2:24       ` Yonghong Song
  2021-06-29  6:34       ` Andrii Nakryiko
  0 siblings, 2 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-29  1:46 UTC (permalink / raw)
  To: Yonghong Song; +Cc: davem, daniel, andrii, netdev, bpf, kernel-team

On Fri, Jun 25, 2021 at 09:54:11AM -0700, Yonghong Song wrote:
> 
> 
> On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> > in hash/array/lru maps as a regular field and helpers to operate on it:
> > 
> > // Initialize the timer.
> > // First 4 bits of 'flags' specify clockid.
> > // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> > long bpf_timer_init(struct bpf_timer *timer, int flags);
> > 
> > // Arm the timer to call callback_fn static function and set its
> > // expiration 'nsec' nanoseconds from the current time.
> > long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsec);
> > 
> > // Cancel the timer and wait for callback_fn to finish if it was running.
> > long bpf_timer_cancel(struct bpf_timer *timer);
> > 
> > Here is how BPF program might look like:
> > struct map_elem {
> >      int counter;
> >      struct bpf_timer timer;
> > };
> > 
> > struct {
> >      __uint(type, BPF_MAP_TYPE_HASH);
> >      __uint(max_entries, 1000);
> >      __type(key, int);
> >      __type(value, struct map_elem);
> > } hmap SEC(".maps");
> > 
> > static int timer_cb(void *map, int *key, struct map_elem *val);
> > /* val points to particular map element that contains bpf_timer. */
> > 
> > SEC("fentry/bpf_fentry_test1")
> > int BPF_PROG(test1, int a)
> > {
> >      struct map_elem *val;
> >      int key = 0;
> > 
> >      val = bpf_map_lookup_elem(&hmap, &key);
> >      if (val) {
> >          bpf_timer_init(&val->timer, CLOCK_REALTIME);
> >          bpf_timer_start(&val->timer, timer_cb, 1000 /* call timer_cb2 in 1 usec */);
> >      }
> > }
> > 
> > This patch adds helper implementations that rely on hrtimers
> > to call bpf functions as timers expire.
> > The following patches add necessary safety checks.
> > 
> > Only programs with CAP_BPF are allowed to use bpf_timer.
> > 
> > The amount of timers used by the program is constrained by
> > the memcg recorded at map creation time.
> > 
> > The bpf_timer_init() helper is receiving hidden 'map' argument and
> > bpf_timer_start() is receiving hidden 'prog' argument supplied by the verifier.
> > The prog pointer is needed to do refcnting of bpf program to make sure that
> > program doesn't get freed while the timer is armed. This apporach relies on
> > "user refcnt" scheme used in prog_array that stores bpf programs for
> > bpf_tail_call. The bpf_timer_start() will increment the prog refcnt which is
> > paired with bpf_timer_cancel() that will drop the prog refcnt. The
> > ops->map_release_uref is responsible for cancelling the timers and dropping
> > prog refcnt when user space reference to a map reaches zero.
> > This uref approach is done to make sure that Ctrl-C of user space process will
> > not leave timers running forever unless the user space explicitly pinned a map
> > that contained timers in bpffs.
> > 
> > The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> > and free the timer if given map element had it allocated.
> > "bpftool map update" command can be used to cancel timers.
> > 
> > The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
> > '__u64 :64' has 1 byte alignment of 8 byte padding.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >   include/linux/bpf.h            |   3 +
> >   include/uapi/linux/bpf.h       |  55 +++++++
> >   kernel/bpf/helpers.c           | 281 +++++++++++++++++++++++++++++++++
> >   kernel/bpf/verifier.c          | 138 ++++++++++++++++
> >   kernel/trace/bpf_trace.c       |   2 +-
> >   scripts/bpf_doc.py             |   2 +
> >   tools/include/uapi/linux/bpf.h |  55 +++++++
> >   7 files changed, 535 insertions(+), 1 deletion(-)
> > 
> [...]
> > @@ -12533,6 +12607,70 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >   			continue;
> >   		}
> > +		if (insn->imm == BPF_FUNC_timer_init) {
> > +			aux = &env->insn_aux_data[i + delta];
> > +			if (bpf_map_ptr_poisoned(aux)) {
> > +				verbose(env, "bpf_timer_init abusing map_ptr\n");
> > +				return -EINVAL;
> > +			}
> > +			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> > +			{
> > +				struct bpf_insn ld_addrs[2] = {
> > +					BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
> > +				};
> > +
> > +				insn_buf[0] = ld_addrs[0];
> > +				insn_buf[1] = ld_addrs[1];
> > +			}
> > +			insn_buf[2] = *insn;
> > +			cnt = 3;
> > +
> > +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > +			if (!new_prog)
> > +				return -ENOMEM;
> > +
> > +			delta    += cnt - 1;
> > +			env->prog = prog = new_prog;
> > +			insn      = new_prog->insnsi + i + delta;
> > +			goto patch_call_imm;
> > +		}
> > +
> > +		if (insn->imm == BPF_FUNC_timer_start) {
> > +			/* There is no need to do:
> > +			 *     aux = &env->insn_aux_data[i + delta];
> > +			 *     if (bpf_map_ptr_poisoned(aux)) return -EINVAL;
> > +			 * for bpf_timer_start(). If the same callback_fn is shared
> > +			 * by different timers in different maps the poisoned check
> > +			 * will return false positive.
> > +			 *
> > +			 * The verifier will process callback_fn as many times as necessary
> > +			 * with different maps and the register states prepared by
> > +			 * set_timer_start_callback_state will be accurate.
> > +			 *
> > +			 * There is no need for bpf_timer_start() to check in the
> > +			 * run-time that bpf_hrtimer->map stored during bpf_timer_init()
> > +			 * is the same map as in bpf_timer_start()
> > +			 * because it's the same map element value.
> 
> I am puzzled by above comments. Maybe you could explain more?
> bpf_timer_start() checked whether timer is initialized with timer->timer
> NULL check. It will proceed only if a valid timer has been
> initialized. I think the following scenarios are also supported:
>   1. map1 is shared by prog1 and prog2
>   2. prog1 call bpf_timer_init for all map1 elements
>   3. prog2 call bpf_timer_start for some or all map1 elements.
> So for prog2 verification, bpf_timer_init() is not even called.

Right. Such timer sharing between two progs is supported.
From prog2 pov the bpf_timer_init() was not called, but it certainly
had to be called by this or ther other prog.
I'll rephrase the last paragraph.

While talking to Martin about the api he pointed out that
callback_fn in timer_start() doesn't achieve the full use case
of replacing a prog. So in the next spin I'll split it into
bpf_timer_set_callback(timer, callback_fn);
bpf_timer_start(timer, nsec);
This way callback and prog can be replaced without resetting
timer expiry which could be useful.

Also Daniel and Andrii reminded that cpu pinning would be next
feature request. The api extensibility allows to add it in the future.
I'm going to delay implementing it until bpf_smp_call_single()
implications are understood.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 2/8] bpf: Add map side support for bpf timers.
  2021-06-25 19:46   ` Yonghong Song
@ 2021-06-29  1:49     ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-29  1:49 UTC (permalink / raw)
  To: Yonghong Song; +Cc: davem, daniel, andrii, netdev, bpf, kernel-team

On Fri, Jun 25, 2021 at 12:46:03PM -0700, Yonghong Song wrote:
> 
> 
> On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Restrict bpf timers to array, hash (both preallocated and kmalloced), and
> > lru map types. The per-cpu maps with timers don't make sense, since 'struct
> > bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
> > the number of timers depends on number of possible cpus and timers would not be
> > accessible from all cpus. lpm map support can be added in the future.
> > The timers in inner maps are supported.
> > 
> > The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
> > bpf_timer in a given map element.
> > 
> > Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
> > that map element indeed contains 'struct bpf_timer'.
> > 
> > Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
> > map element data is reused in preallocated htab and lru maps.
> > 
> > Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
> > map element. There could be one of each, but not more than one. Due to 'one
> > bpf_timer in one element' restriction do not support timers in global data,
> > since global data is a map of single element, but from bpf program side it's
> > seen as many global variables and restriction of single global timer would be
> > odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
> > timers, since user space could have corrupted mmap element and crashed the
> > kernel. The maps with timers cannot be readonly. Due to these restrictions
> > search for bpf_timer in datasec BTF in case it was placed in the global data to
> > report clear error.
> > 
> > The previous patch allowed 'struct bpf_timer' as a first field in a map
> > element only. Relax this restriction.
> > 
> > Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
> > the timer when lru map deletes an element as a part of it eviction algorithm.
> > 
> > Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
> > The timer operation are done through helpers only.
> > This is similar to 'struct bpf_spin_lock'.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> 
> I didn't find major issues. Only one minor comment below. I do a race
> during map_update where updated map elements will have timer removed
> but at the same time the timer might still be used after update. But
> irq spinlock should handle this properly.

Right. It's safe from kernel pov. But non-atomic from bpf prog pov.
If one prog doing map_update_elem() it replaced the other fields
except timer, but the other fields are also not atomic.
So not a new concern for bpf authors.

> Acked-by: Yonghong Song <yhs@fb.com>
> 
> > ---
> >   include/linux/bpf.h        | 44 ++++++++++++-----
> >   include/linux/btf.h        |  1 +
> >   kernel/bpf/arraymap.c      | 22 +++++++++
> >   kernel/bpf/btf.c           | 77 ++++++++++++++++++++++++------
> >   kernel/bpf/hashtab.c       | 96 +++++++++++++++++++++++++++++++++-----
> >   kernel/bpf/local_storage.c |  4 +-
> >   kernel/bpf/map_in_map.c    |  1 +
> >   kernel/bpf/syscall.c       | 21 +++++++--
> >   kernel/bpf/verifier.c      | 30 ++++++++++--
> >   9 files changed, 251 insertions(+), 45 deletions(-)
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 72da9d4d070c..90e6c6d1404c 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -198,24 +198,46 @@ static inline bool map_value_has_spin_lock(const struct bpf_map *map)
> >   	return map->spin_lock_off >= 0;
> >   }
> > -static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
> > +static inline bool map_value_has_timer(const struct bpf_map *map)
> >   {
> > -	if (likely(!map_value_has_spin_lock(map)))
> > -		return;
> > -	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
> > -		(struct bpf_spin_lock){};
> > +	return map->timer_off >= 0;
> >   }
> > -/* copy everything but bpf_spin_lock */
> > +static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > +{
> > +	if (unlikely(map_value_has_spin_lock(map)))
> > +		*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
> > +			(struct bpf_spin_lock){};
> > +	if (unlikely(map_value_has_timer(map)))
> > +		*(struct bpf_timer *)(dst + map->timer_off) =
> > +			(struct bpf_timer){};
> > +}
> > +
> [...]
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 6f6681b07364..e85a5839ffe8 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -228,6 +228,28 @@ static struct htab_elem *get_htab_elem(struct bpf_htab *htab, int i)
> >   	return (struct htab_elem *) (htab->elems + i * (u64)htab->elem_size);
> >   }
> > +static void htab_free_prealloced_timers(struct bpf_htab *htab)
> > +{
> > +	u32 num_entries = htab->map.max_entries;
> > +	int i;
> > +
> > +	if (likely(!map_value_has_timer(&htab->map)))
> > +		return;
> > +	if (!htab_is_percpu(htab) && !htab_is_lru(htab))
> 
> Is !htab_is_percpu(htab) needed? I think we already checked
> if map_value has timer it can only be hash/lruhash/array?

Technically not, but it's good to keep this part the same as
in prealloc_init(). I'll refactor these two checks into a small
helper function and will use in both places.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-29  1:46     ` Alexei Starovoitov
@ 2021-06-29  2:24       ` Yonghong Song
  2021-06-29  3:32         ` Alexei Starovoitov
  2021-06-29  6:34       ` Andrii Nakryiko
  1 sibling, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-06-29  2:24 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, daniel, andrii, netdev, bpf, kernel-team



On 6/28/21 6:46 PM, Alexei Starovoitov wrote:
> On Fri, Jun 25, 2021 at 09:54:11AM -0700, Yonghong Song wrote:
>>
>>
>> On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
>>> in hash/array/lru maps as a regular field and helpers to operate on it:
>>>
>>> // Initialize the timer.
>>> // First 4 bits of 'flags' specify clockid.
>>> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
>>> long bpf_timer_init(struct bpf_timer *timer, int flags);
>>>
>>> // Arm the timer to call callback_fn static function and set its
>>> // expiration 'nsec' nanoseconds from the current time.
>>> long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsec);
>>>
>>> // Cancel the timer and wait for callback_fn to finish if it was running.
>>> long bpf_timer_cancel(struct bpf_timer *timer);
>>>
>>> Here is how BPF program might look like:
>>> struct map_elem {
>>>       int counter;
>>>       struct bpf_timer timer;
>>> };
>>>
>>> struct {
>>>       __uint(type, BPF_MAP_TYPE_HASH);
>>>       __uint(max_entries, 1000);
>>>       __type(key, int);
>>>       __type(value, struct map_elem);
>>> } hmap SEC(".maps");
>>>
>>> static int timer_cb(void *map, int *key, struct map_elem *val);
>>> /* val points to particular map element that contains bpf_timer. */
>>>
>>> SEC("fentry/bpf_fentry_test1")
>>> int BPF_PROG(test1, int a)
>>> {
>>>       struct map_elem *val;
>>>       int key = 0;
>>>
>>>       val = bpf_map_lookup_elem(&hmap, &key);
>>>       if (val) {
>>>           bpf_timer_init(&val->timer, CLOCK_REALTIME);
>>>           bpf_timer_start(&val->timer, timer_cb, 1000 /* call timer_cb2 in 1 usec */);
>>>       }
>>> }
>>>
>>> This patch adds helper implementations that rely on hrtimers
>>> to call bpf functions as timers expire.
>>> The following patches add necessary safety checks.
>>>
>>> Only programs with CAP_BPF are allowed to use bpf_timer.
>>>
>>> The amount of timers used by the program is constrained by
>>> the memcg recorded at map creation time.
>>>
>>> The bpf_timer_init() helper is receiving hidden 'map' argument and
>>> bpf_timer_start() is receiving hidden 'prog' argument supplied by the verifier.
>>> The prog pointer is needed to do refcnting of bpf program to make sure that
>>> program doesn't get freed while the timer is armed. This apporach relies on
>>> "user refcnt" scheme used in prog_array that stores bpf programs for
>>> bpf_tail_call. The bpf_timer_start() will increment the prog refcnt which is
>>> paired with bpf_timer_cancel() that will drop the prog refcnt. The
>>> ops->map_release_uref is responsible for cancelling the timers and dropping
>>> prog refcnt when user space reference to a map reaches zero.
>>> This uref approach is done to make sure that Ctrl-C of user space process will
>>> not leave timers running forever unless the user space explicitly pinned a map
>>> that contained timers in bpffs.
>>>
>>> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
>>> and free the timer if given map element had it allocated.
>>> "bpftool map update" command can be used to cancel timers.
>>>
>>> The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
>>> '__u64 :64' has 1 byte alignment of 8 byte padding.
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>>    include/linux/bpf.h            |   3 +
>>>    include/uapi/linux/bpf.h       |  55 +++++++
>>>    kernel/bpf/helpers.c           | 281 +++++++++++++++++++++++++++++++++
>>>    kernel/bpf/verifier.c          | 138 ++++++++++++++++
>>>    kernel/trace/bpf_trace.c       |   2 +-
>>>    scripts/bpf_doc.py             |   2 +
>>>    tools/include/uapi/linux/bpf.h |  55 +++++++
>>>    7 files changed, 535 insertions(+), 1 deletion(-)
>>>
>> [...]
>>> @@ -12533,6 +12607,70 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>>>    			continue;
>>>    		}
>>> +		if (insn->imm == BPF_FUNC_timer_init) {
>>> +			aux = &env->insn_aux_data[i + delta];
>>> +			if (bpf_map_ptr_poisoned(aux)) {
>>> +				verbose(env, "bpf_timer_init abusing map_ptr\n");
>>> +				return -EINVAL;
>>> +			}
>>> +			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
>>> +			{
>>> +				struct bpf_insn ld_addrs[2] = {
>>> +					BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
>>> +				};
>>> +
>>> +				insn_buf[0] = ld_addrs[0];
>>> +				insn_buf[1] = ld_addrs[1];
>>> +			}
>>> +			insn_buf[2] = *insn;
>>> +			cnt = 3;
>>> +
>>> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>> +			if (!new_prog)
>>> +				return -ENOMEM;
>>> +
>>> +			delta    += cnt - 1;
>>> +			env->prog = prog = new_prog;
>>> +			insn      = new_prog->insnsi + i + delta;
>>> +			goto patch_call_imm;
>>> +		}
>>> +
>>> +		if (insn->imm == BPF_FUNC_timer_start) {
>>> +			/* There is no need to do:
>>> +			 *     aux = &env->insn_aux_data[i + delta];
>>> +			 *     if (bpf_map_ptr_poisoned(aux)) return -EINVAL;
>>> +			 * for bpf_timer_start(). If the same callback_fn is shared
>>> +			 * by different timers in different maps the poisoned check
>>> +			 * will return false positive.
>>> +			 *
>>> +			 * The verifier will process callback_fn as many times as necessary
>>> +			 * with different maps and the register states prepared by
>>> +			 * set_timer_start_callback_state will be accurate.
>>> +			 *
>>> +			 * There is no need for bpf_timer_start() to check in the
>>> +			 * run-time that bpf_hrtimer->map stored during bpf_timer_init()
>>> +			 * is the same map as in bpf_timer_start()
>>> +			 * because it's the same map element value.
>>
>> I am puzzled by above comments. Maybe you could explain more?
>> bpf_timer_start() checked whether timer is initialized with timer->timer
>> NULL check. It will proceed only if a valid timer has been
>> initialized. I think the following scenarios are also supported:
>>    1. map1 is shared by prog1 and prog2
>>    2. prog1 call bpf_timer_init for all map1 elements
>>    3. prog2 call bpf_timer_start for some or all map1 elements.
>> So for prog2 verification, bpf_timer_init() is not even called.
> 
> Right. Such timer sharing between two progs is supported.
>>From prog2 pov the bpf_timer_init() was not called, but it certainly
> had to be called by this or ther other prog.
> I'll rephrase the last paragraph.

okay.

> 
> While talking to Martin about the api he pointed out that
> callback_fn in timer_start() doesn't achieve the full use case
> of replacing a prog. So in the next spin I'll split it into
> bpf_timer_set_callback(timer, callback_fn);
> bpf_timer_start(timer, nsec);
> This way callback and prog can be replaced without resetting
> timer expiry which could be useful.

I took a brief look for patch 4-6 and it looks okay. But since
you will change helper signatures I will hold and check next
revision instead.

BTW, does this mean the following scenario will be supported?
   prog1: bpf_timer_set_callback(time, callback_fn)
   prog2: bpf_timer_start(timer, nsec)
so here prog2 can start the timer which call prog1's callback_fn?

> 
> Also Daniel and Andrii reminded that cpu pinning would be next
> feature request. The api extensibility allows to add it in the future.
> I'm going to delay implementing it until bpf_smp_call_single()
> implications are understood.

Do we need to any a 'flags' parameter for bpf_timer_start() helper
so we can encode target cpu in 'flags'?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-29  2:24       ` Yonghong Song
@ 2021-06-29  3:32         ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-29  3:32 UTC (permalink / raw)
  To: Yonghong Song; +Cc: davem, daniel, andrii, netdev, bpf, kernel-team

On Mon, Jun 28, 2021 at 07:24:28PM -0700, Yonghong Song wrote:
> 
> > 
> > While talking to Martin about the api he pointed out that
> > callback_fn in timer_start() doesn't achieve the full use case
> > of replacing a prog. So in the next spin I'll split it into
> > bpf_timer_set_callback(timer, callback_fn);
> > bpf_timer_start(timer, nsec);
> > This way callback and prog can be replaced without resetting
> > timer expiry which could be useful.
> 
> I took a brief look for patch 4-6 and it looks okay. But since
> you will change helper signatures I will hold and check next
> revision instead.

Thanks. The verifier patches won't change though.

> 
> BTW, does this mean the following scenario will be supported?
>   prog1: bpf_timer_set_callback(time, callback_fn)
>   prog2: bpf_timer_start(timer, nsec)
> so here prog2 can start the timer which call prog1's callback_fn?

right.

> > 
> > Also Daniel and Andrii reminded that cpu pinning would be next
> > feature request. The api extensibility allows to add it in the future.
> > I'm going to delay implementing it until bpf_smp_call_single()
> > implications are understood.
> 
> Do we need to any a 'flags' parameter for bpf_timer_start() helper
> so we can encode target cpu in 'flags'?

I thought that bpf_timer_init will handle the cpu selection,
but you're right that having cpu in bpf_timer_start is more flexible.
So I'll add a flag.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-29  1:46     ` Alexei Starovoitov
  2021-06-29  2:24       ` Yonghong Song
@ 2021-06-29  6:34       ` Andrii Nakryiko
  2021-06-29 13:28         ` Alexei Starovoitov
  1 sibling, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-06-29  6:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Kernel Team

On Tue, Jun 29, 2021 at 4:46 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jun 25, 2021 at 09:54:11AM -0700, Yonghong Song wrote:
> >
> >
> > On 6/23/21 7:25 PM, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> > > in hash/array/lru maps as a regular field and helpers to operate on it:
> > >
> > > // Initialize the timer.
> > > // First 4 bits of 'flags' specify clockid.
> > > // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> > > long bpf_timer_init(struct bpf_timer *timer, int flags);
> > >
> > > // Arm the timer to call callback_fn static function and set its
> > > // expiration 'nsec' nanoseconds from the current time.
> > > long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsec);
> > >
> > > // Cancel the timer and wait for callback_fn to finish if it was running.
> > > long bpf_timer_cancel(struct bpf_timer *timer);
> > >
> > > Here is how BPF program might look like:
> > > struct map_elem {
> > >      int counter;
> > >      struct bpf_timer timer;
> > > };
> > >
> > > struct {
> > >      __uint(type, BPF_MAP_TYPE_HASH);
> > >      __uint(max_entries, 1000);
> > >      __type(key, int);
> > >      __type(value, struct map_elem);
> > > } hmap SEC(".maps");
> > >
> > > static int timer_cb(void *map, int *key, struct map_elem *val);
> > > /* val points to particular map element that contains bpf_timer. */
> > >
> > > SEC("fentry/bpf_fentry_test1")
> > > int BPF_PROG(test1, int a)
> > > {
> > >      struct map_elem *val;
> > >      int key = 0;
> > >
> > >      val = bpf_map_lookup_elem(&hmap, &key);
> > >      if (val) {
> > >          bpf_timer_init(&val->timer, CLOCK_REALTIME);
> > >          bpf_timer_start(&val->timer, timer_cb, 1000 /* call timer_cb2 in 1 usec */);
> > >      }
> > > }
> > >
> > > This patch adds helper implementations that rely on hrtimers
> > > to call bpf functions as timers expire.
> > > The following patches add necessary safety checks.
> > >
> > > Only programs with CAP_BPF are allowed to use bpf_timer.
> > >
> > > The amount of timers used by the program is constrained by
> > > the memcg recorded at map creation time.
> > >
> > > The bpf_timer_init() helper is receiving hidden 'map' argument and
> > > bpf_timer_start() is receiving hidden 'prog' argument supplied by the verifier.
> > > The prog pointer is needed to do refcnting of bpf program to make sure that
> > > program doesn't get freed while the timer is armed. This apporach relies on
> > > "user refcnt" scheme used in prog_array that stores bpf programs for
> > > bpf_tail_call. The bpf_timer_start() will increment the prog refcnt which is
> > > paired with bpf_timer_cancel() that will drop the prog refcnt. The
> > > ops->map_release_uref is responsible for cancelling the timers and dropping
> > > prog refcnt when user space reference to a map reaches zero.
> > > This uref approach is done to make sure that Ctrl-C of user space process will
> > > not leave timers running forever unless the user space explicitly pinned a map
> > > that contained timers in bpffs.
> > >
> > > The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> > > and free the timer if given map element had it allocated.
> > > "bpftool map update" command can be used to cancel timers.
> > >
> > > The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
> > > '__u64 :64' has 1 byte alignment of 8 byte padding.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >   include/linux/bpf.h            |   3 +
> > >   include/uapi/linux/bpf.h       |  55 +++++++
> > >   kernel/bpf/helpers.c           | 281 +++++++++++++++++++++++++++++++++
> > >   kernel/bpf/verifier.c          | 138 ++++++++++++++++
> > >   kernel/trace/bpf_trace.c       |   2 +-
> > >   scripts/bpf_doc.py             |   2 +
> > >   tools/include/uapi/linux/bpf.h |  55 +++++++
> > >   7 files changed, 535 insertions(+), 1 deletion(-)
> > >
> > [...]
> > > @@ -12533,6 +12607,70 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >                     continue;
> > >             }
> > > +           if (insn->imm == BPF_FUNC_timer_init) {
> > > +                   aux = &env->insn_aux_data[i + delta];
> > > +                   if (bpf_map_ptr_poisoned(aux)) {
> > > +                           verbose(env, "bpf_timer_init abusing map_ptr\n");
> > > +                           return -EINVAL;
> > > +                   }
> > > +                   map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> > > +                   {
> > > +                           struct bpf_insn ld_addrs[2] = {
> > > +                                   BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
> > > +                           };
> > > +
> > > +                           insn_buf[0] = ld_addrs[0];
> > > +                           insn_buf[1] = ld_addrs[1];
> > > +                   }
> > > +                   insn_buf[2] = *insn;
> > > +                   cnt = 3;
> > > +
> > > +                   new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > +                   if (!new_prog)
> > > +                           return -ENOMEM;
> > > +
> > > +                   delta    += cnt - 1;
> > > +                   env->prog = prog = new_prog;
> > > +                   insn      = new_prog->insnsi + i + delta;
> > > +                   goto patch_call_imm;
> > > +           }
> > > +
> > > +           if (insn->imm == BPF_FUNC_timer_start) {
> > > +                   /* There is no need to do:
> > > +                    *     aux = &env->insn_aux_data[i + delta];
> > > +                    *     if (bpf_map_ptr_poisoned(aux)) return -EINVAL;
> > > +                    * for bpf_timer_start(). If the same callback_fn is shared
> > > +                    * by different timers in different maps the poisoned check
> > > +                    * will return false positive.
> > > +                    *
> > > +                    * The verifier will process callback_fn as many times as necessary
> > > +                    * with different maps and the register states prepared by
> > > +                    * set_timer_start_callback_state will be accurate.
> > > +                    *
> > > +                    * There is no need for bpf_timer_start() to check in the
> > > +                    * run-time that bpf_hrtimer->map stored during bpf_timer_init()
> > > +                    * is the same map as in bpf_timer_start()
> > > +                    * because it's the same map element value.
> >
> > I am puzzled by above comments. Maybe you could explain more?
> > bpf_timer_start() checked whether timer is initialized with timer->timer
> > NULL check. It will proceed only if a valid timer has been
> > initialized. I think the following scenarios are also supported:
> >   1. map1 is shared by prog1 and prog2
> >   2. prog1 call bpf_timer_init for all map1 elements
> >   3. prog2 call bpf_timer_start for some or all map1 elements.
> > So for prog2 verification, bpf_timer_init() is not even called.
>
> Right. Such timer sharing between two progs is supported.
> From prog2 pov the bpf_timer_init() was not called, but it certainly
> had to be called by this or ther other prog.
> I'll rephrase the last paragraph.
>
> While talking to Martin about the api he pointed out that
> callback_fn in timer_start() doesn't achieve the full use case
> of replacing a prog. So in the next spin I'll split it into
> bpf_timer_set_callback(timer, callback_fn);
> bpf_timer_start(timer, nsec);
> This way callback and prog can be replaced without resetting
> timer expiry which could be useful.

Have you considered alternatively to implement something like
bpf_ringbuf_query() for BPF ringbuf that will allow to query various
things about the timer (e.g., whether it is active or not, and, of
course, remaining expiry time). That will be more general, easier to
extend, and will cover this use case:

long exp = bpf_timer_query(&t->timer, BPF_TIMER_EXPIRY);
bpf_timer_start(&t->timer, new_callback, exp);

This will keep common timer scenarios to just two steps, init + start,
but won't prevent more complicated ones. Things like extending
expiration by one second relative that what was remaining will be
possible as well.

>
> Also Daniel and Andrii reminded that cpu pinning would be next


Awesome!

> feature request. The api extensibility allows to add it in the future.
> I'm going to delay implementing it until bpf_smp_call_single()
> implications are understood.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-29  6:34       ` Andrii Nakryiko
@ 2021-06-29 13:28         ` Alexei Starovoitov
  2021-06-30 10:08           ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-29 13:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Kernel Team

On Mon, Jun 28, 2021 at 11:34 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Have you considered alternatively to implement something like
> bpf_ringbuf_query() for BPF ringbuf that will allow to query various
> things about the timer (e.g., whether it is active or not, and, of
> course, remaining expiry time). That will be more general, easier to
> extend, and will cover this use case:
>
> long exp = bpf_timer_query(&t->timer, BPF_TIMER_EXPIRY);
> bpf_timer_start(&t->timer, new_callback, exp);

yes, but...
hrtimer_get_remaining + timer_start to that value is racy
and not accurate.
hrtimer_get_expires_ns + timer_start(MODE_ABS)
would be accurate, but that's an unnecessary complication.
To live replace old bpf prog with new one
bpf_for_each_map_elem() { bpf_timer_set_callback(new_prog); }
is much faster, since timers don't need to be dequeue, enqueue.
No need to worry about hrtimer machinery internal changes, etc.
bpf prog being replaced shouldn't be affecting the rest of the system.

> This will keep common timer scenarios to just two steps, init + start,
> but won't prevent more complicated ones. Things like extending
> expiration by one second relative that what was remaining will be
> possible as well.

Extending expiration would be more accurate with hrtimer_forward_now().

All of the above points are minor compared to the verifier advantage.
bpf_timer_set_callback() typically won't be called from the callback.
So verifier's insn_procssed will be drastically lower.
The combinatorial explosion of states even for this small
selftests/bpf/progs/timer.c is significant.
With bpf_timer_set_callback() is done outside of callback the verifier
behavior will be predictable.
To some degree patches 4-6 could have been delayed, but since the
the algo is understood and it's working, I'm going to keep them.
It's nice to have that flexibility, but the less pressure on the
verifier the better.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-29 13:28         ` Alexei Starovoitov
@ 2021-06-30 10:08           ` Andrii Nakryiko
  2021-06-30 17:38             ` Alexei Starovoitov
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-06-30 10:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Kernel Team

On Tue, Jun 29, 2021 at 4:28 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 28, 2021 at 11:34 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > Have you considered alternatively to implement something like
> > bpf_ringbuf_query() for BPF ringbuf that will allow to query various
> > things about the timer (e.g., whether it is active or not, and, of
> > course, remaining expiry time). That will be more general, easier to
> > extend, and will cover this use case:
> >
> > long exp = bpf_timer_query(&t->timer, BPF_TIMER_EXPIRY);
> > bpf_timer_start(&t->timer, new_callback, exp);
>
> yes, but...
> hrtimer_get_remaining + timer_start to that value is racy
> and not accurate.

yes, but even though we specify expiration in nanosecond precision, no
one should expect that precision w.r.t. when callback is actually
fired. So fetching current expiration, adding new one, and re-setting
it shouldn't be a problem in practice, IMO.

I just think the most common case is to set a timer once, so ideally
usability is optimized for that (so taken to extreme it would be just
bpf_timer_start without any bpf_timer_init, but we've already
discussed this, no need to do that again here). Needing bpf_timer_init
+ bpf_timer_set_callbcack + bpf_timer_start for a common case feels
suboptimal usability-wise.

There is also a new race with bpf_timer_set_callback +
bpf_timer_start. Callback can fire inbetween those two operations, so
we could get new callback at old expiration or old callback with new
expiration. To do full update reliably, you'd need to explicitly
bpf_timer_cancel() first, at which point separate
bpf_timer_set_callback() doesn't help at all.

> hrtimer_get_expires_ns + timer_start(MODE_ABS)
> would be accurate, but that's an unnecessary complication.
> To live replace old bpf prog with new one
> bpf_for_each_map_elem() { bpf_timer_set_callback(new_prog); }
> is much faster, since timers don't need to be dequeue, enqueue.
> No need to worry about hrtimer machinery internal changes, etc.
> bpf prog being replaced shouldn't be affecting the rest of the system.

That's a good property, but if it was done as a
bpf_timer_set_callback() in addition to current
bpf_timer_start(callback_fn) it would still allow to have a simple
typical use.

Another usability consideration. With mandatory
bpf_timer_set_callback(), bpf_timer_start() will need to return some
error code if the callback wasn't set yet, right? I'm afraid that in
practice it will be the situation similar to bpf_trace_printk() where
people expect that it always succeeds and will never check the return
code. It's obviously debuggable, but a friction point nevertheless.

>
> > This will keep common timer scenarios to just two steps, init + start,
> > but won't prevent more complicated ones. Things like extending
> > expiration by one second relative that what was remaining will be
> > possible as well.
>
> Extending expiration would be more accurate with hrtimer_forward_now().
>
> All of the above points are minor compared to the verifier advantage.
> bpf_timer_set_callback() typically won't be called from the callback.
> So verifier's insn_procssed will be drastically lower.
> The combinatorial explosion of states even for this small
> selftests/bpf/progs/timer.c is significant.
> With bpf_timer_set_callback() is done outside of callback the verifier
> behavior will be predictable.
> To some degree patches 4-6 could have been delayed, but since the
> the algo is understood and it's working, I'm going to keep them.
> It's nice to have that flexibility, but the less pressure on the
> verifier the better.

I haven't had time to understand those new patches yet, sorry, so not
sure where the state explosion is coming from. I'll get to it for real
next week. But improving verifier internals can be done transparently,
while changing/fixing BPF UAPI is much harder and more disruptive.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-30 10:08           ` Andrii Nakryiko
@ 2021-06-30 17:38             ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-06-30 17:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Networking, bpf, Kernel Team

On Wed, Jun 30, 2021 at 01:08:08PM +0300, Andrii Nakryiko wrote:
> On Tue, Jun 29, 2021 at 4:28 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 28, 2021 at 11:34 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > Have you considered alternatively to implement something like
> > > bpf_ringbuf_query() for BPF ringbuf that will allow to query various
> > > things about the timer (e.g., whether it is active or not, and, of
> > > course, remaining expiry time). That will be more general, easier to
> > > extend, and will cover this use case:
> > >
> > > long exp = bpf_timer_query(&t->timer, BPF_TIMER_EXPIRY);
> > > bpf_timer_start(&t->timer, new_callback, exp);
> >
> > yes, but...
> > hrtimer_get_remaining + timer_start to that value is racy
> > and not accurate.
> 
> yes, but even though we specify expiration in nanosecond precision, no
> one should expect that precision w.r.t. when callback is actually
> fired. So fetching current expiration, adding new one, and re-setting
> it shouldn't be a problem in practice, IMO.

Just because we're dealing with time? Sounds sloppy. I suspect RT
folks take pride to make nsec precision as accurate as possible.

> I just think the most common case is to set a timer once, so ideally
> usability is optimized for that (so taken to extreme it would be just
> bpf_timer_start without any bpf_timer_init, but we've already
> discussed this, no need to do that again here). Needing bpf_timer_init
> + bpf_timer_set_callbcack + bpf_timer_start for a common case feels
> suboptimal usability-wise.

init+set+start could be one helper. See more below.

> 
> There is also a new race with bpf_timer_set_callback +
> bpf_timer_start. Callback can fire inbetween those two operations, so
> we could get new callback at old expiration or old callback with new
> expiration. 

sure, but that's a different issue.
When XDP prog is being replaced some packets might hit old one
even though there is an atomic replace of the pointer to a prog.
Two XDP progs and two timer callbacks running on different cpus
is an inevitable situation.

> To do full update reliably, you'd need to explicitly
> bpf_timer_cancel() first, at which point separate
> bpf_timer_set_callback() doesn't help at all.
> 
> > hrtimer_get_expires_ns + timer_start(MODE_ABS)
> > would be accurate, but that's an unnecessary complication.
> > To live replace old bpf prog with new one
> > bpf_for_each_map_elem() { bpf_timer_set_callback(new_prog); }
> > is much faster, since timers don't need to be dequeue, enqueue.
> > No need to worry about hrtimer machinery internal changes, etc.
> > bpf prog being replaced shouldn't be affecting the rest of the system.
> 
> That's a good property, but if it was done as a
> bpf_timer_set_callback() in addition to current
> bpf_timer_start(callback_fn) it would still allow to have a simple
> typical use.
> 
> Another usability consideration. With mandatory
> bpf_timer_set_callback(), bpf_timer_start() will need to return some
> error code if the callback wasn't set yet, right? I'm afraid that in
> practice it will be the situation similar to bpf_trace_printk() where
> people expect that it always succeeds and will never check the return
> code. It's obviously debuggable, but a friction point nevertheless.

It sucks that you had this printk experience. We screwed up.
It's definitely something to watch out for in the future.
But this analogy doesn't apply here.
bpf_timer_init/set_callback/start/cancel matches one to one to hrtimer api.
In earlier patches the callback setting was part of 'init' and then later
it was part of 'start'. imo that was 'reinvent the wheel' moment.
Not sure why such simple and elegant solution as indepdent
bpf_timer_set_callback wasn't obvious earlier.
There are tons of examples of hrtimer usage in the kernel
and it's safe to assume that bpf usage will be similar.
Typically the kernel does init + set_callback once and then start/cancel a lot.
Including doing 'start' from the callback.
I found one driver where callback is being selected dynamically.
So even without 'bpf prog live replace' use case there could be
use cases for dynamic set_callback for bpf timers too.
In all cases I've examined the 'start' is the most used function.
Coupling it with setting callback looks quite wrong to me from api pov.
Like there are examples in the kernel where 'start' is done in one .c file
while callback is defined in a different .c file.
Doing 'extern .. callback();' just to pass it into hrimter_start()
would have been ugly. Same thing we can expect to happen with bpf timers.

But if you really really want bpf_timer_start with callback I wouldn't
mind to have:
static inline int bpf_timer_start2(timer, callback, nsec)
{
  int err = bpf_timer_set_callback(timer, callback);
  if (err)...
  err = bpf_timer_start(timer, nsec, 0);
  ...
}
to be defined in libbpf's bpf_helpers.h file.
That could be a beginning of new way of defining helpers.
But based on the kernel usage of the hrimter I suspect that this helper
won't be used too much and people will stick to independent
steps to set callback and start it.

There could be a helper that does init+set+start too.

> >
> > > This will keep common timer scenarios to just two steps, init + start,
> > > but won't prevent more complicated ones. Things like extending
> > > expiration by one second relative that what was remaining will be
> > > possible as well.
> >
> > Extending expiration would be more accurate with hrtimer_forward_now().
> >
> > All of the above points are minor compared to the verifier advantage.
> > bpf_timer_set_callback() typically won't be called from the callback.
> > So verifier's insn_procssed will be drastically lower.
> > The combinatorial explosion of states even for this small
> > selftests/bpf/progs/timer.c is significant.
> > With bpf_timer_set_callback() is done outside of callback the verifier
> > behavior will be predictable.
> > To some degree patches 4-6 could have been delayed, but since the
> > the algo is understood and it's working, I'm going to keep them.
> > It's nice to have that flexibility, but the less pressure on the
> > verifier the better.
> 
> I haven't had time to understand those new patches yet, sorry, so not
> sure where the state explosion is coming from. I'll get to it for real
> next week. But improving verifier internals can be done transparently,
> while changing/fixing BPF UAPI is much harder and more disruptive.

It's not about the inadequate implementation of the async callback
verification in patches 4-6 (as you're hinting).
It's path aware property of the verifier that requires more passes
when callback is set from callback. Even with brand new verifier 2.0
the more passes issue will remain the same (unless it's not path
aware and can merge different branches and states).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
  2021-06-25  6:25   ` Yonghong Song
  2021-06-25 16:54   ` Yonghong Song
@ 2021-07-01  5:40   ` Alexei Starovoitov
  2021-07-01 11:51     ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-07-01  5:40 UTC (permalink / raw)
  To: David S. Miller, Martin KaFai Lau, Yonghong Song
  Cc: Daniel Borkmann, Andrii Nakryiko, Network Development, bpf, Kernel Team

On Wed, Jun 23, 2021 at 7:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> The bpf_timer_init() helper is receiving hidden 'map' argument and
...
> +               if (insn->imm == BPF_FUNC_timer_init) {
> +                       aux = &env->insn_aux_data[i + delta];
> +                       if (bpf_map_ptr_poisoned(aux)) {
> +                               verbose(env, "bpf_timer_init abusing map_ptr\n");
> +                               return -EINVAL;
> +                       }
> +                       map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> +                       {
> +                               struct bpf_insn ld_addrs[2] = {
> +                                       BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
> +                               };

After a couple of hours of ohh so painful debugging I realized that this
approach doesn't work for inner maps. Duh.
For inner maps it remembers inner_map_meta which is a template
of inner map.
Then bpf_timer_cb() passes map ptr into timer callback and if it tries
to do map operations on it the inner_map_meta->ops will be valid,
but the struct bpf_map doesn't have the actual data.
So to support map-in-map we need to require users to pass map pointer
explicitly into bpf_timer_init().
Unfortunately the verifier cannot guarantee that bpf timer field inside
map element is from the same map that is passed as a map ptr.
The verifier can check that they're equivalent from safety pov
via bpf_map_meta_equal(), so common user mistakes will be caught by it.
Still not pretty that it's partially on the user to do:
bpf_timer_init(timer, CLOCK, map);
with 'timer' matching the 'map'.
Another option is to drop 'map' arg from timer callback,
but the usability of the callback will suffer. The inner maps
will be quite painful to use from it.
Anyway I'm going with explicit 'map' arg in the next respin.
Other ideas?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-07-01  5:40   ` Alexei Starovoitov
@ 2021-07-01 11:51     ` Toke Høiland-Jørgensen
  2021-07-01 15:34       ` Alexei Starovoitov
  0 siblings, 1 reply; 28+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-07-01 11:51 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller, Martin KaFai Lau, Yonghong Song
  Cc: Daniel Borkmann, Andrii Nakryiko, Network Development, bpf, Kernel Team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Jun 23, 2021 at 7:25 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> The bpf_timer_init() helper is receiving hidden 'map' argument and
> ...
>> +               if (insn->imm == BPF_FUNC_timer_init) {
>> +                       aux = &env->insn_aux_data[i + delta];
>> +                       if (bpf_map_ptr_poisoned(aux)) {
>> +                               verbose(env, "bpf_timer_init abusing map_ptr\n");
>> +                               return -EINVAL;
>> +                       }
>> +                       map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
>> +                       {
>> +                               struct bpf_insn ld_addrs[2] = {
>> +                                       BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
>> +                               };
>
> After a couple of hours of ohh so painful debugging I realized that this
> approach doesn't work for inner maps. Duh.
> For inner maps it remembers inner_map_meta which is a template
> of inner map.
> Then bpf_timer_cb() passes map ptr into timer callback and if it tries
> to do map operations on it the inner_map_meta->ops will be valid,
> but the struct bpf_map doesn't have the actual data.
> So to support map-in-map we need to require users to pass map pointer
> explicitly into bpf_timer_init().
> Unfortunately the verifier cannot guarantee that bpf timer field inside
> map element is from the same map that is passed as a map ptr.
> The verifier can check that they're equivalent from safety pov
> via bpf_map_meta_equal(), so common user mistakes will be caught by it.
> Still not pretty that it's partially on the user to do:
> bpf_timer_init(timer, CLOCK, map);
> with 'timer' matching the 'map'.

The implication being that if they don't match, the callback will just
get a different argument and it'll be up to the developer to deal with
any bugs arising from that?

> Another option is to drop 'map' arg from timer callback,
> but the usability of the callback will suffer. The inner maps
> will be quite painful to use from it.
> Anyway I'm going with explicit 'map' arg in the next respin.
> Other ideas?

So the problem here is that the inner map pointer is not known at
verification time but only at runtime? Could the verifier inject code to
always spill inner map pointers to a known area of the stack after a
map-in-map lookup, and then just load them back from there when needed?
Not sure that would be worth the complexity (and overhead!), though;
having to supply an explicit callback arg is not that uncommon a pattern
after all...

-Toke


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
  2021-07-01 11:51     ` Toke Høiland-Jørgensen
@ 2021-07-01 15:34       ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-07-01 15:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S. Miller, Martin KaFai Lau, Yonghong Song,
	Daniel Borkmann, Andrii Nakryiko, Network Development, bpf,
	Kernel Team

On Thu, Jul 01, 2021 at 01:51:04PM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Wed, Jun 23, 2021 at 7:25 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> The bpf_timer_init() helper is receiving hidden 'map' argument and
> > ...
> >> +               if (insn->imm == BPF_FUNC_timer_init) {
> >> +                       aux = &env->insn_aux_data[i + delta];
> >> +                       if (bpf_map_ptr_poisoned(aux)) {
> >> +                               verbose(env, "bpf_timer_init abusing map_ptr\n");
> >> +                               return -EINVAL;
> >> +                       }
> >> +                       map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> >> +                       {
> >> +                               struct bpf_insn ld_addrs[2] = {
> >> +                                       BPF_LD_IMM64(BPF_REG_3, (long)map_ptr),
> >> +                               };
> >
> > After a couple of hours of ohh so painful debugging I realized that this
> > approach doesn't work for inner maps. Duh.
> > For inner maps it remembers inner_map_meta which is a template
> > of inner map.
> > Then bpf_timer_cb() passes map ptr into timer callback and if it tries
> > to do map operations on it the inner_map_meta->ops will be valid,
> > but the struct bpf_map doesn't have the actual data.
> > So to support map-in-map we need to require users to pass map pointer
> > explicitly into bpf_timer_init().
> > Unfortunately the verifier cannot guarantee that bpf timer field inside
> > map element is from the same map that is passed as a map ptr.
> > The verifier can check that they're equivalent from safety pov
> > via bpf_map_meta_equal(), so common user mistakes will be caught by it.
> > Still not pretty that it's partially on the user to do:
> > bpf_timer_init(timer, CLOCK, map);
> > with 'timer' matching the 'map'.
> 
> The implication being that if they don't match, the callback will just
> get a different argument and it'll be up to the developer to deal with
> any bugs arising from that?

Right. The kernel won't crash, of course.

> > Another option is to drop 'map' arg from timer callback,
> > but the usability of the callback will suffer. The inner maps
> > will be quite painful to use from it.
> > Anyway I'm going with explicit 'map' arg in the next respin.
> > Other ideas?
> 
> So the problem here is that the inner map pointer is not known at
> verification time but only at runtime? Could the verifier inject code to

yep.

> always spill inner map pointers to a known area of the stack after a
> map-in-map lookup, and then just load them back from there when needed?

interesting idea. That made me thinking that the verifier has
"map_lookup tracking" ability with increasing reg->id.
Since in some cases we had to distinguish that
val1 = map_lookup(map1, key1);
val2 = map_lookup(map1, key1);
val1 != val2, though they could be from the same map and key.
Maybe building on top of that feature will address the map vs timer
equivalence issue.

> Not sure that would be worth the complexity (and overhead!), though;
> having to supply an explicit callback arg is not that uncommon a pattern
> after all...
> 
> -Toke
> 

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-07-01 15:34 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-24  2:25 [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers Alexei Starovoitov
2021-06-25  6:25   ` Yonghong Song
2021-06-25 14:57     ` Alexei Starovoitov
2021-06-25 15:54       ` Yonghong Song
2021-06-29  1:39         ` Alexei Starovoitov
2021-06-25 16:54   ` Yonghong Song
2021-06-29  1:46     ` Alexei Starovoitov
2021-06-29  2:24       ` Yonghong Song
2021-06-29  3:32         ` Alexei Starovoitov
2021-06-29  6:34       ` Andrii Nakryiko
2021-06-29 13:28         ` Alexei Starovoitov
2021-06-30 10:08           ` Andrii Nakryiko
2021-06-30 17:38             ` Alexei Starovoitov
2021-07-01  5:40   ` Alexei Starovoitov
2021-07-01 11:51     ` Toke Høiland-Jørgensen
2021-07-01 15:34       ` Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 2/8] bpf: Add map side support for " Alexei Starovoitov
2021-06-25 19:46   ` Yonghong Song
2021-06-29  1:49     ` Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 3/8] bpf: Remember BTF of inner maps Alexei Starovoitov
2021-06-29  1:45   ` Yonghong Song
2021-06-24  2:25 ` [PATCH v3 bpf-next 4/8] bpf: Relax verifier recursion check Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 5/8] bpf: Implement verifier support for validation of async callbacks Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 6/8] bpf: Teach stack depth check about " Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 7/8] selftests/bpf: Add bpf_timer test Alexei Starovoitov
2021-06-24  2:25 ` [PATCH v3 bpf-next 8/8] selftests/bpf: Add a test with bpf_timer in inner map Alexei Starovoitov
2021-06-24 11:27 ` [PATCH v3 bpf-next 0/8] bpf: Introduce BPF timers Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).