All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers.
@ 2021-06-11  4:24 Alexei Starovoitov
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-11  4:24 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

v1->v2:
- Addressed great feedback from Andrii and Toke.
- Fixed race between parallel bpf_timer_*() ops.
- Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
- Disallowed mmap and global timers.
- Allow bpf_spin_lock and bpf_timer in an element. One of each.
- Fixed memory leaks due to map destruction and LRU eviction.
- Add support for specifying clockid in bpf_timer_init.
- Make bpf_timer helpers gpl_only.
- Fix key pointer in callback_fn when bpf_timer is inside array.
- A ton more tests.

The 1st patch implements interaction between bpf programs and bpf core.
The 2nd patch implements necessary safety checks.
The 3rd patch is the test.

Alexei Starovoitov (3):
  bpf: Introduce bpf_timer
  bpf: Add verifier checks for bpf_timer.
  selftests/bpf: Add bpf_timer test.

 include/linux/bpf.h                           |  47 ++-
 include/linux/btf.h                           |   1 +
 include/uapi/linux/bpf.h                      |  40 +++
 kernel/bpf/arraymap.c                         |  14 +
 kernel/bpf/btf.c                              |  77 ++++-
 kernel/bpf/hashtab.c                          |  56 +++-
 kernel/bpf/helpers.c                          | 227 ++++++++++++++
 kernel/bpf/local_storage.c                    |   4 +-
 kernel/bpf/map_in_map.c                       |   1 +
 kernel/bpf/syscall.c                          |  21 +-
 kernel/bpf/verifier.c                         | 133 ++++++++
 kernel/trace/bpf_trace.c                      |   2 +-
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  40 +++
 .../testing/selftests/bpf/prog_tests/timer.c  |  55 ++++
 tools/testing/selftests/bpf/progs/timer.c     | 293 ++++++++++++++++++
 16 files changed, 970 insertions(+), 43 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  4:24 [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
@ 2021-06-11  4:24 ` Alexei Starovoitov
  2021-06-11  6:42   ` Cong Wang
                     ` (4 more replies)
  2021-06-11  4:24 ` [PATCH v2 bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
                   ` (2 subsequent siblings)
  3 siblings, 5 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-11  4:24 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as regular field and helpers to operate on it:

// Initialize the timer to call 'callback_fn' static function
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);

// Start the timer and set its expiration 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, u64 nsec);

// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);

Here is how BPF program might look like:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
        bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
    }
}

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patch adds necessary safety checks.

Only programs with CAP_BPF are allowed to use bpf_timer.

The amount of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
supplied by the verifier. The prog pointer is needed to do refcnting of bpf
program to make sure that program doesn't get freed while timer is armed.

The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if given map element had it allocated.
"bpftool map update" command can be used to cancel timers.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |  40 ++++++
 kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 109 ++++++++++++++++
 kernel/trace/bpf_trace.c       |   2 +-
 scripts/bpf_doc.py             |   2 +
 tools/include/uapi/linux/bpf.h |  40 ++++++
 7 files changed, 421 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 86dec5001ae2..3816b6bae6d3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -221,6 +221,7 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 }
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
+void bpf_timer_cancel_and_free(void *timer);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
@@ -314,6 +315,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
 	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
+	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
 	__BPF_ARG_TYPE_MAX,
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2c1ba70abbf1..d25bbcdad8e6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4778,6 +4778,38 @@ union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
+ *	Description
+ *		Initialize the timer to call *callback_fn* static function.
+ *		First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
+ *		CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
+ *		All other bits of *flags* are reserved.
+ *	Return
+ *		0 on success.
+ *		**-EBUSY** if *timer* is already initialized.
+ *		**-EINVAL** if invalid *flags* are passed.
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
+ *	Description
+ *		Start the timer and set its expiration N nanoseconds from the
+ *		current time. The timer callback_fn will be invoked in soft irq
+ *		context on some cpu and will not repeat unless another
+ *		bpf_timer_start() is made. In such case the next invocation can
+ *		migrate to a different cpu.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Cancel the timer and wait for callback_fn to finish if it was running.
+ *	Return
+ *		0 if the timer was not active.
+ *		1 if the timer was active.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
+ *		which would have led to a deadlock otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4949,6 +4981,9 @@ union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6061,6 +6096,11 @@ struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+	__u64 :64;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 544773970dbc..3a693d451ca3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -985,6 +985,227 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+struct bpf_hrtimer {
+	struct hrtimer timer;
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *callback_fn;
+	void *value;
+};
+
+/* the actual struct hidden inside uapi struct bpf_timer */
+struct bpf_timer_kern {
+	struct bpf_hrtimer *timer;
+	struct bpf_spin_lock lock;
+};
+
+static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
+
+static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
+{
+	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
+	struct bpf_prog *prog = t->prog;
+	struct bpf_map *map = t->map;
+	void *key;
+	u32 idx;
+	int ret;
+
+	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
+	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
+	 * Remember the timer this callback is servicing to prevent
+	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
+	 */
+	this_cpu_write(hrtimer_running, t);
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		/* compute the key */
+		idx = ((char *)t->value - array->value) / array->elem_size;
+		key = &idx;
+	} else { /* hash or lru */
+		key = t->value - round_up(map->key_size, 8);
+	}
+
+	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
+					    (u64)(long)key,
+					    (u64)(long)t->value, 0, 0);
+	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
+
+	/* The bpf function finished executed. Drop the prog refcnt.
+	 * It could reach zero here and trigger free of bpf_prog
+	 * and subsequent free of the maps that were holding timers.
+	 * If callback_fn called bpf_timer_start on this timer
+	 * the prog refcnt will be > 0.
+	 *
+	 * If callback_fn deleted map element the 't' could have been freed,
+	 * hence t->prog deref is done earlier.
+	 */
+	bpf_prog_put(prog);
+	this_cpu_write(hrtimer_running, NULL);
+	return HRTIMER_NORESTART;
+}
+
+BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
+	   struct bpf_map *, map, struct bpf_prog *, prog)
+{
+	clockid_t clockid = flags & (MAX_CLOCKS - 1);
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	BUILD_BUG_ON(MAX_CLOCKS != 16);
+	if (flags >= MAX_CLOCKS ||
+	    /* similar to timerfd except _ALARM variants are not supported */
+            (clockid != CLOCK_MONOTONIC &&
+             clockid != CLOCK_REALTIME &&
+             clockid != CLOCK_BOOTTIME))
+		return -EINVAL;
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (t) {
+		ret = -EBUSY;
+		goto out;
+	}
+	/* allocate hrtimer via map_kmalloc to use memcg accounting */
+	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
+	if (!t) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	t->callback_fn = cb;
+	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
+	t->map = map;
+	t->prog = prog;
+	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+	t->timer.function = bpf_timer_cb;
+	timer->timer = t;
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_init_proto = {
+	.func		= bpf_timer_init,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
+{
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
+		/* If the timer wasn't active or callback already executing
+		 * bump the prog refcnt to keep it alive until
+		 * callback is invoked (again).
+		 */
+		bpf_prog_inc(t->prog);
+	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_start_proto = {
+	.func		= bpf_timer_start,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
+{
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (this_cpu_read(hrtimer_running) == t) {
+		/* If bpf callback_fn is trying to bpf_timer_cancel()
+		 * its own timer the hrtimer_cancel() will deadlock
+		 * since it waits for callback_fn to finish
+		 */
+		ret = -EDEADLK;
+		goto out;
+	}
+	/* Cancel the timer and wait for associated callback to finish
+	 * if it was running.
+	 */
+	if (hrtimer_cancel(&t->timer) == 1) {
+		/* If the timer was active then drop the prog refcnt,
+		 * since callback will not be invoked.
+		 */
+		bpf_prog_put(t->prog);
+		ret = 1;
+	}
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_cancel_proto = {
+	.func		= bpf_timer_cancel,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+};
+
+/* This function is called by delete_element in htab and lru maps
+ * and by map_free for array, lru, htab maps.
+ */
+void bpf_timer_cancel_and_free(void *val)
+{
+	struct bpf_timer_kern *timer = val;
+	struct bpf_hrtimer *t;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t)
+		goto out;
+	/* Cancel the timer and wait for callback to complete if it was
+	 * running. Only individual delete_element in htab or lru maps can
+	 * return 1 from hrtimer_cancel.
+	 * The whole map is destroyed when its refcnt reaches zero.
+	 * That happens after bpf prog refcnt reaches zero.
+	 * bpf prog refcnt will not reach zero until all timers are executed.
+	 * So when maps are destroyed hrtimer_cancel will surely return 0.
+	 * In such case t->prog is a pointer to freed memory.
+	 *
+	 * When htab or lru is deleting individual element check that
+	 * bpf_map_delete_elem() isn't trying to delete elem with running timer.
+	 * In such case don't call hrtimer_cancel() (since it will deadlock)
+	 * and don't call hrtimer_try_to_cancel() (since it will just return -1).
+	 * Instead free the timer and set timer->timer = NULL.
+	 * The subsequent bpf_timer_start/cancel() helpers won't be able to use it.
+	 * In preallocated maps it's safe to do timer->timer = NULL.
+	 * The memory could be reused for another element while current timer
+	 * callback can still do bpf_timer_init() on it.
+	 * In non-preallocated maps timer->timer = NULL will happen after
+	 * callback completes, since prog execution is an RCU critical section.
+	 */
+	if (this_cpu_read(hrtimer_running) != t &&
+	    hrtimer_cancel(&t->timer) == 1)
+		bpf_prog_put(t->prog);
+	kfree(t);
+	timer->timer = NULL;
+out:
+	____bpf_spin_unlock(&timer->lock);
+}
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
@@ -1051,6 +1272,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_per_cpu_ptr_proto;
 	case BPF_FUNC_this_cpu_ptr:
 		return &bpf_this_cpu_ptr_proto;
+	case BPF_FUNC_timer_init:
+		return &bpf_timer_init_proto;
+	case BPF_FUNC_timer_start:
+		return &bpf_timer_start_proto;
+	case BPF_FUNC_timer_cancel:
+		return &bpf_timer_cancel_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1de4b8c6ee42..44ec9760b562 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4656,6 +4656,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_timer_func(struct bpf_verifier_env *env, int regno,
+			      struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (val) {
+		/* This restriction will be removed in the next patch */
+		verbose(env, "bpf_timer field can only be first in the map value element\n");
+		return -EINVAL;
+	}
+	WARN_ON(meta->map_ptr);
+	meta->map_ptr = map;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -4788,6 +4817,7 @@ static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
 static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -4819,6 +4849,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
 	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
+	[ARG_PTR_TO_TIMER]		= &timer_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -5000,6 +5031,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "verifier internal error\n");
 			return -EFAULT;
 		}
+	} else if (arg_type == ARG_PTR_TO_TIMER) {
+		if (process_timer_func(env, regno, meta))
+			return -EACCES;
 	} else if (arg_type == ARG_PTR_TO_FUNC) {
 		meta->subprogno = reg->subprogno;
 	} else if (arg_type_is_mem_ptr(arg_type)) {
@@ -5742,6 +5776,43 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_timer_init_callback_state(struct bpf_verifier_env *env,
+					 struct bpf_func_state *caller,
+					 struct bpf_func_state *callee,
+					 int insn_idx)
+{
+	struct bpf_insn_aux_data *insn_aux = &env->insn_aux_data[insn_idx];
+	struct bpf_map *map_ptr;
+
+	if (bpf_map_ptr_poisoned(insn_aux)) {
+		verbose(env, "bpf_timer_init abusing map_ptr\n");
+		return -EINVAL;
+	}
+
+	map_ptr = BPF_MAP_PTR(insn_aux->map_ptr_state);
+
+	/* bpf_timer_init(struct bpf_timer *timer, void *callback_fn, u64 flags);
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
@@ -5837,6 +5908,7 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	    func_id != BPF_FUNC_map_pop_elem &&
 	    func_id != BPF_FUNC_map_peek_elem &&
 	    func_id != BPF_FUNC_for_each_map_elem &&
+	    func_id != BPF_FUNC_timer_init &&
 	    func_id != BPF_FUNC_redirect_map)
 		return 0;
 
@@ -6069,6 +6141,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_timer_init) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_timer_init_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_snprintf) {
 		err = check_bpf_snprintf_call(env, regs);
 		if (err < 0)
@@ -12526,6 +12605,36 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			insn      = new_prog->insnsi + i + delta;
 			continue;
 		}
+		if (insn->imm == BPF_FUNC_timer_init) {
+			aux = &env->insn_aux_data[i + delta];
+			if (bpf_map_ptr_poisoned(aux)) {
+				verbose(env, "bpf_timer_init abusing map_ptr\n");
+				return -EINVAL;
+			}
+			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
+			{
+				struct bpf_insn ld_addrs[4] = {
+					BPF_LD_IMM64(BPF_REG_4, (long)map_ptr),
+					BPF_LD_IMM64(BPF_REG_5, (long)prog),
+				};
+
+				insn_buf[0] = ld_addrs[0];
+				insn_buf[1] = ld_addrs[1];
+				insn_buf[2] = ld_addrs[2];
+				insn_buf[3] = ld_addrs[3];
+			}
+			insn_buf[4] = *insn;
+			cnt = 5;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
 
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..453a46c2d732 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1065,7 +1065,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_snprintf:
 		return &bpf_snprintf_proto;
 	default:
-		return NULL;
+		return bpf_base_func_proto(func_id);
 	}
 }
 
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 2d94025b38e9..00ac7b79cddb 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -547,6 +547,7 @@ COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     ]
     known_types = {
             '...',
@@ -594,6 +595,7 @@ COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2c1ba70abbf1..d25bbcdad8e6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4778,6 +4778,38 @@ union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
+ *	Description
+ *		Initialize the timer to call *callback_fn* static function.
+ *		First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
+ *		CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
+ *		All other bits of *flags* are reserved.
+ *	Return
+ *		0 on success.
+ *		**-EBUSY** if *timer* is already initialized.
+ *		**-EINVAL** if invalid *flags* are passed.
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
+ *	Description
+ *		Start the timer and set its expiration N nanoseconds from the
+ *		current time. The timer callback_fn will be invoked in soft irq
+ *		context on some cpu and will not repeat unless another
+ *		bpf_timer_start() is made. In such case the next invocation can
+ *		migrate to a different cpu.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Cancel the timer and wait for callback_fn to finish if it was running.
+ *	Return
+ *		0 if the timer was not active.
+ *		1 if the timer was active.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
+ *		which would have led to a deadlock otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4949,6 +4981,9 @@ union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6061,6 +6096,11 @@ struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+	__u64 :64;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 bpf-next 2/3] bpf: Add verifier checks for bpf_timer.
  2021-06-11  4:24 [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
@ 2021-06-11  4:24 ` Alexei Starovoitov
  2021-06-11  4:24 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add bpf_timer test Alexei Starovoitov
  2021-06-11  6:47 ` [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Cong Wang
  3 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-11  4:24 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add appropriate safety checks for bpf_timer:
- restrict to array, hash, lru. per-cpu maps cannot be supported.
- kfree bpf_timer during map_delete_elem and map_free.
- verifier btf checks.
- safe interaction with lookup/update/delete operations and iterator.
- relax the first field only requirement of the previous patch.
- search for bpf_timer in datasec in case it was placed in global data.
- check prog_rdonly, frozen flags.
- mmap is not allowed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        | 45 ++++++++++++++++------
 include/linux/btf.h        |  1 +
 kernel/bpf/arraymap.c      | 14 +++++++
 kernel/bpf/btf.c           | 77 +++++++++++++++++++++++++++++++-------
 kernel/bpf/hashtab.c       | 56 +++++++++++++++++++++------
 kernel/bpf/helpers.c       |  2 +-
 kernel/bpf/local_storage.c |  4 +-
 kernel/bpf/map_in_map.c    |  1 +
 kernel/bpf/syscall.c       | 21 +++++++++--
 kernel/bpf/verifier.c      | 30 +++++++++++++--
 10 files changed, 205 insertions(+), 46 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3816b6bae6d3..7c403235c7e8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -168,6 +168,7 @@ struct bpf_map {
 	u32 max_entries;
 	u32 map_flags;
 	int spin_lock_off; /* >=0 valid offset, <0 error */
+	int timer_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
@@ -197,24 +198,46 @@ static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 	return map->spin_lock_off >= 0;
 }
 
-static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
+static inline bool map_value_has_timer(const struct bpf_map *map)
 {
-	if (likely(!map_value_has_spin_lock(map)))
-		return;
-	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
-		(struct bpf_spin_lock){};
+	return map->timer_off >= 0;
 }
 
-/* copy everything but bpf_spin_lock */
+static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
+{
+	if (unlikely(map_value_has_spin_lock(map)))
+		*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
+			(struct bpf_spin_lock){};
+	if (unlikely(map_value_has_timer(map)))
+		*(struct bpf_timer *)(dst + map->timer_off) =
+			(struct bpf_timer){};
+}
+
+/* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
+	u32 s_off = 0, s_sz = 0, t_off = 0, t_sz = 0;
+
 	if (unlikely(map_value_has_spin_lock(map))) {
-		u32 off = map->spin_lock_off;
+		s_off = map->spin_lock_off;
+		s_sz = sizeof(struct bpf_spin_lock);
+	} else if (unlikely(map_value_has_timer(map))) {
+		t_off = map->timer_off;
+		t_sz = sizeof(struct bpf_timer);
+	}
 
-		memcpy(dst, src, off);
-		memcpy(dst + off + sizeof(struct bpf_spin_lock),
-		       src + off + sizeof(struct bpf_spin_lock),
-		       map->value_size - off - sizeof(struct bpf_spin_lock));
+	if (unlikely(s_sz || t_sz)) {
+		if (s_off < t_off || !s_sz) {
+			swap(s_off, t_off);
+			swap(s_sz, t_sz);
+		}
+		memcpy(dst, src, t_off);
+		memcpy(dst + t_off + t_sz,
+		       src + t_off + t_sz,
+		       s_off - t_off - t_sz);
+		memcpy(dst + s_off + s_sz,
+		       src + s_off + s_sz,
+		       map->value_size - s_off - s_sz);
 	} else {
 		memcpy(dst, src, map->value_size);
 	}
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 94a0c976c90f..214fde93214b 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -99,6 +99,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
+int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 3c4105603f9d..5c84ab7f8872 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -287,6 +287,12 @@ static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key
 	return 0;
 }
 
+static void check_and_free_timer_in_array(struct bpf_array *arr, void *val)
+{
+	if (unlikely(map_value_has_timer(&arr->map)))
+		bpf_timer_cancel_and_free(val + arr->map.timer_off);
+}
+
 /* Called from syscall or from eBPF program */
 static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 				 u64 map_flags)
@@ -321,6 +327,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 			copy_map_value_locked(map, val, value, false);
 		else
 			copy_map_value(map, val, value);
+		check_and_free_timer_in_array(array, val);
 	}
 	return 0;
 }
@@ -378,10 +385,17 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
 static void array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	int i;
 
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
+	if (unlikely(map_value_has_timer(map)))
+		for (i = 0; i < array->map.max_entries; i++)
+			bpf_timer_cancel_and_free(array->value +
+						  array->elem_size * i +
+						  map->timer_off);
+
 	if (array->map.map_flags & BPF_F_MMAPABLE)
 		bpf_map_area_free(array_map_vmalloc_addr(array));
 	else
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a6e39c5ea0bf..28a8014b8379 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3046,43 +3046,92 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
-/* find 'struct bpf_spin_lock' in map value.
- * return >= 0 offset if found
- * and < 0 in case of error
- */
-int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
+				 const char *name, int sz, int align)
 {
 	const struct btf_member *member;
 	u32 i, off = -ENOENT;
 
-	if (!__btf_type_is_struct(t))
-		return -EINVAL;
-
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
 								    member->type);
 		if (!__btf_type_is_struct(member_type))
 			continue;
-		if (member_type->size != sizeof(struct bpf_spin_lock))
+		if (member_type->size != sz)
 			continue;
-		if (strcmp(__btf_name_by_offset(btf, member_type->name_off),
-			   "bpf_spin_lock"))
+		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
 			continue;
 		if (off != -ENOENT)
-			/* only one 'struct bpf_spin_lock' is allowed */
+			/* only one such field is allowed */
 			return -E2BIG;
 		off = btf_member_bit_offset(t, member);
 		if (off % 8)
 			/* valid C code cannot generate such BTF */
 			return -EINVAL;
 		off /= 8;
-		if (off % __alignof__(struct bpf_spin_lock))
-			/* valid struct bpf_spin_lock will be 4 byte aligned */
+		if (off % align)
+			return -EINVAL;
+	}
+	return off;
+}
+
+static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
+				const char *name, int sz, int align)
+{
+	const struct btf_var_secinfo *vsi;
+	u32 i, off = -ENOENT;
+
+	for_each_vsi(i, t, vsi) {
+		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
+		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
+
+		if (!__btf_type_is_struct(var_type))
+			continue;
+		if (var_type->size != sz)
+			continue;
+		if (vsi->size != sz)
+			continue;
+		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
+			continue;
+		if (off != -ENOENT)
+			/* only one such field is allowed */
+			return -E2BIG;
+		off = vsi->offset;
+		if (off % align)
 			return -EINVAL;
 	}
 	return off;
 }
 
+static int btf_find_field(const struct btf *btf, const struct btf_type *t,
+			  const char *name, int sz, int align)
+{
+
+	if (__btf_type_is_struct(t))
+		return btf_find_struct_field(btf, t, name, sz, align);
+	else if (btf_type_is_datasec(t))
+		return btf_find_datasec_var(btf, t, name, sz, align);
+	return -EINVAL;
+}
+
+/* find 'struct bpf_spin_lock' in map value.
+ * return >= 0 offset if found
+ * and < 0 in case of error
+ */
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+{
+	return btf_find_field(btf, t, "bpf_spin_lock",
+			      sizeof(struct bpf_spin_lock),
+			      __alignof__(struct bpf_spin_lock));
+}
+
+int btf_find_timer(const struct btf *btf, const struct btf_type *t)
+{
+	return btf_find_field(btf, t, "bpf_timer",
+			      sizeof(struct bpf_timer),
+			      __alignof__(struct bpf_timer));
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6f6681b07364..c885492d0a76 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -244,6 +244,17 @@ static void htab_free_elems(struct bpf_htab *htab)
 		cond_resched();
 	}
 free_elems:
+	if (unlikely(map_value_has_timer(&htab->map)))
+		for (i = 0; i < htab->map.max_entries; i++) {
+			struct htab_elem *elem;
+
+			elem = get_htab_elem(htab, i);
+			bpf_timer_cancel_and_free(elem->key +
+						  round_up(htab->map.key_size, 8) +
+						  htab->map.timer_off);
+			cond_resched();
+		}
+
 	bpf_map_area_free(htab->elems);
 }
 
@@ -265,8 +276,11 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
 	struct htab_elem *l;
 
 	if (node) {
+		u32 key_size = htab->map.key_size;
 		l = container_of(node, struct htab_elem, lru_node);
-		memcpy(l->key, key, htab->map.key_size);
+		memcpy(l->key, key, key_size);
+		check_and_init_map_value(&htab->map,
+					 l->key + round_up(key_size, 8));
 		return l;
 	}
 
@@ -694,6 +708,14 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
 	return insn - insn_buf;
 }
 
+static void check_and_free_timer(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	if (unlikely(map_value_has_timer(&htab->map)))
+		bpf_timer_cancel_and_free(elem->key +
+					  round_up(htab->map.key_size, 8) +
+					  htab->map.timer_off);
+}
+
 /* It is called from the bpf_lru_list when the LRU needs to delete
  * older elements from the htab.
  */
@@ -718,6 +740,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		if (l == tgt_l) {
 			hlist_nulls_del_rcu(&l->hash_node);
+			check_and_free_timer(htab, l);
 			break;
 		}
 
@@ -789,6 +812,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
+	check_and_free_timer(htab, l);
 	kfree(l);
 }
 
@@ -816,6 +840,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	htab_put_fd_value(htab, l);
 
 	if (htab_is_prealloc(htab)) {
+		check_and_free_timer(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		atomic_dec(&htab->count);
@@ -919,8 +944,8 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
 		}
-		check_and_init_map_lock(&htab->map,
-					l_new->key + round_up(key_size, 8));
+		check_and_init_map_value(&htab->map,
+					 l_new->key + round_up(key_size, 8));
 	}
 
 	memcpy(l_new->key, key, key_size);
@@ -1060,6 +1085,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		hlist_nulls_del_rcu(&l_old->hash_node);
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
+		else
+			check_and_free_timer(htab, l_old);
 	}
 	ret = 0;
 err:
@@ -1067,6 +1094,12 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	return ret;
 }
 
+static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	check_and_free_timer(htab, elem);
+	bpf_lru_push_free(&htab->lru, &elem->lru_node);
+}
+
 static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 				    u64 map_flags)
 {
@@ -1099,7 +1132,8 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	l_new = prealloc_lru_pop(htab, key, hash);
 	if (!l_new)
 		return -ENOMEM;
-	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
+	copy_map_value(&htab->map,
+		       l_new->key + round_up(map->key_size, 8), value);
 
 	ret = htab_lock_bucket(htab, b, hash, &flags);
 	if (ret)
@@ -1125,9 +1159,9 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	htab_unlock_bucket(htab, b, hash, flags);
 
 	if (ret)
-		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
+		htab_lru_push_free(htab, l_new);
 	else if (l_old)
-		bpf_lru_push_free(&htab->lru, &l_old->lru_node);
+		htab_lru_push_free(htab, l_old);
 
 	return ret;
 }
@@ -1332,7 +1366,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	htab_unlock_bucket(htab, b, hash, flags);
 	if (l)
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 	return ret;
 }
 
@@ -1449,7 +1483,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 			else
 				copy_map_value(map, value, l->key +
 					       roundup_key_size);
-			check_and_init_map_lock(map, value);
+			check_and_init_map_value(map, value);
 		}
 
 		hlist_nulls_del_rcu(&l->hash_node);
@@ -1460,7 +1494,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 	htab_unlock_bucket(htab, b, hash, bflags);
 
 	if (is_lru_map && l)
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 
 	return ret;
 }
@@ -1638,7 +1672,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 						      true);
 			else
 				copy_map_value(map, dst_val, value);
-			check_and_init_map_lock(map, dst_val);
+			check_and_init_map_value(map, dst_val);
 		}
 		if (do_delete) {
 			hlist_nulls_del_rcu(&l->hash_node);
@@ -1665,7 +1699,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	while (node_to_free) {
 		l = node_to_free;
 		node_to_free = node_to_free->batch_flink;
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 	}
 
 next_batch:
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 3a693d451ca3..b8df592c33cc 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1072,7 +1072,7 @@ BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flag
 		goto out;
 	}
 	t->callback_fn = cb;
-	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
+	t->value = (void *)timer - map->timer_off;
 	t->map = map;
 	t->prog = prog;
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index bd11db9774c3..95d70a08325d 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -173,7 +173,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key,
 		return -ENOMEM;
 
 	memcpy(&new->data[0], value, map->value_size);
-	check_and_init_map_lock(map, new->data);
+	check_and_init_map_value(map, new->data);
 
 	new = xchg(&storage->buf, new);
 	kfree_rcu(new, rcu);
@@ -509,7 +509,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
 						    map->numa_node);
 		if (!storage->buf)
 			goto enomem;
-		check_and_init_map_lock(map, storage->buf->data);
+		check_and_init_map_value(map, storage->buf->data);
 	} else {
 		storage->percpu_buf = bpf_map_alloc_percpu(map, size, 8, gfp);
 		if (!storage->percpu_buf)
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 39ab0b68cade..2f961b941159 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -50,6 +50,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->map_flags = inner_map->map_flags;
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
+	inner_map_meta->timer_off = inner_map->timer_off;
 
 	/* Misc members not needed in bpf_map_meta_equal() check. */
 	inner_map_meta->ops = inner_map->ops;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 50457019da27..0e53f9e10e1a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -259,8 +259,8 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 				copy_map_value_locked(map, value, ptr, true);
 			else
 				copy_map_value(map, value, ptr);
-			/* mask lock, since value wasn't zero inited */
-			check_and_init_map_lock(map, value);
+			/* mask lock and timer, since value wasn't zero inited */
+			check_and_init_map_value(map, value);
 		}
 		rcu_read_unlock();
 	}
@@ -622,7 +622,8 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	struct bpf_map *map = filp->private_data;
 	int err;
 
-	if (!map->ops->map_mmap || map_value_has_spin_lock(map))
+	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
+	    map_value_has_timer(map))
 		return -ENOTSUPP;
 
 	if (!(vma->vm_flags & VM_SHARED))
@@ -792,6 +793,16 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		}
 	}
 
+	map->timer_off = btf_find_timer(btf, value_type);
+	if (map_value_has_timer(map)) {
+		if (map->map_flags & BPF_F_RDONLY_PROG)
+			return -EACCES;
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY)
+			return -ENOTSUPP;
+	}
+
 	if (map->ops->map_check_btf)
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
 
@@ -843,6 +854,7 @@ static int map_create(union bpf_attr *attr)
 	mutex_init(&map->freeze_mutex);
 
 	map->spin_lock_off = -EINVAL;
+	map->timer_off = -EINVAL;
 	if (attr->btf_key_type_id || attr->btf_value_type_id ||
 	    /* Even the map's value is a kernel's struct,
 	     * the bpf_prog.o must have BTF to begin with
@@ -1590,7 +1602,8 @@ static int map_freeze(const union bpf_attr *attr)
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
-	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
+	    map_value_has_timer(map)) {
 		fdput(f);
 		return -ENOTSUPP;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 44ec9760b562..0ddafa91d0d3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 	}
+	if (map_value_has_timer(map)) {
+		u32 t = map->timer_off;
+
+		if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
+		     t < reg->umax_value + off + size) {
+			verbose(env, "bpf_timer cannot be accessed directly by load/store\n");
+			return -EACCES;
+		}
+	}
 	return err;
 }
 
@@ -4675,9 +4684,24 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 			map->name);
 		return -EINVAL;
 	}
-	if (val) {
-		/* This restriction will be removed in the next patch */
-		verbose(env, "bpf_timer field can only be first in the map value element\n");
+	if (!map_value_has_timer(map)) {
+		if (map->timer_off == -E2BIG)
+			verbose(env,
+				"map '%s' has more than one 'struct bpf_timer'\n",
+				map->name);
+		else if (map->timer_off == -ENOENT)
+			verbose(env,
+				"map '%s' doesn't have 'struct bpf_timer'\n",
+				map->name);
+		else
+			verbose(env,
+				"map '%s' is not a struct type or bpf_timer is mangled\n",
+				map->name);
+		return -EINVAL;
+	}
+	if (map->timer_off != val + reg->off) {
+		verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
+			val + reg->off, map->timer_off);
 		return -EINVAL;
 	}
 	WARN_ON(meta->map_ptr);
-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 bpf-next 3/3] selftests/bpf: Add bpf_timer test.
  2021-06-11  4:24 [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
  2021-06-11  4:24 ` [PATCH v2 bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
@ 2021-06-11  4:24 ` Alexei Starovoitov
  2021-06-11  6:47 ` [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Cong Wang
  3 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-11  4:24 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add bpf_timer test that creates timers in preallocated and
non-preallocated hash, in array and in lru maps.
Let array timer expire once and then re-arm it for 35 seconds.
Arm lru timer into the same callback.
Then arm and re-arm hash timers 10 times each.
At the last invocation of prealloc hash timer cancel the array timer.
Force timer free via LRU eviction and direct bpf_map_delete_elem.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/timer.c  |  55 ++++
 tools/testing/selftests/bpf/progs/timer.c     | 293 ++++++++++++++++++
 2 files changed, 348 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
new file mode 100644
index 000000000000..25f40e1b9967
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "timer.skel.h"
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	__u32 duration = 0, retval;
+
+	err = timer__attach(timer_skel);
+	if (!ASSERT_OK(err, "timer_attach"))
+		return err;
+
+	ASSERT_EQ(timer_skel->data->callback_check, 52, "callback_check1");
+	ASSERT_EQ(timer_skel->data->callback2_check, 52, "callback2_check1");
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, NULL, &retval, &duration);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(retval, 0, "test_run");
+	timer__detach(timer_skel);
+
+	usleep(50); /* 10 usecs should be enough, but give it extra */
+	/* check that timer_cb1() was executed 10+10 times */
+	ASSERT_EQ(timer_skel->data->callback_check, 42, "callback_check2");
+	ASSERT_EQ(timer_skel->data->callback2_check, 42, "callback2_check2");
+
+	/* check that timer_cb2() was executed twice */
+	ASSERT_EQ(timer_skel->bss->bss_data, 10, "bss_data");
+
+	/* check that there were no errors in timer execution */
+	ASSERT_EQ(timer_skel->bss->err, 0, "err");
+
+	/* check that code paths completed */
+	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
+
+	return 0;
+}
+
+void test_timer(void)
+{
+	struct timer *timer_skel = NULL;
+	int err;
+
+	timer_skel = timer__open_and_load();
+	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
+		goto cleanup;
+
+	err = timer(timer_skel);
+	ASSERT_OK(err, "timer");
+cleanup:
+	timer__destroy(timer_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
new file mode 100644
index 000000000000..665e13b2fa64
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <linux/bpf.h>
+#include <time.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+struct hmap_elem {
+	int counter;
+	struct bpf_timer timer;
+	struct bpf_spin_lock lock; /* unused */
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, struct hmap_elem);
+} hmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, struct hmap_elem);
+} hmap_malloc SEC(".maps");
+
+struct elem {
+	struct bpf_timer t;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, struct elem);
+} array SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LRU_HASH);
+	__uint(max_entries, 4);
+	__type(key, int);
+	__type(value, struct elem);
+} lru SEC(".maps");
+
+__u64 bss_data;
+__u64 err;
+__u64 ok;
+__u64 callback_check = 52;
+__u64 callback2_check = 52;
+
+#define ARRAY 1
+#define HTAB 2
+#define HTAB_MALLOC 3
+#define LRU 4
+
+/* callback for array and lru timers */
+static int timer_cb1(void *map, int *key, struct bpf_timer *timer)
+{
+	/* increment bss variable twice.
+	 * Once via array timer callback and once via lru timer callback
+	 */
+	bss_data += 5;
+
+	/* *key == 0 - the callback was called for array timer.
+	 * *key == 3 - the callback was called from lru timer.
+	 */
+	if (*key == ARRAY) {
+		struct bpf_timer *lru_timer;
+		int lru_key = LRU;
+
+		/* rearm array timer to be called again in ~35 seconds */
+		if (bpf_timer_start(timer, 1ull << 35) != 0)
+			err |= 1;
+
+		lru_timer = bpf_map_lookup_elem(&lru, &lru_key);
+		if (!lru_timer)
+			return 0;
+		if (bpf_timer_start(lru_timer, 0) != 0)
+			err |= 2;
+	} else if (*key == LRU) {
+		int lru_key, i;
+
+		for (i = LRU + 1;
+		     i <= 100  /* for current LRU eviction algorithm this number
+				* should be larger than ~ lru->max_entries * 2
+				*/;
+		     i++) {
+			struct elem init = {};
+
+			/* lru_key cannot be used as loop induction variable
+			 * otherwise the loop will be unbounded.
+			 */
+			lru_key = i;
+
+			/* add more elements into lru map to push out current
+			 * element and force deletion of this timer
+			 */
+			bpf_map_update_elem(map, &lru_key, &init, 0);
+			/* look it up to bump it into active list */
+			bpf_map_lookup_elem(map, &lru_key);
+
+			/* keep adding until *key changes underneath,
+			 * which means that key/timer memory was reused
+			 */
+			if (*key != LRU)
+				break;
+		}
+
+		/* check that the timer was removed */
+		if (bpf_timer_cancel(timer) != -EINVAL)
+			err |= 4;
+		ok |= 1;
+	}
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(test1, int a)
+{
+	struct bpf_timer *arr_timer, *lru_timer;
+	struct elem init = {};
+	int lru_key = LRU;
+	int array_key = ARRAY;
+
+	arr_timer = bpf_map_lookup_elem(&array, &array_key);
+	if (!arr_timer)
+		return 0;
+	bpf_timer_init(arr_timer, timer_cb1, CLOCK_MONOTONIC);
+
+	bpf_map_update_elem(&lru, &lru_key, &init, 0);
+	lru_timer = bpf_map_lookup_elem(&lru, &lru_key);
+	if (!lru_timer)
+		return 0;
+	bpf_timer_init(lru_timer, timer_cb1, CLOCK_MONOTONIC);
+
+	bpf_timer_start(arr_timer, 0 /* call timer_cb1 asap */);
+
+	/* init more timers to check that array destruction
+	 * doesn't leak timer memory.
+	 */
+	array_key = 0;
+	arr_timer = bpf_map_lookup_elem(&array, &array_key);
+	if (!arr_timer)
+		return 0;
+	bpf_timer_init(arr_timer, timer_cb1, CLOCK_MONOTONIC);
+	return 0;
+}
+
+/* callback for prealloc and non-prealloca hashtab timers */
+static int timer_cb2(void *map, int *key, struct hmap_elem *val)
+{
+	if (*key == HTAB)
+		callback_check--;
+	else
+		callback2_check--;
+	if (val->counter > 0 && --val->counter) {
+		/* re-arm the timer again to execute after 1 usec */
+		bpf_timer_start(&val->timer, 1000);
+	} else if (*key == HTAB) {
+		struct bpf_timer *arr_timer;
+		int array_key = ARRAY;
+
+		/* cancel arr_timer otherwise bpf_fentry_test1 prog
+		 * will stay alive forever.
+		 */
+		arr_timer = bpf_map_lookup_elem(&array, &array_key);
+		if (!arr_timer)
+			return 0;
+		if (bpf_timer_cancel(arr_timer) != 1)
+			/* bpf_timer_cancel should return 1 to indicate
+			 * that arr_timer was active at this time
+			 */
+			err |= 8;
+
+		/* try to cancel ourself. It shouldn't deadlock. */
+		if (bpf_timer_cancel(&val->timer) != -EDEADLK)
+			err |= 16;
+
+		/* delete this key and this timer anyway.
+		 * It shouldn't deadlock either.
+		 */
+		bpf_map_delete_elem(map, key);
+
+		/* in preallocated hashmap both 'key' and 'val' could have been
+		 * reused to store another map element (like in LRU above),
+		 * but in controlled test environment the below test works.
+		 * It's not a use-after-free. The memory is owned by the map.
+		 */
+		if (bpf_timer_start(&val->timer, 1000) != -EINVAL)
+			err |= 32;
+		ok |= 2;
+	} else {
+		if (*key != HTAB_MALLOC)
+			err |= 64;
+
+		/* try to cancel ourself. It shouldn't deadlock. */
+		if (bpf_timer_cancel(&val->timer) != -EDEADLK)
+			err |= 128;
+
+		/* delete this key and this timer anyway.
+		 * It shouldn't deadlock either.
+		 */
+		bpf_map_delete_elem(map, key);
+
+		/* in non-preallocated hashmap both 'key' and 'val' are RCU
+		 * protected and still valid though this element was deleted
+		 * from the map. Arm this timer for ~35 seconds. When callback
+		 * finishes the call_rcu will invoke:
+		 * htab_elem_free_rcu
+		 *   check_and_free_timer
+		 *     bpf_timer_cancel_and_free
+		 * to cancel this 35 second sleep and delete the timer for real.
+		 */
+		if (bpf_timer_start(&val->timer, 1ull << 35) != 0)
+			err |= 256;
+		ok |= 4;
+	}
+	return 0;
+}
+
+int bpf_timer_test(void)
+{
+	struct hmap_elem *val;
+	int key = HTAB, key_malloc = HTAB_MALLOC;
+
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val) {
+		if (bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME) != 0)
+			err |= 512;
+		bpf_timer_start(&val->timer, 1000);
+	}
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val) {
+		if (bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME) != 0)
+			err |= 1024;
+		bpf_timer_start(&val->timer, 1000);
+	}
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test2")
+int BPF_PROG(test2, int a, int b)
+{
+	struct hmap_elem init = {}, *val;
+	int key = HTAB, key_malloc = HTAB_MALLOC;
+
+	init.counter = 10; /* number of times to trigger timer_cb1 */
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val)
+		bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME);
+	/* update the same key to free the timer */
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val)
+		bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME);
+	/* update the same key to free the timer */
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+
+	/* init more timers to check that htab operations
+	 * don't leak timer memory.
+	 */
+	key = 0;
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val)
+		bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME);
+	bpf_map_delete_elem(&hmap, &key);
+	bpf_map_update_elem(&hmap, &key, &init, 0);
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val)
+		bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME);
+
+	/* and with non-prealloc htab */
+	key_malloc = 0;
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val)
+		bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME);
+	bpf_map_delete_elem(&hmap_malloc, &key_malloc);
+	bpf_map_update_elem(&hmap_malloc, &key_malloc, &init, 0);
+	val = bpf_map_lookup_elem(&hmap_malloc, &key_malloc);
+	if (val)
+		bpf_timer_init(&val->timer, timer_cb2, CLOCK_BOOTTIME);
+
+	return bpf_timer_test();
+}
-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
@ 2021-06-11  6:42   ` Cong Wang
  2021-06-11 18:45     ` Alexei Starovoitov
  2021-06-11  7:05   ` Cong Wang
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Cong Wang @ 2021-06-11  6:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, Andrii Nakryiko,
	Linux Kernel Network Developers, bpf, kernel-team

On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:

Can be or has to be? Huge difference here.

In the other thread, you said it is global data, which implies that it does
not have to be in a map.

In your test case or your example, all timers are still in a map. So what has
changed since then? Looks nothing to me.

>
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
>

Nice trick but...

[...]

> +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> +{
> +       struct bpf_hrtimer *t;
> +       int ret = 0;
> +
> +       ____bpf_spin_lock(&timer->lock);
> +       t = timer->timer;
> +       if (!t) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> +               /* If the timer wasn't active or callback already executing
> +                * bump the prog refcnt to keep it alive until
> +                * callback is invoked (again).
> +                */
> +               bpf_prog_inc(t->prog);

Hmm, finally you begin refcounting it, which you were strongly against. ;)

Three questions:

1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
If the timer subprog is always in the same prog which installs it, then
this is fine. But then how do multiple programs share a timer? In the
case of conntrack, either ingress or egress could install the timer,
it only depends which one gets traffic first. Do they have to copy
the same subprog for the same timer?

2. Can t->prog be freed between a timer expiration and bpf_timer_start()
again? It gets a refcnt when starting a timer and puts it when cancelling
or expired, so t->prog can be freed right after cancelling or expired. What
if another program which shares this timer wants to restart this timer?

3. Since you offer no API to read the expiration time, why not just use
BPF_TEST_RUN with a user-space timer? This is preferred by Andrii.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers.
  2021-06-11  4:24 [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2021-06-11  4:24 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add bpf_timer test Alexei Starovoitov
@ 2021-06-11  6:47 ` Cong Wang
  3 siblings, 0 replies; 22+ messages in thread
From: Cong Wang @ 2021-06-11  6:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, Andrii Nakryiko,
	Linux Kernel Network Developers, bpf, kernel-team

On Thu, Jun 10, 2021 at 9:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> v1->v2:
> - Addressed great feedback from Andrii and Toke.
> - Fixed race between parallel bpf_timer_*() ops.
> - Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
> - Disallowed mmap and global timers.
> - Allow bpf_spin_lock and bpf_timer in an element. One of each.
> - Fixed memory leaks due to map destruction and LRU eviction.
> - Add support for specifying clockid in bpf_timer_init.
> - Make bpf_timer helpers gpl_only.
> - Fix key pointer in callback_fn when bpf_timer is inside array.
> - A ton more tests.
>
> The 1st patch implements interaction between bpf programs and bpf core.
> The 2nd patch implements necessary safety checks.
> The 3rd patch is the test.

What is your use case to justify your own code? Asking because
you deny mine, so clearly my use case is not yours.

And more importantly, why not just use BPF_TEST_RUN with
a user-space timer? Asking because you offer no API to read or
modify timer expiration, so literally the same with BPF_TEST_RUN
approach.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
  2021-06-11  6:42   ` Cong Wang
@ 2021-06-11  7:05   ` Cong Wang
  2021-06-11 22:12   ` Yonghong Song
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Cong Wang @ 2021-06-11  7:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, Andrii Nakryiko,
	Linux Kernel Network Developers, bpf, kernel-team

On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);

Another unpopular point of view:

This init() is not suitable for bpf programs, because unlike kernel modules,
there is no init or exit functions for a bpf program. And timer init
is typically
called during module init.

(bpf spinlock is different, because it can be simply initialized as UNLOCKED.)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  6:42   ` Cong Wang
@ 2021-06-11 18:45     ` Alexei Starovoitov
  2021-06-15  6:10       ` Cong Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-11 18:45 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Daniel Borkmann, Andrii Nakryiko,
	Linux Kernel Network Developers, bpf, kernel-team

On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>

Please stick to one email thread in the future, ok?

I'll consolidate them here:

> What is your use case to justify your own code? Asking because
> you deny mine, so clearly my use case is not yours.

I mentioned several use cases in the prior threads.
tldr: any periodic event in tracing, networking, security.
Garbage collection falls into this category as well, but please internalize
that implementing conntrack as it is today in the kernel is an explicit non-goal.

> And more importantly, why not just use BPF_TEST_RUN with
> a user-space timer? Asking because you offer no API to read or
> modify timer expiration, so literally the same with BPF_TEST_RUN
> approach.

a wrapper on top of hrtimer_get_remaining() like bpf_timer_get_remaining()
is trivial to add, but what is the use case?

> >
> > Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> > in hash/array/lru maps as regular field and helpers to operate on it:
> 
> Can be or has to be? Huge difference here.

map elements don't have to use timers.

> In the other thread, you said it is global data, which implies that it does
> not have to be in a map.

global data is a map. That was explained in the prior thread as well.

> 
> In your test case or your example, all timers are still in a map. So what has
> changed since then? Looks nothing to me.

look again?

> Hmm, finally you begin refcounting it, which you were strongly against. ;)

That was already answered in the prior thread.
tldr: there were two options. This is one of them. Another can be added
in the future as well.

> Three questions:
> 
> 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?

yes.

> If the timer subprog is always in the same prog which installs it, then

installs it? I'm not following the quesiton.

> this is fine. But then how do multiple programs share a timer? 

there is only one callback function.

> In the
> case of conntrack, either ingress or egress could install the timer,
> it only depends which one gets traffic first. Do they have to copy
> the same subprog for the same timer?

conntrack is an explicit non-goal.

> 
> 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> again? 

If it's already armed with the first bpf_timer_start() it won't be freed.

> It gets a refcnt when starting a timer and puts it when cancelling
> or expired, so t->prog can be freed right after cancelling or expired. What
> if another program which shares this timer wants to restart this timer?

There is only one callback_fn per timer. Another program can share
the struct bpf_timer and the map. It might have subprog callback_fn code
that looks exactly the same as callback_fn in the first prog.
For example when libbpf loads progs/timer.c (after it was compiled into .o)
it might share a subprog in the future (when kernel has support for
dynamic linking). From bpf user pov it's a single .c file.
The split into programs and subprograms is an implemenation detail
that C programmer doesn't need to worry about.

> 3. Since you offer no API to read the expiration time, why not just use
> BPF_TEST_RUN with a user-space timer? This is preferred by Andrii.

Andrii point was that there should be no syscall cmds that replicate
bpf_timer_init/start/cancel helpers. I agree with this.


> Thanks.
>
> Another unpopular point of view:
>
> This init() is not suitable for bpf programs, because unlike kernel modules,
> there is no init or exit functions for a bpf program. And timer init
> is typically
> called during module init.

Already answerd this in the prior thread. There will be __init and __fini like
subprograms in bpf progs.

Please apply the patches to your local tree and do few experiments based
on selftests/bpf/progs/timer.c. I think experimenting with the code
will answer all of your questions.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
  2021-06-11  6:42   ` Cong Wang
  2021-06-11  7:05   ` Cong Wang
@ 2021-06-11 22:12   ` Yonghong Song
  2021-06-15  3:33     ` Alexei Starovoitov
  2021-06-14 16:51   ` Yonghong Song
  2021-06-15  4:48   ` Andrii Nakryiko
  4 siblings, 1 reply; 22+ messages in thread
From: Yonghong Song @ 2021-06-11 22:12 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/10/21 9:24 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
> 
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
> 
> // Start the timer and set its expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, u64 nsec);
> 
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
> 
> Here is how BPF program might look like:
> struct map_elem {
>      int counter;
>      struct bpf_timer timer;
> };
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_HASH);
>      __uint(max_entries, 1000);
>      __type(key, int);
>      __type(value, struct map_elem);
> } hmap SEC(".maps");
> 
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
> 
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>      struct map_elem *val;
>      int key = 0;
> 
>      val = bpf_map_lookup_elem(&hmap, &key);
>      if (val) {
>          bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
>          bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
>      }
> }
> 
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
> 
> Only programs with CAP_BPF are allowed to use bpf_timer.
> 
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
> 
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
> 
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   2 +
>   include/uapi/linux/bpf.h       |  40 ++++++
>   kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          | 109 ++++++++++++++++
>   kernel/trace/bpf_trace.c       |   2 +-
>   scripts/bpf_doc.py             |   2 +
>   tools/include/uapi/linux/bpf.h |  40 ++++++
>   7 files changed, 421 insertions(+), 1 deletion(-)
> 
[...]
>   
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2c1ba70abbf1..d25bbcdad8e6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4778,6 +4778,38 @@ union bpf_attr {
>    * 		Execute close syscall for given FD.
>    * 	Return
>    * 		A syscall result.
> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *	Description
> + *		Initialize the timer to call *callback_fn* static function.
> + *		First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
> + *		CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> + *		All other bits of *flags* are reserved.
> + *	Return
> + *		0 on success.
> + *		**-EBUSY** if *timer* is already initialized.
> + *		**-EINVAL** if invalid *flags* are passed.
> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *	Description
> + *		Start the timer and set its expiration N nanoseconds from the
> + *		current time. The timer callback_fn will be invoked in soft irq
> + *		context on some cpu and will not repeat unless another
> + *		bpf_timer_start() is made. In such case the next invocation can
> + *		migrate to a different cpu.
> + *	Return
> + *		0 on success.
> + *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *	Description
> + *		Cancel the timer and wait for callback_fn to finish if it was running.
> + *	Return
> + *		0 if the timer was not active.
> + *		1 if the timer was active.
> + *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
> + *		which would have led to a deadlock otherwise.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -4949,6 +4981,9 @@ union bpf_attr {
>   	FN(sys_bpf),			\
>   	FN(btf_find_by_name_kind),	\
>   	FN(sys_close),			\
> +	FN(timer_init),			\
> +	FN(timer_start),		\
> +	FN(timer_cancel),		\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6061,6 +6096,11 @@ struct bpf_spin_lock {
>   	__u32	val;
>   };
>   
> +struct bpf_timer {
> +	__u64 :64;
> +	__u64 :64;
> +};
> +
>   struct bpf_sysctl {
>   	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
>   				 * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 544773970dbc..3a693d451ca3 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -985,6 +985,227 @@ const struct bpf_func_proto bpf_snprintf_proto = {
>   	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
>   };
>   
> +struct bpf_hrtimer {
> +	struct hrtimer timer;
> +	struct bpf_map *map;
> +	struct bpf_prog *prog;
> +	void *callback_fn;
> +	void *value;
> +};
> +
> +/* the actual struct hidden inside uapi struct bpf_timer */
> +struct bpf_timer_kern {
> +	struct bpf_hrtimer *timer;
> +	struct bpf_spin_lock lock;
> +};

Looks like in 32bit system, sizeof(struct bpf_timer_kern) is 64
and sizeof(struct bpf_timer) is 128.

struct bpf_spin_lock {
         __u32   val;
};

struct bpf_timer {
	__u64 :64;
	__u64 :64;
};

Checking the code, we may not have issues as structure
"bpf_timer" is only used to reserve spaces and
map copy value routine handles that properly.

Maybe we can still make it consistent with
two fields in bpf_timer_kern mapping to
two fields in bpf_timer?

struct bpf_timer_kern {
	__bpf_md_ptr(struct bpf_hrtimer *, timer);
	struct bpf_spin_lock lock;
};

> +
> +static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> +
> +static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> +{
> +	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> +	struct bpf_prog *prog = t->prog;
> +	struct bpf_map *map = t->map;
> +	void *key;
> +	u32 idx;
> +	int ret;
> +
> +	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
> +	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
> +	 * Remember the timer this callback is servicing to prevent
> +	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +	 */
> +	this_cpu_write(hrtimer_running, t);
> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +		/* compute the key */
> +		idx = ((char *)t->value - array->value) / array->elem_size;
> +		key = &idx;
> +	} else { /* hash or lru */
> +		key = t->value - round_up(map->key_size, 8);
> +	}
> +
> +	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> +					    (u64)(long)key,
> +					    (u64)(long)t->value, 0, 0);
> +	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> +
> +	/* The bpf function finished executed. Drop the prog refcnt.
> +	 * It could reach zero here and trigger free of bpf_prog
> +	 * and subsequent free of the maps that were holding timers.
> +	 * If callback_fn called bpf_timer_start on this timer
> +	 * the prog refcnt will be > 0.
> +	 *
> +	 * If callback_fn deleted map element the 't' could have been freed,
> +	 * hence t->prog deref is done earlier.
> +	 */
> +	bpf_prog_put(prog);
> +	this_cpu_write(hrtimer_running, NULL);
> +	return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> +	   struct bpf_map *, map, struct bpf_prog *, prog)
> +{
[...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
                     ` (2 preceding siblings ...)
  2021-06-11 22:12   ` Yonghong Song
@ 2021-06-14 16:51   ` Yonghong Song
  2021-06-15  3:29     ` Alexei Starovoitov
  2021-06-15  4:48   ` Andrii Nakryiko
  4 siblings, 1 reply; 22+ messages in thread
From: Yonghong Song @ 2021-06-14 16:51 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team



On 6/10/21 9:24 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
> 
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
> 
> // Start the timer and set its expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, u64 nsec);
> 
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
> 
> Here is how BPF program might look like:
> struct map_elem {
>      int counter;
>      struct bpf_timer timer;
> };
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_HASH);
>      __uint(max_entries, 1000);
>      __type(key, int);
>      __type(value, struct map_elem);
> } hmap SEC(".maps");
> 
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
> 
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>      struct map_elem *val;
>      int key = 0;
> 
>      val = bpf_map_lookup_elem(&hmap, &key);
>      if (val) {
>          bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
>          bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
>      }
> }
> 
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
> 
> Only programs with CAP_BPF are allowed to use bpf_timer.
> 
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
> 
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
> 
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   2 +
>   include/uapi/linux/bpf.h       |  40 ++++++
>   kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          | 109 ++++++++++++++++
>   kernel/trace/bpf_trace.c       |   2 +-
>   scripts/bpf_doc.py             |   2 +
>   tools/include/uapi/linux/bpf.h |  40 ++++++
>   7 files changed, 421 insertions(+), 1 deletion(-)
> 
[...]
> +
> +static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> +{
> +	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> +	struct bpf_prog *prog = t->prog;
> +	struct bpf_map *map = t->map;
> +	void *key;
> +	u32 idx;
> +	int ret;
> +
> +	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
> +	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
> +	 * Remember the timer this callback is servicing to prevent
> +	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +	 */
> +	this_cpu_write(hrtimer_running, t);
> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +		/* compute the key */
> +		idx = ((char *)t->value - array->value) / array->elem_size;
> +		key = &idx;
> +	} else { /* hash or lru */
> +		key = t->value - round_up(map->key_size, 8);
> +	}
> +
> +	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> +					    (u64)(long)key,
> +					    (u64)(long)t->value, 0, 0);
> +	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */

I didn't find that next patch disallows callback return value 1 in the 
verifier. If we indeed disallows return value 1 in the verifier. We
don't need WARN_ON here. Did I miss anything?

> +
> +	/* The bpf function finished executed. Drop the prog refcnt.
> +	 * It could reach zero here and trigger free of bpf_prog
> +	 * and subsequent free of the maps that were holding timers.
> +	 * If callback_fn called bpf_timer_start on this timer
> +	 * the prog refcnt will be > 0.
> +	 *
> +	 * If callback_fn deleted map element the 't' could have been freed,
> +	 * hence t->prog deref is done earlier.
> +	 */
> +	bpf_prog_put(prog);
> +	this_cpu_write(hrtimer_running, NULL);
> +	return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> +	   struct bpf_map *, map, struct bpf_prog *, prog)
> +{
> +	clockid_t clockid = flags & (MAX_CLOCKS - 1);
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	BUILD_BUG_ON(MAX_CLOCKS != 16);
> +	if (flags >= MAX_CLOCKS ||
> +	    /* similar to timerfd except _ALARM variants are not supported */
> +            (clockid != CLOCK_MONOTONIC &&
> +             clockid != CLOCK_REALTIME &&
> +             clockid != CLOCK_BOOTTIME))
> +		return -EINVAL;
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (t) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> +	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> +	if (!t) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	t->callback_fn = cb;
> +	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
> +	t->map = map;
> +	t->prog = prog;
> +	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> +	t->timer.function = bpf_timer_cb;
> +	timer->timer = t;
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_init_proto = {
> +	.func		= bpf_timer_init,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_PTR_TO_FUNC,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> +{
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> +		/* If the timer wasn't active or callback already executing
> +		 * bump the prog refcnt to keep it alive until
> +		 * callback is invoked (again).
> +		 */
> +		bpf_prog_inc(t->prog);

I am not 100% sure. But could we have race condition here?
    cpu 1: running bpf_timer_start() helper call
    cpu 2: doing hrtimer work (calling callback etc.)

Is it possible that
   !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
may be true and then right before bpf_prog_inc(t->prog), it becomes 
true? If hrtimer_callback_running() is called, it is possible that
callback function could have dropped the reference count for t->prog,
so we could already go into the body of the function
__bpf_prog_put()?

static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
{
         if (atomic64_dec_and_test(&prog->aux->refcnt)) {
                 perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0);
                 bpf_audit_prog(prog, BPF_AUDIT_UNLOAD);
                 /* bpf_prog_free_id() must be called first */
                 bpf_prog_free_id(prog, do_idr_lock);
                 __bpf_prog_put_noref(prog, true);
         }
}


> +	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_start_proto = {
> +	.func		= bpf_timer_start,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> +{
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (this_cpu_read(hrtimer_running) == t) {
> +		/* If bpf callback_fn is trying to bpf_timer_cancel()
> +		 * its own timer the hrtimer_cancel() will deadlock
> +		 * since it waits for callback_fn to finish
> +		 */
> +		ret = -EDEADLK;
> +		goto out;
> +	}
> +	/* Cancel the timer and wait for associated callback to finish
> +	 * if it was running.
> +	 */
> +	if (hrtimer_cancel(&t->timer) == 1) {

Again, could we have race here between bpf program and hrtimer_cancel()?

> +		/* If the timer was active then drop the prog refcnt,
> +		 * since callback will not be invoked.
> +		 */
> +		bpf_prog_put(t->prog);
> +		ret = 1;
> +	}
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_cancel_proto = {
> +	.func		= bpf_timer_cancel,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +};
> +
> +/* This function is called by delete_element in htab and lru maps
> + * and by map_free for array, lru, htab maps.
> + */
> +void bpf_timer_cancel_and_free(void *val)
> +{
> +	struct bpf_timer_kern *timer = val;
> +	struct bpf_hrtimer *t;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t)
> +		goto out;
> +	/* Cancel the timer and wait for callback to complete if it was
> +	 * running. Only individual delete_element in htab or lru maps can
> +	 * return 1 from hrtimer_cancel.
> +	 * The whole map is destroyed when its refcnt reaches zero.
> +	 * That happens after bpf prog refcnt reaches zero.
> +	 * bpf prog refcnt will not reach zero until all timers are executed.
> +	 * So when maps are destroyed hrtimer_cancel will surely return 0.
> +	 * In such case t->prog is a pointer to freed memory.
> +	 *
> +	 * When htab or lru is deleting individual element check that
> +	 * bpf_map_delete_elem() isn't trying to delete elem with running timer.
> +	 * In such case don't call hrtimer_cancel() (since it will deadlock)
> +	 * and don't call hrtimer_try_to_cancel() (since it will just return -1).
> +	 * Instead free the timer and set timer->timer = NULL.
> +	 * The subsequent bpf_timer_start/cancel() helpers won't be able to use it.
> +	 * In preallocated maps it's safe to do timer->timer = NULL.
> +	 * The memory could be reused for another element while current timer
> +	 * callback can still do bpf_timer_init() on it.
> +	 * In non-preallocated maps timer->timer = NULL will happen after
> +	 * callback completes, since prog execution is an RCU critical section.
> +	 */
> +	if (this_cpu_read(hrtimer_running) != t &&
> +	    hrtimer_cancel(&t->timer) == 1)
> +		bpf_prog_put(t->prog);
> +	kfree(t);
> +	timer->timer = NULL;
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +}
> +
>   const struct bpf_func_proto bpf_get_current_task_proto __weak;
>   const struct bpf_func_proto bpf_probe_read_user_proto __weak;
>   const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
> @@ -1051,6 +1272,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   		return &bpf_per_cpu_ptr_proto;
>   	case BPF_FUNC_this_cpu_ptr:
>   		return &bpf_this_cpu_ptr_proto;
> +	case BPF_FUNC_timer_init:
> +		return &bpf_timer_init_proto;
> +	case BPF_FUNC_timer_start:
> +		return &bpf_timer_start_proto;
> +	case BPF_FUNC_timer_cancel:
> +		return &bpf_timer_cancel_proto;
>   	default:
>   		break;
>   	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1de4b8c6ee42..44ec9760b562 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4656,6 +4656,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
>   	return 0;
>   }
>   
> +static int process_timer_func(struct bpf_verifier_env *env, int regno,
> +			      struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +	bool is_const = tnum_is_const(reg->var_off);
> +	struct bpf_map *map = reg->map_ptr;
> +	u64 val = reg->var_off.value;
> +
> +	if (!is_const) {
> +		verbose(env,
> +			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
> +			regno);
> +		return -EINVAL;
> +	}
> +	if (!map->btf) {
> +		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
> +			map->name);
> +		return -EINVAL;
> +	}
> +	if (val) {
> +		/* This restriction will be removed in the next patch */
> +		verbose(env, "bpf_timer field can only be first in the map value element\n");
> +		return -EINVAL;
> +	}
> +	WARN_ON(meta->map_ptr);

Could you explain when this could happen?

> +	meta->map_ptr = map;
> +	return 0;
> +}
> +
>   static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
>   {
>   	return type == ARG_PTR_TO_MEM ||
> @@ -4788,6 +4817,7 @@ static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
>   static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
>   static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
>   static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
> +static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
>   
[...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-14 16:51   ` Yonghong Song
@ 2021-06-15  3:29     ` Alexei Starovoitov
  2021-06-15  5:31       ` Andrii Nakryiko
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-15  3:29 UTC (permalink / raw)
  To: Yonghong Song
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > +                                         (u64)(long)key,
> > +                                         (u64)(long)t->value, 0, 0);
> > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
>
> I didn't find that next patch disallows callback return value 1 in the
> verifier. If we indeed disallows return value 1 in the verifier. We
> don't need WARN_ON here. Did I miss anything?

Ohh. I forgot to address this bit in the verifier. Will fix.

> > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > +             /* If the timer wasn't active or callback already executing
> > +              * bump the prog refcnt to keep it alive until
> > +              * callback is invoked (again).
> > +              */
> > +             bpf_prog_inc(t->prog);
>
> I am not 100% sure. But could we have race condition here?
>     cpu 1: running bpf_timer_start() helper call
>     cpu 2: doing hrtimer work (calling callback etc.)
>
> Is it possible that
>    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> may be true and then right before bpf_prog_inc(t->prog), it becomes
> true? If hrtimer_callback_running() is called, it is possible that
> callback function could have dropped the reference count for t->prog,
> so we could already go into the body of the function
> __bpf_prog_put()?

you're correct. Indeed there is a race.
Circular dependency is a never ending headache.
That's the same design mistake as with tail_calls.
It felt that this case would be simpler than tail_calls and a bpf program
pinning itself with bpf_prog_inc can be made to work... nope.
I'll get rid of this and switch to something 'obviously correct'.
Probably a link list with a lock to keep a set of init-ed timers and
auto-cancel them on prog refcnt going to zero.
To do 'bpf daemon' the prog would need to be pinned.

> > +     if (val) {
> > +             /* This restriction will be removed in the next patch */
> > +             verbose(env, "bpf_timer field can only be first in the map value element\n");
> > +             return -EINVAL;
> > +     }
> > +     WARN_ON(meta->map_ptr);
>
> Could you explain when this could happen?

Only if there is a verifier bug or new helper is added with arg to timer
and arg to map. I'll switch to verbose() + efault instead.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11 22:12   ` Yonghong Song
@ 2021-06-15  3:33     ` Alexei Starovoitov
  2021-06-15  4:21       ` Yonghong Song
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-15  3:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Fri, Jun 11, 2021 at 3:12 PM Yonghong Song <yhs@fb.com> wrote:
> > +struct bpf_hrtimer {
> > +     struct hrtimer timer;
> > +     struct bpf_map *map;
> > +     struct bpf_prog *prog;
> > +     void *callback_fn;
> > +     void *value;
> > +};
> > +
> > +/* the actual struct hidden inside uapi struct bpf_timer */
> > +struct bpf_timer_kern {
> > +     struct bpf_hrtimer *timer;
> > +     struct bpf_spin_lock lock;
> > +};
>
> Looks like in 32bit system, sizeof(struct bpf_timer_kern) is 64
> and sizeof(struct bpf_timer) is 128.
>
> struct bpf_spin_lock {
>          __u32   val;
> };
>
> struct bpf_timer {
>         __u64 :64;
>         __u64 :64;
> };
>
> Checking the code, we may not have issues as structure
> "bpf_timer" is only used to reserve spaces and
> map copy value routine handles that properly.
>
> Maybe we can still make it consistent with
> two fields in bpf_timer_kern mapping to
> two fields in bpf_timer?
>
> struct bpf_timer_kern {
>         __bpf_md_ptr(struct bpf_hrtimer *, timer);
>         struct bpf_spin_lock lock;
> };

Such alignment of fields is not necessary,
since the fields are not accessible directly from bpf prog.
struct bpf_timer_kern needs to fit into struct bpf_timer and
alignof these two structs needs to be the same.
That's all. I'll add build_bug_on to make sure.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-15  3:33     ` Alexei Starovoitov
@ 2021-06-15  4:21       ` Yonghong Song
  0 siblings, 0 replies; 22+ messages in thread
From: Yonghong Song @ 2021-06-15  4:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team



On 6/14/21 8:33 PM, Alexei Starovoitov wrote:
> On Fri, Jun 11, 2021 at 3:12 PM Yonghong Song <yhs@fb.com> wrote:
>>> +struct bpf_hrtimer {
>>> +     struct hrtimer timer;
>>> +     struct bpf_map *map;
>>> +     struct bpf_prog *prog;
>>> +     void *callback_fn;
>>> +     void *value;
>>> +};
>>> +
>>> +/* the actual struct hidden inside uapi struct bpf_timer */
>>> +struct bpf_timer_kern {
>>> +     struct bpf_hrtimer *timer;
>>> +     struct bpf_spin_lock lock;
>>> +};
>>
>> Looks like in 32bit system, sizeof(struct bpf_timer_kern) is 64
>> and sizeof(struct bpf_timer) is 128.
>>
>> struct bpf_spin_lock {
>>           __u32   val;
>> };
>>
>> struct bpf_timer {
>>          __u64 :64;
>>          __u64 :64;
>> };
>>
>> Checking the code, we may not have issues as structure
>> "bpf_timer" is only used to reserve spaces and
>> map copy value routine handles that properly.
>>
>> Maybe we can still make it consistent with
>> two fields in bpf_timer_kern mapping to
>> two fields in bpf_timer?
>>
>> struct bpf_timer_kern {
>>          __bpf_md_ptr(struct bpf_hrtimer *, timer);
>>          struct bpf_spin_lock lock;
>> };
> 
> Such alignment of fields is not necessary,
> since the fields are not accessible directly from bpf prog.
> struct bpf_timer_kern needs to fit into struct bpf_timer and
> alignof these two structs needs to be the same.
> That's all. I'll add build_bug_on to make sure.

Sounds good to me. Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
                     ` (3 preceding siblings ...)
  2021-06-14 16:51   ` Yonghong Song
@ 2021-06-15  4:48   ` Andrii Nakryiko
  4 siblings, 0 replies; 22+ messages in thread
From: Andrii Nakryiko @ 2021-06-15  4:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Thu, Jun 10, 2021 at 9:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
>
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
>
> // Start the timer and set its expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, u64 nsec);
>
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
>
> Here is how BPF program might look like:
> struct map_elem {
>     int counter;
>     struct bpf_timer timer;
> };
>
> struct {
>     __uint(type, BPF_MAP_TYPE_HASH);
>     __uint(max_entries, 1000);
>     __type(key, int);
>     __type(value, struct map_elem);
> } hmap SEC(".maps");
>
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
>
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>     struct map_elem *val;
>     int key = 0;
>
>     val = bpf_map_lookup_elem(&hmap, &key);
>     if (val) {
>         bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
>         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
>     }
> }
>
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
>
> Only programs with CAP_BPF are allowed to use bpf_timer.
>
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
>
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
>
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Looks great!

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>  include/linux/bpf.h            |   2 +
>  include/uapi/linux/bpf.h       |  40 ++++++
>  kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          | 109 ++++++++++++++++
>  kernel/trace/bpf_trace.c       |   2 +-
>  scripts/bpf_doc.py             |   2 +
>  tools/include/uapi/linux/bpf.h |  40 ++++++
>  7 files changed, 421 insertions(+), 1 deletion(-)
>

[...]

> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *     Description
> + *             Initialize the timer to call *callback_fn* static function.
> + *             First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
> + *             CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> + *             All other bits of *flags* are reserved.
> + *     Return
> + *             0 on success.
> + *             **-EBUSY** if *timer* is already initialized.
> + *             **-EINVAL** if invalid *flags* are passed.
> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *     Description
> + *             Start the timer and set its expiration N nanoseconds from the
> + *             current time. The timer callback_fn will be invoked in soft irq
> + *             context on some cpu and will not repeat unless another
> + *             bpf_timer_start() is made. In such case the next invocation can
> + *             migrate to a different cpu.

This is a nice description, thanks.

> + *     Return
> + *             0 on success.
> + *             **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *     Description
> + *             Cancel the timer and wait for callback_fn to finish if it was running.
> + *     Return
> + *             0 if the timer was not active.
> + *             1 if the timer was active.
> + *             **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *             **-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
> + *             which would have led to a deadlock otherwise.
>   */

[...]

> +       ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> +                                           (u64)(long)key,
> +                                           (u64)(long)t->value, 0, 0);
> +       WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> +
> +       /* The bpf function finished executed. Drop the prog refcnt.

typo: execution

> +        * It could reach zero here and trigger free of bpf_prog
> +        * and subsequent free of the maps that were holding timers.
> +        * If callback_fn called bpf_timer_start on this timer
> +        * the prog refcnt will be > 0.
> +        *
> +        * If callback_fn deleted map element the 't' could have been freed,
> +        * hence t->prog deref is done earlier.
> +        */
> +       bpf_prog_put(prog);
> +       this_cpu_write(hrtimer_running, NULL);
> +       return HRTIMER_NORESTART;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-15  3:29     ` Alexei Starovoitov
@ 2021-06-15  5:31       ` Andrii Nakryiko
  2021-06-15  5:40         ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Andrii Nakryiko @ 2021-06-15  5:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > +                                         (u64)(long)key,
> > > +                                         (u64)(long)t->value, 0, 0);
> > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> >
> > I didn't find that next patch disallows callback return value 1 in the
> > verifier. If we indeed disallows return value 1 in the verifier. We
> > don't need WARN_ON here. Did I miss anything?
>
> Ohh. I forgot to address this bit in the verifier. Will fix.
>
> > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > +             /* If the timer wasn't active or callback already executing
> > > +              * bump the prog refcnt to keep it alive until
> > > +              * callback is invoked (again).
> > > +              */
> > > +             bpf_prog_inc(t->prog);
> >
> > I am not 100% sure. But could we have race condition here?
> >     cpu 1: running bpf_timer_start() helper call
> >     cpu 2: doing hrtimer work (calling callback etc.)
> >
> > Is it possible that
> >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > true? If hrtimer_callback_running() is called, it is possible that
> > callback function could have dropped the reference count for t->prog,
> > so we could already go into the body of the function
> > __bpf_prog_put()?
>
> you're correct. Indeed there is a race.
> Circular dependency is a never ending headache.
> That's the same design mistake as with tail_calls.
> It felt that this case would be simpler than tail_calls and a bpf program
> pinning itself with bpf_prog_inc can be made to work... nope.
> I'll get rid of this and switch to something 'obviously correct'.
> Probably a link list with a lock to keep a set of init-ed timers and
> auto-cancel them on prog refcnt going to zero.
> To do 'bpf daemon' the prog would need to be pinned.

Hm.. wouldn't this eliminate that race:

switch (hrtimer_try_to_cancel(&t->timer))
{
case 0:
    /* nothing was queued */
    bpf_prog_inc(t->prog);
    break;
case 1:
    /* already have refcnt and it won't be bpf_prog_put by callback */
    break;
case -1:
    /* callback is running and will bpf_prog_put, so we need to take
another refcnt */
    bpf_prog_inc(t->prog);
    break;
}
hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);

So instead of guessing (racily) whether there is a queued callback or
not, try to cancel just in case there is. Then rely on the nice
guarantees that hrtimer cancellation API provides.

Reading a bit more of hrtimer API, I'm more concerned now with the
per-cpu variable (hrtimer_running). Seems like the timer can get
migrated from one CPU to another, so all the auxiliary per-CPU state
might get invalidated without us knowing about that.

But it's getting late, I'll think about all this a bit more tomorrow
with a fresh head.

>
> > > +     if (val) {
> > > +             /* This restriction will be removed in the next patch */
> > > +             verbose(env, "bpf_timer field can only be first in the map value element\n");
> > > +             return -EINVAL;
> > > +     }
> > > +     WARN_ON(meta->map_ptr);
> >
> > Could you explain when this could happen?
>
> Only if there is a verifier bug or new helper is added with arg to timer
> and arg to map. I'll switch to verbose() + efault instead.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-15  5:31       ` Andrii Nakryiko
@ 2021-06-15  5:40         ` Alexei Starovoitov
  2021-06-15 15:24           ` Andrii Nakryiko
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-15  5:40 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > +                                         (u64)(long)key,
> > > > +                                         (u64)(long)t->value, 0, 0);
> > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > >
> > > I didn't find that next patch disallows callback return value 1 in the
> > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > don't need WARN_ON here. Did I miss anything?
> >
> > Ohh. I forgot to address this bit in the verifier. Will fix.
> >
> > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > +             /* If the timer wasn't active or callback already executing
> > > > +              * bump the prog refcnt to keep it alive until
> > > > +              * callback is invoked (again).
> > > > +              */
> > > > +             bpf_prog_inc(t->prog);
> > >
> > > I am not 100% sure. But could we have race condition here?
> > >     cpu 1: running bpf_timer_start() helper call
> > >     cpu 2: doing hrtimer work (calling callback etc.)
> > >
> > > Is it possible that
> > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > true? If hrtimer_callback_running() is called, it is possible that
> > > callback function could have dropped the reference count for t->prog,
> > > so we could already go into the body of the function
> > > __bpf_prog_put()?
> >
> > you're correct. Indeed there is a race.
> > Circular dependency is a never ending headache.
> > That's the same design mistake as with tail_calls.
> > It felt that this case would be simpler than tail_calls and a bpf program
> > pinning itself with bpf_prog_inc can be made to work... nope.
> > I'll get rid of this and switch to something 'obviously correct'.
> > Probably a link list with a lock to keep a set of init-ed timers and
> > auto-cancel them on prog refcnt going to zero.
> > To do 'bpf daemon' the prog would need to be pinned.
>
> Hm.. wouldn't this eliminate that race:
>
> switch (hrtimer_try_to_cancel(&t->timer))
> {
> case 0:
>     /* nothing was queued */
>     bpf_prog_inc(t->prog);
>     break;
> case 1:
>     /* already have refcnt and it won't be bpf_prog_put by callback */
>     break;
> case -1:
>     /* callback is running and will bpf_prog_put, so we need to take
> another refcnt */
>     bpf_prog_inc(t->prog);
>     break;
> }
> hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
>
> So instead of guessing (racily) whether there is a queued callback or
> not, try to cancel just in case there is. Then rely on the nice
> guarantees that hrtimer cancellation API provides.

I haven't thought it through yet, but the above approach could
indeed solve this particular race. Unfortunately there are other races.
There is an issue with bpf_timer_init. Since it doesn't take refcnt
another program might do lookup and bpf_timer_start
while the first prog got to refcnt=0 and got freed.
Adding refcnt to bpf_timer_init() makes the prog self pinned
and no callback might ever be executed (if there were no bpf_timer_start),
so that will cause a high chance of bpf prog stuck in the kernel.
There could be ref+uref schemes similar to tail_calls to address all that,
but it gets ugly quickly.
imo all these issues and races is a sign that such self pinning
shouldn't be allowed.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-11 18:45     ` Alexei Starovoitov
@ 2021-06-15  6:10       ` Cong Wang
  2021-06-16  4:53         ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Cong Wang @ 2021-06-15  6:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, Andrii Nakryiko,
	Linux Kernel Network Developers, bpf, kernel-team

On Fri, Jun 11, 2021 at 11:45 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> > On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
>
> Please stick to one email thread in the future, ok?
>
> I'll consolidate them here:
>
> > What is your use case to justify your own code? Asking because
> > you deny mine, so clearly my use case is not yours.
>
> I mentioned several use cases in the prior threads.
> tldr: any periodic event in tracing, networking, security.
> Garbage collection falls into this category as well, but please internalize
> that implementing conntrack as it is today in the kernel is an explicit non-goal.

You need to read my use case again, it is for the conntrack
in Cilium, not the kernel one.

>
> > And more importantly, why not just use BPF_TEST_RUN with
> > a user-space timer? Asking because you offer no API to read or
> > modify timer expiration, so literally the same with BPF_TEST_RUN
> > approach.
>
> a wrapper on top of hrtimer_get_remaining() like bpf_timer_get_remaining()
> is trivial to add, but what is the use case?

If you do not have any use case, then stick to BPF_TEST_RUN
with user-space timers? And of course your patches are not needed
at all.

>
> > >
> > > Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> > > in hash/array/lru maps as regular field and helpers to operate on it:
> >
> > Can be or has to be? Huge difference here.
>
> map elements don't have to use timers.

You interpret this in a wrong way, what I asked is whether a bpf timer has
to be embedded in a map. IOW, can a bpf timer be a standalone global
data?

>
> > In the other thread, you said it is global data, which implies that it does
> > not have to be in a map.
>
> global data is a map. That was explained in the prior thread as well.
>

I think you implied bpf timer can exist without a map, hence I am asking.


> >
> > In your test case or your example, all timers are still in a map. So what has
> > changed since then? Looks nothing to me.
>
> look again?

Yes, I just looked at it again, only more confusing, not less.

>
> > Hmm, finally you begin refcounting it, which you were strongly against. ;)
>
> That was already answered in the prior thread.
> tldr: there were two options. This is one of them. Another can be added
> in the future as well.
>
> > Three questions:
> >
> > 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
>
> yes.

Good. So if a program which only initializes the timer and then exits,
the other program which shares this timer will crash when it calls
bpf_timer_start(), right?

>
> > If the timer subprog is always in the same prog which installs it, then
>
> installs it? I'm not following the quesiton.
>
> > this is fine. But then how do multiple programs share a timer?
>
> there is only one callback function.

That's exactly my question. How is one callback function shared
by multiple eBPF programs which want to share the timer?


>
> > In the
> > case of conntrack, either ingress or egress could install the timer,
> > it only depends which one gets traffic first. Do they have to copy
> > the same subprog for the same timer?
>
> conntrack is an explicit non-goal.

I interpret this as you do not want timers to be shared by multiple
eBPF programs, correct? Weirdly, maps are shared, timers are
placed in a map, so timers should be shared naturally too.

>
> >
> > 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> > again?
>
> If it's already armed with the first bpf_timer_start() it won't be freed.

Why? I see t->prog is released in your timer callback:

+       bpf_prog_put(prog);
+       this_cpu_write(hrtimer_running, NULL);
+       return HRTIMER_NORESTART;

>
> > It gets a refcnt when starting a timer and puts it when cancelling
> > or expired, so t->prog can be freed right after cancelling or expired. What
> > if another program which shares this timer wants to restart this timer?
>
> There is only one callback_fn per timer. Another program can share
> the struct bpf_timer and the map. It might have subprog callback_fn code
> that looks exactly the same as callback_fn in the first prog.
> For example when libbpf loads progs/timer.c (after it was compiled into .o)
> it might share a subprog in the future (when kernel has support for
> dynamic linking). From bpf user pov it's a single .c file.
> The split into programs and subprograms is an implemenation detail
> that C programmer doesn't need to worry about.

Not exactly, they share a same C file but still can be loaded/unloaded
separately. And logically, whether a timer has been initialized once or
twice makes a huge difference for programers.

>
> > 3. Since you offer no API to read the expiration time, why not just use
> > BPF_TEST_RUN with a user-space timer? This is preferred by Andrii.
>
> Andrii point was that there should be no syscall cmds that replicate
> bpf_timer_init/start/cancel helpers. I agree with this.

Actually there is no strong reason to bother a bpf timer unless you
want to access the timer itself, which mostly contains expiration.
User-space timers work just fine for your cases, even if not, extending
BPF_TEST_RUN should too.

>
>
> > Thanks.
> >
> > Another unpopular point of view:
> >
> > This init() is not suitable for bpf programs, because unlike kernel modules,
> > there is no init or exit functions for a bpf program. And timer init
> > is typically
> > called during module init.
>
> Already answerd this in the prior thread. There will be __init and __fini like
> subprograms in bpf progs.

I interpret this as init does not make sense until we have __init and __fini?

>
> Please apply the patches to your local tree and do few experiments based
> on selftests/bpf/progs/timer.c. I think experimenting with the code
> will answer all of your questions.

Sounds like you find a great excuse for a failure of documentation.
What I asked are just fundamental design questions you should have
covered in your cover letter, which is literally empty.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-15  5:40         ` Alexei Starovoitov
@ 2021-06-15 15:24           ` Andrii Nakryiko
  2021-06-16  4:26             ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Andrii Nakryiko @ 2021-06-15 15:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > +                                         (u64)(long)key,
> > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > >
> > > > I didn't find that next patch disallows callback return value 1 in the
> > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > don't need WARN_ON here. Did I miss anything?
> > >
> > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > >
> > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > +             /* If the timer wasn't active or callback already executing
> > > > > +              * bump the prog refcnt to keep it alive until
> > > > > +              * callback is invoked (again).
> > > > > +              */
> > > > > +             bpf_prog_inc(t->prog);
> > > >
> > > > I am not 100% sure. But could we have race condition here?
> > > >     cpu 1: running bpf_timer_start() helper call
> > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > >
> > > > Is it possible that
> > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > callback function could have dropped the reference count for t->prog,
> > > > so we could already go into the body of the function
> > > > __bpf_prog_put()?
> > >
> > > you're correct. Indeed there is a race.
> > > Circular dependency is a never ending headache.
> > > That's the same design mistake as with tail_calls.
> > > It felt that this case would be simpler than tail_calls and a bpf program
> > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > I'll get rid of this and switch to something 'obviously correct'.
> > > Probably a link list with a lock to keep a set of init-ed timers and
> > > auto-cancel them on prog refcnt going to zero.
> > > To do 'bpf daemon' the prog would need to be pinned.
> >
> > Hm.. wouldn't this eliminate that race:
> >
> > switch (hrtimer_try_to_cancel(&t->timer))
> > {
> > case 0:
> >     /* nothing was queued */
> >     bpf_prog_inc(t->prog);
> >     break;
> > case 1:
> >     /* already have refcnt and it won't be bpf_prog_put by callback */
> >     break;
> > case -1:
> >     /* callback is running and will bpf_prog_put, so we need to take
> > another refcnt */
> >     bpf_prog_inc(t->prog);
> >     break;
> > }
> > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> >
> > So instead of guessing (racily) whether there is a queued callback or
> > not, try to cancel just in case there is. Then rely on the nice
> > guarantees that hrtimer cancellation API provides.
>
> I haven't thought it through yet, but the above approach could
> indeed solve this particular race. Unfortunately there are other races.
> There is an issue with bpf_timer_init. Since it doesn't take refcnt
> another program might do lookup and bpf_timer_start
> while the first prog got to refcnt=0 and got freed.

I think it's because of an API design. bpf_timer_init() takes a
callback (i.e., bpf_prog) but doesn't really do anything with it (so
doesn't take refcnt). It's both problematic, as you point out, and
unnecessarily restricting because it doesn't allow to change the
callback (e.g., when map is shared and bpf_program has to be changed).
If you change API to be:

long bpf_timer_init(struct bpf_timer *timer, int flags);
long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);

You'll avoid this problem because bpf_timer_start will take refcnt
when arming (or re-arming) the timer. bpf_timer_init() will only take
care of initial memory allocation and hrtimer_init, but will leave
timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
the problems and be more flexible/powerful? If necessary, we can teach
bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
pointer, but I haven't thought much about whether that's necessary.

If we wanted to push this to extreme, btw, we don't really need
bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
very first time (having pre-allocated spinlock makes this non-racy and
easy). But I don't know how expensive hrtimer_init() is, so it might
still make sense to split those two operations. Further, merging
bpf_timer_init() and bpf_timer_start() would require 6 input
arguments, which is a bit problematic. I have an idea how to get rid
of the necessity to pass in bpf_prog (so we'll be fine with just 5
explicit arguments), which simplifies other things (like
bpf_cgroup_storage implementation) as well, but I don't have patches
yet.

> Adding refcnt to bpf_timer_init() makes the prog self pinned
> and no callback might ever be executed (if there were no bpf_timer_start),
> so that will cause a high chance of bpf prog stuck in the kernel.
> There could be ref+uref schemes similar to tail_calls to address all that,
> but it gets ugly quickly.
> imo all these issues and races is a sign that such self pinning
> shouldn't be allowed.

I think prog refcounting is actually the saner and less surprising
approach here and we just need to spend a bit more time thinking how
to make everything work reliably. hrtimer API seems to be designed to
handle cases like this, which makes everything much easier.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-15 15:24           ` Andrii Nakryiko
@ 2021-06-16  4:26             ` Alexei Starovoitov
  2021-06-16  5:54               ` Andrii Nakryiko
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-16  4:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Tue, Jun 15, 2021 at 08:24:13AM -0700, Andrii Nakryiko wrote:
> On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > > +                                         (u64)(long)key,
> > > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > > >
> > > > > I didn't find that next patch disallows callback return value 1 in the
> > > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > > don't need WARN_ON here. Did I miss anything?
> > > >
> > > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > > >
> > > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > > +             /* If the timer wasn't active or callback already executing
> > > > > > +              * bump the prog refcnt to keep it alive until
> > > > > > +              * callback is invoked (again).
> > > > > > +              */
> > > > > > +             bpf_prog_inc(t->prog);
> > > > >
> > > > > I am not 100% sure. But could we have race condition here?
> > > > >     cpu 1: running bpf_timer_start() helper call
> > > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > > >
> > > > > Is it possible that
> > > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > > callback function could have dropped the reference count for t->prog,
> > > > > so we could already go into the body of the function
> > > > > __bpf_prog_put()?
> > > >
> > > > you're correct. Indeed there is a race.
> > > > Circular dependency is a never ending headache.
> > > > That's the same design mistake as with tail_calls.
> > > > It felt that this case would be simpler than tail_calls and a bpf program
> > > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > > I'll get rid of this and switch to something 'obviously correct'.
> > > > Probably a link list with a lock to keep a set of init-ed timers and
> > > > auto-cancel them on prog refcnt going to zero.
> > > > To do 'bpf daemon' the prog would need to be pinned.
> > >
> > > Hm.. wouldn't this eliminate that race:
> > >
> > > switch (hrtimer_try_to_cancel(&t->timer))
> > > {
> > > case 0:
> > >     /* nothing was queued */
> > >     bpf_prog_inc(t->prog);
> > >     break;
> > > case 1:
> > >     /* already have refcnt and it won't be bpf_prog_put by callback */
> > >     break;
> > > case -1:
> > >     /* callback is running and will bpf_prog_put, so we need to take
> > > another refcnt */
> > >     bpf_prog_inc(t->prog);
> > >     break;
> > > }
> > > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);

Turned out that this approach has the same race as Yonghong mentioned.
Calling hrtimer_callback_running() directly or through hrtimer_try_to_cancel()
with extra cpu_base->lock as in above doesn't prevent the race.
bpf_prog_put could have already happened and above case:
 case -1:
     /* callback is running and will bpf_prog_put, so we need to take
 another refcnt */
     bpf_prog_inc(t->prog);

would be incrementing refcnt from zero.

> > >
> > > So instead of guessing (racily) whether there is a queued callback or
> > > not, try to cancel just in case there is. Then rely on the nice
> > > guarantees that hrtimer cancellation API provides.
> >
> > I haven't thought it through yet, but the above approach could
> > indeed solve this particular race. Unfortunately there are other races.
> > There is an issue with bpf_timer_init. Since it doesn't take refcnt
> > another program might do lookup and bpf_timer_start
> > while the first prog got to refcnt=0 and got freed.
> 
> I think it's because of an API design. bpf_timer_init() takes a
> callback (i.e., bpf_prog) but doesn't really do anything with it (so
> doesn't take refcnt). It's both problematic, as you point out, and
> unnecessarily restricting because it doesn't allow to change the
> callback (e.g., when map is shared and bpf_program has to be changed).
> If you change API to be:
> 
> long bpf_timer_init(struct bpf_timer *timer, int flags);
> long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);
> 
> You'll avoid this problem because bpf_timer_start will take refcnt
> when arming (or re-arming) the timer. bpf_timer_init() will only take
> care of initial memory allocation and hrtimer_init, but will leave
> timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
> the problems and be more flexible/powerful?

Unfortunately no. The race would still be present and I don't see
a clean way of solving.

> If necessary, we can teach
> bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
> pointer, but I haven't thought much about whether that's necessary.

That doesn't help either.
hrtimer_try_to_cancel() returning -1 (or checking it via
hrtimer_callback_running) doesn't mean that refcnt > 0.
It could have reached zero in bpf_timer_cb.
I thought whether introducing bpf_prog_inc_if_not_zero()
and using it in bpf_timer_start() could solve it...
Nope. The prog pointer could be already be freed if processing
of bpf_timer_cb is slow enough.
Then I thought whether we can move refcnt from prog into
'struct bpf_timer_kern'...
Then considered ref/uref counting...
It's slippery slop of wrong turns.

> If we wanted to push this to extreme, btw, we don't really need
> bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
> very first time (having pre-allocated spinlock makes this non-racy and
> easy). But I don't know how expensive hrtimer_init() is, so it might
> still make sense to split those two operations.

hrtimer_init is cheap, but bpf_timer_init() is expensive due
to memory allocation.
It's done once, so arguably should be ok,
but I'd like to avoid reinventing the wheel and stick
to api-s similar to hrtimer.

> Further, merging
> bpf_timer_init() and bpf_timer_start() would require 6 input
> arguments, which is a bit problematic. I have an idea how to get rid
> of the necessity to pass in bpf_prog (so we'll be fine with just 5
> explicit arguments), which simplifies other things (like
> bpf_cgroup_storage implementation) as well, but I don't have patches
> yet.
> 
> > Adding refcnt to bpf_timer_init() makes the prog self pinned
> > and no callback might ever be executed (if there were no bpf_timer_start),
> > so that will cause a high chance of bpf prog stuck in the kernel.
> > There could be ref+uref schemes similar to tail_calls to address all that,
> > but it gets ugly quickly.
> > imo all these issues and races is a sign that such self pinning
> > shouldn't be allowed.
> 
> I think prog refcounting is actually the saner and less surprising
> approach here and we just need to spend a bit more time thinking how
> to make everything work reliably. hrtimer API seems to be designed to
> handle cases like this, which makes everything much easier.

I made two 180 degree turns already. In the beginning I was strongly
against circular dependencies since old history of tail_call taught us
a valuable lesson. Then somehow convinced myself that this time around it will
be different and went with this crazy refcnting scheme. The last couple
weeks you, me, Yonghong, Toke and others invested countless hours thinking
through the race conditions. It's a sign that design took the wrong turn.
Circular dependencies must be avoided if possible. Here it's such case.
There is no need to complicate bpf_timer with crazy refcnting schemes.
The user space can simply pin the program in bpffs. In the future we might
introduce a self-pinning helper that would pin the program and create a file.
Sort-of like syscall prog type would pin self.
That would be explicit and clean api instead of obscure hacks inside bpf_timer*().

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-15  6:10       ` Cong Wang
@ 2021-06-16  4:53         ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-16  4:53 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Daniel Borkmann, Andrii Nakryiko,
	Linux Kernel Network Developers, bpf, kernel-team

On Mon, Jun 14, 2021 at 11:10:46PM -0700, Cong Wang wrote:
> On Fri, Jun 11, 2021 at 11:45 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> > > On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Please stick to one email thread in the future, ok?
> >
> > I'll consolidate them here:
> >
> > > What is your use case to justify your own code? Asking because
> > > you deny mine, so clearly my use case is not yours.
> >
> > I mentioned several use cases in the prior threads.
> > tldr: any periodic event in tracing, networking, security.
> > Garbage collection falls into this category as well, but please internalize
> > that implementing conntrack as it is today in the kernel is an explicit non-goal.
> 
> You need to read my use case again, it is for the conntrack
> in Cilium, not the kernel one.

Cong,

the toxicity in your reply is a bit too much.
Clearly you're very upset with something.
You cannot make peace that your timer patches were rejected?
You realize that they were 'changes requested' with specific
feedback of what you can do?
You've said that it's not possible and came up with reasons to believe so.
I had no choice, but to implement the suggested changes myself.
Why? Because the requests for bpf timers came up many times over the years.
The first time it was in 2013 when bpf didn't even exist upstream.
It was still in RFC stages. Then during XDP development, etc.
Everytime folks who needed timers found a workaround.
Even today the garbage collection can be implemented in the bpf prog
without PROG_RUN cmd and user space. The prog can be attached
to periodic perf event.
If cilium conntrack needed garbage collection they could have
implemented it long ago without user space daemons and kernel additions.

> > > 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
> >
> > yes.
> 
> Good. So if a program which only initializes the timer and then exits,
> the other program which shares this timer will crash when it calls
> bpf_timer_start(), right?

Correct. There is a discussion about it in a different thread.

> 
> >
> > > If the timer subprog is always in the same prog which installs it, then
> >
> > installs it? I'm not following the quesiton.
> >
> > > this is fine. But then how do multiple programs share a timer?
> >
> > there is only one callback function.
> 
> That's exactly my question. How is one callback function shared
> by multiple eBPF programs which want to share the timer?

callback function is a piece of .text. Inside C program it's written once
and then compiled once into single piece of BPF asm code by llvm.
Later libbpf can copy-paste it into many bpf programs.
The C programmer doesn't see it and doesn't need to worry about it.
From the kernel memory consumption point of view it's a bit inefficient.
In the future we will add support for dynamic linking in the kernel
and in libbpf. The bpf progs will be able to share already loaded subprograms.

> 
> >
> > > In the
> > > case of conntrack, either ingress or egress could install the timer,
> > > it only depends which one gets traffic first. Do they have to copy
> > > the same subprog for the same timer?
> >
> > conntrack is an explicit non-goal.
> 
> I interpret this as you do not want timers to be shared by multiple
> eBPF programs, correct? Weirdly, maps are shared, timers are
> placed in a map, so timers should be shared naturally too.

I only meant that re-implementing kernel conntrack as a bpf program
is an explicit non-goal.
Meaning that people can do it and with this timer api it can be easily done.
But it's explicitly excluded from api requirements.
It doesn't mean that it's hard to do. It means that reimplenting kernel
conntrack as-is with non-scalable garbage collection algorithm
is outside of the scope of this work.

> > > 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> > > again?
> >
> > If it's already armed with the first bpf_timer_start() it won't be freed.
> 
> Why? I see t->prog is released in your timer callback:

That's another race and there is another thread where it's being discussed.
Please read the whole thread to get on the same page with everyone else.

> 
> Not exactly, they share a same C file but still can be loaded/unloaded
> separately. And logically, whether a timer has been initialized once or
> twice makes a huge difference for programers.

I've looked how kernel is using hrtimer apis and didn't find a single
case where timer function is redefined or more than one timer callback is used
with single hrtimer.

> Sounds like you find a great excuse for a failure of documentation.

I got different feedback regarding documentation that is already present in
the patches, but I'll expand the comments and documentation further
to make it more clear.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-16  4:26             ` Alexei Starovoitov
@ 2021-06-16  5:54               ` Andrii Nakryiko
  2021-06-16 16:52                 ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Andrii Nakryiko @ 2021-06-16  5:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

On Tue, Jun 15, 2021 at 9:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jun 15, 2021 at 08:24:13AM -0700, Andrii Nakryiko wrote:
> > On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > > > +                                         (u64)(long)key,
> > > > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > > > >
> > > > > > I didn't find that next patch disallows callback return value 1 in the
> > > > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > > > don't need WARN_ON here. Did I miss anything?
> > > > >
> > > > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > > > >
> > > > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > > > +             /* If the timer wasn't active or callback already executing
> > > > > > > +              * bump the prog refcnt to keep it alive until
> > > > > > > +              * callback is invoked (again).
> > > > > > > +              */
> > > > > > > +             bpf_prog_inc(t->prog);
> > > > > >
> > > > > > I am not 100% sure. But could we have race condition here?
> > > > > >     cpu 1: running bpf_timer_start() helper call
> > > > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > > > >
> > > > > > Is it possible that
> > > > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > > > callback function could have dropped the reference count for t->prog,
> > > > > > so we could already go into the body of the function
> > > > > > __bpf_prog_put()?
> > > > >
> > > > > you're correct. Indeed there is a race.
> > > > > Circular dependency is a never ending headache.
> > > > > That's the same design mistake as with tail_calls.
> > > > > It felt that this case would be simpler than tail_calls and a bpf program
> > > > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > > > I'll get rid of this and switch to something 'obviously correct'.
> > > > > Probably a link list with a lock to keep a set of init-ed timers and
> > > > > auto-cancel them on prog refcnt going to zero.
> > > > > To do 'bpf daemon' the prog would need to be pinned.
> > > >
> > > > Hm.. wouldn't this eliminate that race:
> > > >
> > > > switch (hrtimer_try_to_cancel(&t->timer))
> > > > {
> > > > case 0:
> > > >     /* nothing was queued */
> > > >     bpf_prog_inc(t->prog);
> > > >     break;
> > > > case 1:
> > > >     /* already have refcnt and it won't be bpf_prog_put by callback */
> > > >     break;
> > > > case -1:
> > > >     /* callback is running and will bpf_prog_put, so we need to take
> > > > another refcnt */
> > > >     bpf_prog_inc(t->prog);
> > > >     break;
> > > > }
> > > > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
>
> Turned out that this approach has the same race as Yonghong mentioned.
> Calling hrtimer_callback_running() directly or through hrtimer_try_to_cancel()
> with extra cpu_base->lock as in above doesn't prevent the race.
> bpf_prog_put could have already happened and above case:
>  case -1:
>      /* callback is running and will bpf_prog_put, so we need to take
>  another refcnt */
>      bpf_prog_inc(t->prog);
>
> would be incrementing refcnt from zero.
>
> > > >
> > > > So instead of guessing (racily) whether there is a queued callback or
> > > > not, try to cancel just in case there is. Then rely on the nice
> > > > guarantees that hrtimer cancellation API provides.
> > >
> > > I haven't thought it through yet, but the above approach could
> > > indeed solve this particular race. Unfortunately there are other races.
> > > There is an issue with bpf_timer_init. Since it doesn't take refcnt
> > > another program might do lookup and bpf_timer_start
> > > while the first prog got to refcnt=0 and got freed.
> >
> > I think it's because of an API design. bpf_timer_init() takes a
> > callback (i.e., bpf_prog) but doesn't really do anything with it (so
> > doesn't take refcnt). It's both problematic, as you point out, and
> > unnecessarily restricting because it doesn't allow to change the
> > callback (e.g., when map is shared and bpf_program has to be changed).
> > If you change API to be:
> >
> > long bpf_timer_init(struct bpf_timer *timer, int flags);
> > long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);
> >
> > You'll avoid this problem because bpf_timer_start will take refcnt
> > when arming (or re-arming) the timer. bpf_timer_init() will only take
> > care of initial memory allocation and hrtimer_init, but will leave
> > timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
> > the problems and be more flexible/powerful?
>
> Unfortunately no. The race would still be present and I don't see
> a clean way of solving.
>
> > If necessary, we can teach
> > bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
> > pointer, but I haven't thought much about whether that's necessary.
>
> That doesn't help either.
> hrtimer_try_to_cancel() returning -1 (or checking it via
> hrtimer_callback_running) doesn't mean that refcnt > 0.
> It could have reached zero in bpf_timer_cb.
> I thought whether introducing bpf_prog_inc_if_not_zero()
> and using it in bpf_timer_start() could solve it...
> Nope. The prog pointer could be already be freed if processing
> of bpf_timer_cb is slow enough.
> Then I thought whether we can move refcnt from prog into
> 'struct bpf_timer_kern'...
> Then considered ref/uref counting...
> It's slippery slop of wrong turns.
>
> > If we wanted to push this to extreme, btw, we don't really need
> > bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
> > very first time (having pre-allocated spinlock makes this non-racy and
> > easy). But I don't know how expensive hrtimer_init() is, so it might
> > still make sense to split those two operations.
>
> hrtimer_init is cheap, but bpf_timer_init() is expensive due
> to memory allocation.
> It's done once, so arguably should be ok,
> but I'd like to avoid reinventing the wheel and stick
> to api-s similar to hrtimer.
>
> > Further, merging
> > bpf_timer_init() and bpf_timer_start() would require 6 input
> > arguments, which is a bit problematic. I have an idea how to get rid
> > of the necessity to pass in bpf_prog (so we'll be fine with just 5
> > explicit arguments), which simplifies other things (like
> > bpf_cgroup_storage implementation) as well, but I don't have patches
> > yet.
> >
> > > Adding refcnt to bpf_timer_init() makes the prog self pinned
> > > and no callback might ever be executed (if there were no bpf_timer_start),
> > > so that will cause a high chance of bpf prog stuck in the kernel.
> > > There could be ref+uref schemes similar to tail_calls to address all that,
> > > but it gets ugly quickly.
> > > imo all these issues and races is a sign that such self pinning
> > > shouldn't be allowed.
> >
> > I think prog refcounting is actually the saner and less surprising
> > approach here and we just need to spend a bit more time thinking how
> > to make everything work reliably. hrtimer API seems to be designed to
> > handle cases like this, which makes everything much easier.
>
> I made two 180 degree turns already. In the beginning I was strongly
> against circular dependencies since old history of tail_call taught us
> a valuable lesson. Then somehow convinced myself that this time around it will
> be different and went with this crazy refcnting scheme. The last couple
> weeks you, me, Yonghong, Toke and others invested countless hours thinking
> through the race conditions. It's a sign that design took the wrong turn.
> Circular dependencies must be avoided if possible. Here it's such case.

It could be the case, of course. But let's try to think this through
to the end before giving up. I think it's mostly because we are trying
to be too clever with lockless synchronization. I'm still not
convinced that it's a bad design.

I had a feeling that bpf_timer_cb needs to take lock as well. But once
we add that, refcounting becomes simpler and more deterministic, IMO.
Here's what I have in mind. I keep only important parts of the code,
so it's not a complete implementation. Please take a look below, I
left a few comments here and there.


struct bpf_hrtimer {
       struct hrtimer timer;
       struct bpf_map *map;
       void *value;

       struct bpf_prog *prog;
       void *callback_fn;

       /* pointer to that lock in struct bpf_timer_kern
        * so that we can access it from bpf_timer_cb()
        */
       struct bpf_spin_lock *lock;
};

BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
           struct bpf_map *, map)
{
       struct bpf_hrtimer *t;
       int ret = 0;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (t) {
               ret = -EBUSY;
               goto out;
       }
       /* allocate hrtimer via map_kmalloc to use memcg accounting */
       t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
       if (!t) {
               ret = -ENOMEM;
               goto out;
       }
       t->value = (void *)timer /* - offset of bpf_timer inside elem */;
       t->map = map;
       t->timer.function = bpf_timer_cb;

       /* we'll init them in bpf_timer_start */
       t->prog = NULL;
       t->callback_fn = NULL;

       hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
       timer->timer = t;
out:
       ____bpf_spin_unlock(&timer->lock);
       return ret;
}


BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
           void *, cb, struct bpf_prog *, prog)
{
       struct bpf_hrtimer *t;
       int ret = 0;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (!t) {
               ret = -EINVAL;
               goto out;
       }

       /* doesn't matter what it returns, we just request cancellation */
       hrtimer_try_to_cancel(&t->timer);

       /* t->prog might not be the same as prog (!) */
       if (prog != t->prog) {
            /* callback hasn't yet dropped refcnt */
           if (t->prog) /* if it's null bpf_timer_cb() is running and
will put it later */
               bpf_prog_put(t->prog);

           if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
               /* this will only happen if prog is still running (and
it's actually us),
                * but it was already put to zero, e.g., by closing last FD,
                * so there is no point in scheduling a new run
                */
               t->prog = NULL;
               t->callback_fn = NULL;
               ret = -E_WE_ARE_SHUTTING_DOWN;
               goto out;
           }
       } /* otherwise we keep existing refcnt on t->prog == prog */

       /* potentially new combination of prog and cb */
       t->prog = prog;
       t->callback_fn = cb;

       hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
out:
       ____bpf_spin_unlock(&timer->lock);
       return ret;
}

BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
{
       struct bpf_hrtimer *t;
       int ret = 0;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (!t) {
               ret = -EINVAL;
               goto out;
       }

       /* this part I still worry about due to possibility of cpu migration,
        * we need to think if we should migrate_disable() in bpf_timer_cb()
        * and bpf_timer_* helpers(), but that's a separate topic
        */
       if (this_cpu_read(hrtimer_running) == t) {
               ret = -EDEADLK;
               goto out;
       }

       ret = hrtimer_cancel(&t->timer);

       if (t->prog) {
            /* bpf_timer_cb hasn't put it yet (and now won't) */
            bpf_prog_put(t->prog);
            t->prog = NULL;
            t->callback_fn = NULL;
       }
out:
       ____bpf_spin_unlock(&timer->lock);
       return ret;
}

static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
{
       struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
       struct bpf_map *map = t->map;
       struct bpf_prog *prog;
       void *key, *callback_fn;
       u32 idx;
       int ret;

       /* this is very IMPORTANT  */
       ____bpf_spin_lock(t->lock);

       prog = t->prog;
       if (!prog) {
           /* we were cancelled, prog is put already, exit early */
           ____bpf_spin_unlock(&timer->lock);
           return HRTIMER_NORESTART;
       }
       callback_fn = t->callback_fn;

       /* make sure bpf_timer_cancel/bpf_timer_start won't
bpf_prog_put our prog */
       t->prog = NULL;
       t->callback_fn = NULL;

       ____bpf_spin_unlock(t->lock);

       /* at this point we "own" prog's refcnt decrement */

       this_cpu_write(hrtimer_running, t);

       ...

       ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
                                           (u64)(long)key,
                                           (u64)(long)value, 0, 0);
       WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */

       bpf_prog_put(prog); /* always correct and non-racy */

       this_cpu_write(hrtimer_running, NULL);

       return HRTIMER_NORESTART;
}

bpf_timer_cancel_and_free() is mostly the same with t->prog NULL check
as everywhere else

> There is no need to complicate bpf_timer with crazy refcnting schemes.
> The user space can simply pin the program in bpffs. In the future we might
> introduce a self-pinning helper that would pin the program and create a file.
> Sort-of like syscall prog type would pin self.
> That would be explicit and clean api instead of obscure hacks inside bpf_timer*().

Do I understand correctly that the alternative that you are proposing
is to keep some linked list of all map_values across all maps in the
system that have initialized bpf_hrtimer with that particular bpf_prog
in them? And when bpf_prog is put to zero you'll go and destruct them
all in a race-free way?

I have a bit of a hard time imagining how that will be implemented
exactly, so I might be overcomplicating that in my mind. Will be happy
to see the working code.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-16  5:54               ` Andrii Nakryiko
@ 2021-06-16 16:52                 ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2021-06-16 16:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, David S. Miller, Daniel Borkmann, Andrii Nakryiko,
	Network Development, bpf, Kernel Team

[-- Attachment #1: Type: text/plain, Size: 7703 bytes --]

On Tue, Jun 15, 2021 at 10:54:40PM -0700, Andrii Nakryiko wrote:
> 
> It could be the case, of course. But let's try to think this through
> to the end before giving up. I think it's mostly because we are trying
> to be too clever with lockless synchronization.

imo your proposed code fits "too clever" too ;)
Just a reminder that few emails ago you've argued 
about "obviously correct" approach, but now...

> I had a feeling that bpf_timer_cb needs to take lock as well. But once
> we add that, refcounting becomes simpler and more deterministic, IMO.
> Here's what I have in mind. I keep only important parts of the code,
> so it's not a complete implementation. Please take a look below, I
> left a few comments here and there.
> 
> 
> struct bpf_hrtimer {
>        struct hrtimer timer;
>        struct bpf_map *map;
>        void *value;
> 
>        struct bpf_prog *prog;
>        void *callback_fn;
> 
>        /* pointer to that lock in struct bpf_timer_kern
>         * so that we can access it from bpf_timer_cb()
>         */
>        struct bpf_spin_lock *lock;
> };
> 
> BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
>            struct bpf_map *, map)
> {
>        struct bpf_hrtimer *t;
>        int ret = 0;
> 
>        ____bpf_spin_lock(&timer->lock);
>        t = timer->timer;
>        if (t) {
>                ret = -EBUSY;
>                goto out;
>        }
>        /* allocate hrtimer via map_kmalloc to use memcg accounting */
>        t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
>        if (!t) {
>                ret = -ENOMEM;
>                goto out;
>        }
>        t->value = (void *)timer /* - offset of bpf_timer inside elem */;
>        t->map = map;
>        t->timer.function = bpf_timer_cb;
> 
>        /* we'll init them in bpf_timer_start */
>        t->prog = NULL;
>        t->callback_fn = NULL;
> 
>        hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>        timer->timer = t;
> out:
>        ____bpf_spin_unlock(&timer->lock);
>        return ret;
> }
> 
> 
> BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
>            void *, cb, struct bpf_prog *, prog)
> {
>        struct bpf_hrtimer *t;
>        int ret = 0;
> 
>        ____bpf_spin_lock(&timer->lock);
>        t = timer->timer;
>        if (!t) {
>                ret = -EINVAL;
>                goto out;
>        }
> 
>        /* doesn't matter what it returns, we just request cancellation */
>        hrtimer_try_to_cancel(&t->timer);
> 
>        /* t->prog might not be the same as prog (!) */
>        if (prog != t->prog) {
>             /* callback hasn't yet dropped refcnt */
>            if (t->prog) /* if it's null bpf_timer_cb() is running and
> will put it later */
>                bpf_prog_put(t->prog);
> 
>            if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
>                /* this will only happen if prog is still running (and
> it's actually us),
>                 * but it was already put to zero, e.g., by closing last FD,
>                 * so there is no point in scheduling a new run
>                 */

I have a bit of mind explosion here... everything will be alright.

>                t->prog = NULL;
>                t->callback_fn = NULL;
>                ret = -E_WE_ARE_SHUTTING_DOWN;
>                goto out;
>            }
>        } /* otherwise we keep existing refcnt on t->prog == prog */
> 
>        /* potentially new combination of prog and cb */
>        t->prog = prog;
>        t->callback_fn = cb;
> 
>        hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> out:
>        ____bpf_spin_unlock(&timer->lock);
>        return ret;
> }
> 
> BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> {
>        struct bpf_hrtimer *t;
>        int ret = 0;
> 
>        ____bpf_spin_lock(&timer->lock);
>        t = timer->timer;
>        if (!t) {
>                ret = -EINVAL;
>                goto out;
>        }
> 
>        /* this part I still worry about due to possibility of cpu migration,
>         * we need to think if we should migrate_disable() in bpf_timer_cb()
>         * and bpf_timer_* helpers(), but that's a separate topic
>         */
>        if (this_cpu_read(hrtimer_running) == t) {
>                ret = -EDEADLK;
>                goto out;
>        }
> 
>        ret = hrtimer_cancel(&t->timer);
> 
>        if (t->prog) {
>             /* bpf_timer_cb hasn't put it yet (and now won't) */
>             bpf_prog_put(t->prog);
>             t->prog = NULL;
>             t->callback_fn = NULL;
>        }
> out:
>        ____bpf_spin_unlock(&timer->lock);
>        return ret;
> }
> 
> static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> {
>        struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
>        struct bpf_map *map = t->map;
>        struct bpf_prog *prog;
>        void *key, *callback_fn;
>        u32 idx;
>        int ret;
> 
>        /* this is very IMPORTANT  */
>        ____bpf_spin_lock(t->lock);
> 
>        prog = t->prog;
>        if (!prog) {
>            /* we were cancelled, prog is put already, exit early */
>            ____bpf_spin_unlock(&timer->lock);
>            return HRTIMER_NORESTART;
>        }
>        callback_fn = t->callback_fn;
> 
>        /* make sure bpf_timer_cancel/bpf_timer_start won't
> bpf_prog_put our prog */
>        t->prog = NULL;
>        t->callback_fn = NULL;
> 
>        ____bpf_spin_unlock(t->lock);
> 
>        /* at this point we "own" prog's refcnt decrement */
> 
>        this_cpu_write(hrtimer_running, t);
> 
>        ...
> 
>        ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
>                                            (u64)(long)key,
>                                            (u64)(long)value, 0, 0);
>        WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> 
>        bpf_prog_put(prog); /* always correct and non-racy */
> 
>        this_cpu_write(hrtimer_running, NULL);
> 
>        return HRTIMER_NORESTART;
> }
> 
> bpf_timer_cancel_and_free() is mostly the same with t->prog NULL check
> as everywhere else

I haven't started detailed analysis of above proposal, but looks overly
complicated on the first glance. Not saying it's bad or good.
Just complexity and races are striking.

> 
> > There is no need to complicate bpf_timer with crazy refcnting schemes.
> > The user space can simply pin the program in bpffs. In the future we might
> > introduce a self-pinning helper that would pin the program and create a file.
> > Sort-of like syscall prog type would pin self.
> > That would be explicit and clean api instead of obscure hacks inside bpf_timer*().
> 
> Do I understand correctly that the alternative that you are proposing
> is to keep some linked list of all map_values across all maps in the
> system that have initialized bpf_hrtimer with that particular bpf_prog
> in them? And when bpf_prog is put to zero you'll go and destruct them
> all in a race-free way?
> 
> I have a bit of a hard time imagining how that will be implemented
> exactly, so I might be overcomplicating that in my mind. Will be happy
> to see the working code.

Here is working code...
Note how patch 1 is so much simpler without complicated refcnting.
And how patch 2 removes for_each_map_element that was necessary earlier.
Also note that link list approach is an optimization.
Instead of keeping a link list the bpf_free_used_timers() could call
a map specific op to iterate all elems and free timers with
timer->prog == prog_going_away.
That was my initial proposal couple month ago.
link_list is purely an optimization instead of for_each_map_elem.

[-- Attachment #2: 0001-bpf-Cancel-and-free-timers-when-prog-is-going-away.patch --]
[-- Type: text/plain, Size: 7318 bytes --]

From c11bf0aa23f1df25682056f2c581c9bc9bd8df31 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Wed, 16 Jun 2021 09:19:36 -0700
Subject: [PATCH bpf-next 1/2] bpf: Cancel and free timers when prog is going
 away.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h  |  3 ++
 kernel/bpf/core.c    |  3 ++
 kernel/bpf/helpers.c | 70 +++++++++++++++++++++++++-------------------
 3 files changed, 46 insertions(+), 30 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7c403235c7e8..f67ea2512844 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -245,6 +245,7 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
 void bpf_timer_cancel_and_free(void *timer);
+void bpf_free_used_timers(struct bpf_prog_aux *aux);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
@@ -871,6 +872,8 @@ struct bpf_prog_aux {
 	u32 size_poke_tab;
 	struct bpf_ksym ksym;
 	const struct bpf_prog_ops *ops;
+	spinlock_t timers_lock;
+	struct hlist_head used_timers;
 	struct bpf_map **used_maps;
 	struct mutex used_maps_mutex; /* mutex for used_maps and used_map_cnt */
 	struct btf_mod_pair *used_btfs;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5e31ee9f7512..aa7960986a75 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -104,6 +104,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	fp->jit_requested = ebpf_jit_enabled();
 
 	INIT_LIST_HEAD_RCU(&fp->aux->ksym.lnode);
+	INIT_HLIST_HEAD(&fp->aux->used_timers);
+	spin_lock_init(&fp->aux->timers_lock);
 	mutex_init(&fp->aux->used_maps_mutex);
 	mutex_init(&fp->aux->dst_mutex);
 
@@ -2201,6 +2203,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 	int i;
 
 	aux = container_of(work, struct bpf_prog_aux, work);
+	bpf_free_used_timers(aux);
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
 	if (bpf_prog_is_dev_bound(aux))
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b8df592c33cc..08f5d0f73f68 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -987,6 +987,7 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 
 struct bpf_hrtimer {
 	struct hrtimer timer;
+	struct hlist_node hlist;
 	struct bpf_map *map;
 	struct bpf_prog *prog;
 	void *callback_fn;
@@ -1004,7 +1005,6 @@ static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
 {
 	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
-	struct bpf_prog *prog = t->prog;
 	struct bpf_map *map = t->map;
 	void *key;
 	u32 idx;
@@ -1031,16 +1031,6 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
 					    (u64)(long)t->value, 0, 0);
 	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
 
-	/* The bpf function finished executed. Drop the prog refcnt.
-	 * It could reach zero here and trigger free of bpf_prog
-	 * and subsequent free of the maps that were holding timers.
-	 * If callback_fn called bpf_timer_start on this timer
-	 * the prog refcnt will be > 0.
-	 *
-	 * If callback_fn deleted map element the 't' could have been freed,
-	 * hence t->prog deref is done earlier.
-	 */
-	bpf_prog_put(prog);
 	this_cpu_write(hrtimer_running, NULL);
 	return HRTIMER_NORESTART;
 }
@@ -1077,6 +1067,10 @@ BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flag
 	t->prog = prog;
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
 	t->timer.function = bpf_timer_cb;
+	INIT_HLIST_NODE(&t->hlist);
+	spin_lock(&prog->aux->timers_lock);
+	hlist_add_head_rcu(&t->hlist, &prog->aux->used_timers);
+	spin_unlock(&prog->aux->timers_lock);
 	timer->timer = t;
 out:
 	____bpf_spin_unlock(&timer->lock);
@@ -1103,12 +1097,6 @@ BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
 		ret = -EINVAL;
 		goto out;
 	}
-	if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
-		/* If the timer wasn't active or callback already executing
-		 * bump the prog refcnt to keep it alive until
-		 * callback is invoked (again).
-		 */
-		bpf_prog_inc(t->prog);
 	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
 out:
 	____bpf_spin_unlock(&timer->lock);
@@ -1145,13 +1133,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
 	/* Cancel the timer and wait for associated callback to finish
 	 * if it was running.
 	 */
-	if (hrtimer_cancel(&t->timer) == 1) {
-		/* If the timer was active then drop the prog refcnt,
-		 * since callback will not be invoked.
-		 */
-		bpf_prog_put(t->prog);
-		ret = 1;
-	}
+	ret = hrtimer_cancel(&t->timer);
 out:
 	____bpf_spin_unlock(&timer->lock);
 	return ret;
@@ -1164,8 +1146,10 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
 	.arg1_type	= ARG_PTR_TO_TIMER,
 };
 
-/* This function is called by delete_element in htab and lru maps
- * and by map_free for array, lru, htab maps.
+/* This function is called by map_delete/update_elem for individual
+ * element and by bpf_free_used_timers when prog is going away.
+ * When map is destroyed by ops->map_free all bpf_timers in there
+ * are freed.
  */
 void bpf_timer_cancel_and_free(void *val)
 {
@@ -1177,7 +1161,7 @@ void bpf_timer_cancel_and_free(void *val)
 	if (!t)
 		goto out;
 	/* Cancel the timer and wait for callback to complete if it was
-	 * running. Only individual delete_element in htab or lru maps can
+	 * running. Only delete/update of individual element can
 	 * return 1 from hrtimer_cancel.
 	 * The whole map is destroyed when its refcnt reaches zero.
 	 * That happens after bpf prog refcnt reaches zero.
@@ -1197,15 +1181,41 @@ void bpf_timer_cancel_and_free(void *val)
 	 * In non-preallocated maps timer->timer = NULL will happen after
 	 * callback completes, since prog execution is an RCU critical section.
 	 */
-	if (this_cpu_read(hrtimer_running) != t &&
-	    hrtimer_cancel(&t->timer) == 1)
-		bpf_prog_put(t->prog);
+	if (this_cpu_read(hrtimer_running) != t)
+		hrtimer_cancel(&t->timer);
+
+	spin_lock(&t->prog->aux->timers_lock);
+	hlist_del_rcu(&t->hlist);
+	spin_unlock(&t->prog->aux->timers_lock);
+	t->prog = LIST_POISON1;
 	kfree(t);
 	timer->timer = NULL;
 out:
 	____bpf_spin_unlock(&timer->lock);
 }
 
+/* This function is called after prog->refcnt reaches zero.
+ * It's called before bpf_free_used_maps to clean up timers in maps
+ * if going away prog had callback_fn-s for them.
+ */
+void bpf_free_used_timers(struct bpf_prog_aux *aux)
+{
+	struct bpf_timer_kern *timer;
+	struct bpf_hrtimer *t;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	hlist_for_each_entry_safe(t, n, &aux->used_timers, hlist) {
+		timer = t->value + t->map->timer_off;
+		/* The map isn't going away. The 'timer' points into map
+		 * element that isn't going away either, but cancel_and_free
+		 * could be racing with parallel map_delete_elem.
+		 */
+		bpf_timer_cancel_and_free(timer);
+	}
+	rcu_read_unlock();
+}
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
-- 
2.30.2


[-- Attachment #3: 0002-bpf-Don-t-iterate-all-map-elements-anymore.patch --]
[-- Type: text/plain, Size: 1771 bytes --]

From 62d9bd33aac388c34e7fd3b411e0d40084d07f4b Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Wed, 16 Jun 2021 09:40:32 -0700
Subject: [PATCH bpf-next 2/2] bpf: Don't iterate all map elements anymore.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/arraymap.c |  7 -------
 kernel/bpf/hashtab.c  | 11 -----------
 2 files changed, 18 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 5c84ab7f8872..d82a6de65273 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -385,17 +385,10 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
 static void array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
-	int i;
 
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
-	if (unlikely(map_value_has_timer(map)))
-		for (i = 0; i < array->map.max_entries; i++)
-			bpf_timer_cancel_and_free(array->value +
-						  array->elem_size * i +
-						  map->timer_off);
-
 	if (array->map.map_flags & BPF_F_MMAPABLE)
 		bpf_map_area_free(array_map_vmalloc_addr(array));
 	else
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c885492d0a76..5e2736c46185 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -244,17 +244,6 @@ static void htab_free_elems(struct bpf_htab *htab)
 		cond_resched();
 	}
 free_elems:
-	if (unlikely(map_value_has_timer(&htab->map)))
-		for (i = 0; i < htab->map.max_entries; i++) {
-			struct htab_elem *elem;
-
-			elem = get_htab_elem(htab, i);
-			bpf_timer_cancel_and_free(elem->key +
-						  round_up(htab->map.key_size, 8) +
-						  htab->map.timer_off);
-			cond_resched();
-		}
-
 	bpf_map_area_free(htab->elems);
 }
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-06-16 16:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-11  4:24 [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
2021-06-11  4:24 ` [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
2021-06-11  6:42   ` Cong Wang
2021-06-11 18:45     ` Alexei Starovoitov
2021-06-15  6:10       ` Cong Wang
2021-06-16  4:53         ` Alexei Starovoitov
2021-06-11  7:05   ` Cong Wang
2021-06-11 22:12   ` Yonghong Song
2021-06-15  3:33     ` Alexei Starovoitov
2021-06-15  4:21       ` Yonghong Song
2021-06-14 16:51   ` Yonghong Song
2021-06-15  3:29     ` Alexei Starovoitov
2021-06-15  5:31       ` Andrii Nakryiko
2021-06-15  5:40         ` Alexei Starovoitov
2021-06-15 15:24           ` Andrii Nakryiko
2021-06-16  4:26             ` Alexei Starovoitov
2021-06-16  5:54               ` Andrii Nakryiko
2021-06-16 16:52                 ` Alexei Starovoitov
2021-06-15  4:48   ` Andrii Nakryiko
2021-06-11  4:24 ` [PATCH v2 bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
2021-06-11  4:24 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add bpf_timer test Alexei Starovoitov
2021-06-11  6:47 ` [PATCH v2 bpf-next 0/3] bpf: Introduce BPF timers Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.