* [PATCH bpf-next 0/3] bpf: Introduce BPF timers.
@ 2021-05-27  4:02 Alexei Starovoitov
  2021-05-27  4:02 ` [PATCH bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2021-05-27  4:02 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The 1st patch implements the interaction between bpf programs and the bpf core.
The 2nd patch implements the necessary safety checks.
The 3rd patch is the test.
The patchset is not ready to land, since it needs a lot more tests,
but it is in functionally solid shape. Please review.

Alexei Starovoitov (3):
  bpf: Introduce bpf_timer
  bpf: Add verifier checks for bpf_timer.
  selftests/bpf: Add bpf_timer test.

 include/linux/bpf.h                           |  37 +++-
 include/linux/btf.h                           |   1 +
 include/uapi/linux/bpf.h                      |  26 +++
 kernel/bpf/arraymap.c                         |   7 +
 kernel/bpf/btf.c                              |  77 +++++++--
 kernel/bpf/hashtab.c                          |  53 ++++--
 kernel/bpf/helpers.c                          | 160 ++++++++++++++++++
 kernel/bpf/local_storage.c                    |   4 +-
 kernel/bpf/syscall.c                          |  23 ++-
 kernel/bpf/verifier.c                         | 134 +++++++++++++++
 kernel/trace/bpf_trace.c                      |   2 +-
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  26 +++
 .../testing/selftests/bpf/prog_tests/timer.c  |  47 +++++
 tools/testing/selftests/bpf/progs/timer.c     |  85 ++++++++++
 15 files changed, 644 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c

-- 
2.30.2



* [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-05-27  4:02 [PATCH bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
@ 2021-05-27  4:02 ` Alexei Starovoitov
  2021-05-27 16:57   ` Toke Høiland-Jørgensen
  2021-06-02 22:08   ` Andrii Nakryiko
  2021-05-27  4:02 ` [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
  2021-05-27  4:02 ` [PATCH bpf-next 3/3] selftests/bpf: Add bpf_timer test Alexei Starovoitov
  2 siblings, 2 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2021-05-27  4:02 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce 'struct bpf_timer { __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field, and helpers to operate on it:
long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags);
long bpf_timer_start(struct bpf_timer *timer, u64 nsecs);
long bpf_timer_cancel(struct bpf_timer *timer);

Here is how a BPF program might look:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

struct bpf_timer global_timer;

static int timer_cb1(void *map, int *key, __u64 *data);
/* global_timer is in bss, which is a special bpf array of one element.
 * data points to the beginning of bss.
 */

static int timer_cb2(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;
    bpf_timer_init(&global_timer, timer_cb1, 0);
    bpf_timer_start(&global_timer, 0 /* call timer_cb1 asap */);

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, timer_cb2, 0);
        bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
    }
    return 0;
}
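
The callback bodies are not spelled out above; a minimal sketch of what
they could look like (the exact bodies are up to the program) is:

static int timer_cb1(void *map, int *key, __u64 *data)
{
    /* data points to the beginning of bss */
    bpf_timer_start(&global_timer, 1000000000 /* run again in ~1 sec */);
    return 0;
}

static int timer_cb2(void *map, int *key, struct map_elem *val)
{
    /* val is the hash map element that embeds the timer */
    val->counter++;
    return 0;
}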

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patch adds necessary safety checks.

Only programs with CAP_BPF are allowed to use bpf_timer.

The number of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper receives hidden 'map' and 'prog' arguments
supplied by the verifier. The prog pointer is needed to take a reference on
the bpf program and make sure it doesn't get freed while a timer is armed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |  26 ++++++
 kernel/bpf/helpers.c           | 160 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 110 +++++++++++++++++++++++
 kernel/trace/bpf_trace.c       |   2 +-
 scripts/bpf_doc.py             |   2 +
 tools/include/uapi/linux/bpf.h |  26 ++++++
 7 files changed, 326 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1e9a0ff3217b..925b8416ea0a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -314,6 +314,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
 	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
+	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
 	__BPF_ARG_TYPE_MAX,
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 562adeac1d67..3da901d5076b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4774,6 +4774,25 @@ union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
+ *	Description
+ *		Initialize the timer to call given static function.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
+ *	Description
+ *		Start the timer and set its expiration N nanoseconds from
+ *		the current time.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Deactivate the timer.
+ *	Return
+ *		zero
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4945,6 +4964,9 @@ union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6051,6 +6073,10 @@ struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 544773970dbc..6f9620cbe95d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -985,6 +985,160 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+struct bpf_hrtimer {
+	struct hrtimer timer;
+	spinlock_t lock;
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *callback_fn;
+	void *key;
+	void *value;
+};
+
+/* the actual struct hidden inside uapi struct bpf_timer */
+struct bpf_timer_kern {
+	struct bpf_hrtimer *timer;
+};
+
+static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
+
+static enum hrtimer_restart timer_cb(struct hrtimer *timer)
+{
+	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
+	unsigned long flags;
+	int ret;
+
+	/* timer_cb() runs in hrtimer_run_softirq and doesn't migrate.
+	 * Remember the timer this callback is servicing to prevent
+	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
+	 */
+	this_cpu_write(hrtimer_running, t);
+	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)t->map,
+					    (u64)(long)t->key,
+					    (u64)(long)t->value, 0, 0);
+	WARN_ON(ret != 0); /* todo: define 0 vs 1 or disallow 1 in the verifier */
+	spin_lock_irqsave(&t->lock, flags);
+	if (!hrtimer_is_queued(timer))
+		bpf_prog_put(t->prog);
+	spin_unlock_irqrestore(&t->lock, flags);
+	this_cpu_write(hrtimer_running, NULL);
+	return HRTIMER_NORESTART;
+}
+
+BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
+	   struct bpf_map *, map, struct bpf_prog *, prog)
+{
+	struct bpf_hrtimer *t;
+
+	if (flags)
+		return -EINVAL;
+	if (READ_ONCE(timer->timer))
+		return -EBUSY;
+	/* allocate hrtimer via map_kmalloc to use memcg accounting */
+	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
+	if (!t)
+		return -ENOMEM;
+	t->callback_fn = cb;
+	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
+	t->key = t->value - round_up(map->key_size, 8);
+	t->map = map;
+	t->prog = prog;
+	spin_lock_init(&t->lock);
+	hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
+	t->timer.function = timer_cb;
+	if (cmpxchg(&timer->timer, NULL, t)) {
+		/* Parallel bpf_timer_init() calls raced. */
+		kfree(t);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_timer_init_proto = {
+	.func		= bpf_timer_init,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
+{
+	struct bpf_hrtimer *t;
+	unsigned long flags;
+
+	t = READ_ONCE(timer->timer);
+	if (!t)
+		return -EINVAL;
+	spin_lock_irqsave(&t->lock, flags);
+	/* Keep the prog alive until callback is invoked */
+	if (!hrtimer_active(&t->timer))
+		bpf_prog_inc(t->prog);
+	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
+	spin_unlock_irqrestore(&t->lock, flags);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_timer_start_proto = {
+	.func		= bpf_timer_start,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
+{
+	struct bpf_hrtimer *t;
+	unsigned long flags;
+
+	t = READ_ONCE(timer->timer);
+	if (!t)
+		return -EINVAL;
+	if (this_cpu_read(hrtimer_running) == t)
+		/* If bpf callback_fn is trying to bpf_timer_cancel()
+		 * its own timer the hrtimer_cancel() will deadlock
+		 * since it waits for callback_fn to finish
+		 */
+		return -EBUSY;
+	spin_lock_irqsave(&t->lock, flags);
+	/* Cancel the timer and wait for associated callback to finish
+	 * if it was running.
+	 */
+	if (hrtimer_cancel(&t->timer) == 1)
+		/* If the timer was active then drop the prog refcnt,
+		 * since callback will not be invoked.
+		 */
+		bpf_prog_put(t->prog);
+	spin_unlock_irqrestore(&t->lock, flags);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_timer_cancel_proto = {
+	.func		= bpf_timer_cancel,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+};
+
+void bpf_timer_cancel_and_free(void *val)
+{
+	struct bpf_timer_kern *timer = val;
+	struct bpf_hrtimer *t;
+
+	t = READ_ONCE(timer->timer);
+	if (!t)
+		return;
+	/* Cancel the timer and wait for callback to complete
+	 * if it was running
+	 */
+	if (hrtimer_cancel(&t->timer) == 1)
+		bpf_prog_put(t->prog);
+	kfree(t);
+	WRITE_ONCE(timer->timer, NULL);
+}
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
@@ -1051,6 +1205,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_per_cpu_ptr_proto;
 	case BPF_FUNC_this_cpu_ptr:
 		return &bpf_this_cpu_ptr_proto;
+	case BPF_FUNC_timer_init:
+		return &bpf_timer_init_proto;
+	case BPF_FUNC_timer_start:
+		return &bpf_timer_start_proto;
+	case BPF_FUNC_timer_cancel:
+		return &bpf_timer_cancel_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1de4b8c6ee42..f386f85aee5c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4656,6 +4656,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_timer_func(struct bpf_verifier_env *env, int regno,
+			      struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (val) {
+		/* todo: relax this requirement */
+		verbose(env, "bpf_timer field can only be first in the map value element\n");
+		return -EINVAL;
+	}
+	WARN_ON(meta->map_ptr);
+	meta->map_ptr = map;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -4788,6 +4817,7 @@ static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
 static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -4819,6 +4849,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
 	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
+	[ARG_PTR_TO_TIMER]		= &timer_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -5000,6 +5031,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "verifier internal error\n");
 			return -EFAULT;
 		}
+	} else if (arg_type == ARG_PTR_TO_TIMER) {
+		if (process_timer_func(env, regno, meta))
+			return -EACCES;
 	} else if (arg_type == ARG_PTR_TO_FUNC) {
 		meta->subprogno = reg->subprogno;
 	} else if (arg_type_is_mem_ptr(arg_type)) {
@@ -5742,6 +5776,43 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_timer_init_callback_state(struct bpf_verifier_env *env,
+					 struct bpf_func_state *caller,
+					 struct bpf_func_state *callee,
+					 int insn_idx)
+{
+	struct bpf_insn_aux_data *insn_aux = &env->insn_aux_data[insn_idx];
+	struct bpf_map *map_ptr;
+
+	if (bpf_map_ptr_poisoned(insn_aux)) {
+		verbose(env, "bpf_timer_init abusing map_ptr\n");
+		return -EINVAL;
+	}
+
+	map_ptr = BPF_MAP_PTR(insn_aux->map_ptr_state);
+
+	/* bpf_timer_init(struct bpf_timer *timer, void *callback_fn, u64 flags);
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
@@ -5837,6 +5908,7 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	    func_id != BPF_FUNC_map_pop_elem &&
 	    func_id != BPF_FUNC_map_peek_elem &&
 	    func_id != BPF_FUNC_for_each_map_elem &&
+	    func_id != BPF_FUNC_timer_init &&
 	    func_id != BPF_FUNC_redirect_map)
 		return 0;
 
@@ -6069,6 +6141,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_timer_init) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_timer_init_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_snprintf) {
 		err = check_bpf_snprintf_call(env, regs);
 		if (err < 0)
@@ -12526,6 +12605,37 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			insn      = new_prog->insnsi + i + delta;
 			continue;
 		}
+		if (insn->imm == BPF_FUNC_timer_init) {
+
+			aux = &env->insn_aux_data[i + delta];
+			if (bpf_map_ptr_poisoned(aux)) {
+				verbose(env, "bpf_timer_init abusing map_ptr\n");
+				return -EINVAL;
+			}
+			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
+			{
+				struct bpf_insn ld_addrs[4] = {
+					BPF_LD_IMM64(BPF_REG_4, (long)map_ptr),
+					BPF_LD_IMM64(BPF_REG_5, (long)prog),
+				};
+
+				insn_buf[0] = ld_addrs[0];
+				insn_buf[1] = ld_addrs[1];
+				insn_buf[2] = ld_addrs[2];
+				insn_buf[3] = ld_addrs[3];
+			}
+			insn_buf[4] = *insn;
+			cnt = 5;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
 
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..453a46c2d732 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1065,7 +1065,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_snprintf:
 		return &bpf_snprintf_proto;
 	default:
-		return NULL;
+		return bpf_base_func_proto(func_id);
 	}
 }
 
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 2d94025b38e9..00ac7b79cddb 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -547,6 +547,7 @@ COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     ]
     known_types = {
             '...',
@@ -594,6 +595,7 @@ COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 562adeac1d67..3da901d5076b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4774,6 +4774,25 @@ union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
+ *	Description
+ *		Initialize the timer to call given static function.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
+ *	Description
+ *		Start the timer and set its expiration N nanoseconds from
+ *		the current time.
+ *	Return
+ *		zero
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Deactivate the timer.
+ *	Return
+ *		zero
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4945,6 +4964,9 @@ union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6051,6 +6073,10 @@ struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
-- 
2.30.2



* [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer.
  2021-05-27  4:02 [PATCH bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
  2021-05-27  4:02 ` [PATCH bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
@ 2021-05-27  4:02 ` Alexei Starovoitov
  2021-06-02 22:34   ` Andrii Nakryiko
  2021-05-27  4:02 ` [PATCH bpf-next 3/3] selftests/bpf: Add bpf_timer test Alexei Starovoitov
  2 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2021-05-27  4:02 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add appropriate safety checks for bpf_timer:
- restrict to array, hash, lru maps; per-cpu maps cannot be supported.
- kfree bpf_timer during map_delete_elem and map_free.
- verifier btf checks.
- safe interaction with lookup/update/delete operations and iterator.
- relax the first-field-only requirement of the previous patch (sketch below).
- allow bpf_timer in global data and search for it in datasec.
- check BPF_F_RDONLY_PROG and frozen flags.
- mmap is allowed, otherwise a global timer would not be possible.
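
For example (a sketch, not part of the diffs below), with these checks a
program can embed the timer at any offset in the map value and must go
through the helpers to touch it:

struct map_elem {
    int counter;
    struct bpf_timer timer; /* no longer required to be the first field */
};

/* ok: pass a pointer to the field to the timer helpers */
bpf_timer_init(&val->timer, timer_cb, 0);

/* rejected: a direct load/store of the field makes the verifier report
 * "bpf_timer cannot be accessed directly by load/store"
 */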

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        | 36 +++++++++++++-----
 include/linux/btf.h        |  1 +
 kernel/bpf/arraymap.c      |  7 ++++
 kernel/bpf/btf.c           | 77 +++++++++++++++++++++++++++++++-------
 kernel/bpf/hashtab.c       | 53 ++++++++++++++++++++------
 kernel/bpf/helpers.c       |  2 +-
 kernel/bpf/local_storage.c |  4 +-
 kernel/bpf/syscall.c       | 23 ++++++++++--
 kernel/bpf/verifier.c      | 30 +++++++++++++--
 9 files changed, 190 insertions(+), 43 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 925b8416ea0a..7deff4438808 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -168,6 +168,7 @@ struct bpf_map {
 	u32 max_entries;
 	u32 map_flags;
 	int spin_lock_off; /* >=0 valid offset, <0 error */
+	int timer_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
@@ -197,24 +198,41 @@ static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 	return map->spin_lock_off >= 0;
 }
 
-static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
+static inline bool map_value_has_timer(const struct bpf_map *map)
 {
-	if (likely(!map_value_has_spin_lock(map)))
-		return;
-	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
-		(struct bpf_spin_lock){};
+	return map->timer_off >= 0;
+}
+
+void bpf_timer_cancel_and_free(void *timer);
+
+static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
+{
+	if (unlikely(map_value_has_spin_lock(map)))
+		*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
+			(struct bpf_spin_lock){};
+	if (unlikely(map_value_has_timer(map)))
+		*(struct bpf_timer *)(dst + map->timer_off) =
+			(struct bpf_timer){};
 }
 
 /* copy everything but bpf_spin_lock */
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
+	u32 off = 0, size = 0;
+
 	if (unlikely(map_value_has_spin_lock(map))) {
-		u32 off = map->spin_lock_off;
+		off = map->spin_lock_off;
+		size = sizeof(struct bpf_spin_lock);
+	} else if (unlikely(map_value_has_timer(map))) {
+		off = map->timer_off;
+		size = sizeof(struct bpf_timer);
+	}
 
+	if (unlikely(size)) {
 		memcpy(dst, src, off);
-		memcpy(dst + off + sizeof(struct bpf_spin_lock),
-		       src + off + sizeof(struct bpf_spin_lock),
-		       map->value_size - off - sizeof(struct bpf_spin_lock));
+		memcpy(dst + off + size,
+		       src + off + size,
+		       map->value_size - off - size);
 	} else {
 		memcpy(dst, src, map->value_size);
 	}
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 94a0c976c90f..214fde93214b 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -99,6 +99,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
+int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 3c4105603f9d..3fedd9209770 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -378,10 +378,17 @@ static void *array_map_vmalloc_addr(struct bpf_array *array)
 static void array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	int i;
 
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
+	if (unlikely(map_value_has_timer(map)))
+		for (i = 0; i < array->map.max_entries; i++)
+			bpf_timer_cancel_and_free(array->value +
+						  array->elem_size * i +
+						  map->timer_off);
+
 	if (array->map.map_flags & BPF_F_MMAPABLE)
 		bpf_map_area_free(array_map_vmalloc_addr(array));
 	else
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a6e39c5ea0bf..28a8014b8379 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3046,43 +3046,92 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
-/* find 'struct bpf_spin_lock' in map value.
- * return >= 0 offset if found
- * and < 0 in case of error
- */
-int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
+				 const char *name, int sz, int align)
 {
 	const struct btf_member *member;
 	u32 i, off = -ENOENT;
 
-	if (!__btf_type_is_struct(t))
-		return -EINVAL;
-
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
 								    member->type);
 		if (!__btf_type_is_struct(member_type))
 			continue;
-		if (member_type->size != sizeof(struct bpf_spin_lock))
+		if (member_type->size != sz)
 			continue;
-		if (strcmp(__btf_name_by_offset(btf, member_type->name_off),
-			   "bpf_spin_lock"))
+		if (strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
 			continue;
 		if (off != -ENOENT)
-			/* only one 'struct bpf_spin_lock' is allowed */
+			/* only one such field is allowed */
 			return -E2BIG;
 		off = btf_member_bit_offset(t, member);
 		if (off % 8)
 			/* valid C code cannot generate such BTF */
 			return -EINVAL;
 		off /= 8;
-		if (off % __alignof__(struct bpf_spin_lock))
-			/* valid struct bpf_spin_lock will be 4 byte aligned */
+		if (off % align)
+			return -EINVAL;
+	}
+	return off;
+}
+
+static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
+				const char *name, int sz, int align)
+{
+	const struct btf_var_secinfo *vsi;
+	u32 i, off = -ENOENT;
+
+	for_each_vsi(i, t, vsi) {
+		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
+		const struct btf_type *var_type = btf_type_by_id(btf, var->type);
+
+		if (!__btf_type_is_struct(var_type))
+			continue;
+		if (var_type->size != sz)
+			continue;
+		if (vsi->size != sz)
+			continue;
+		if (strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
+			continue;
+		if (off != -ENOENT)
+			/* only one such field is allowed */
+			return -E2BIG;
+		off = vsi->offset;
+		if (off % align)
 			return -EINVAL;
 	}
 	return off;
 }
 
+static int btf_find_field(const struct btf *btf, const struct btf_type *t,
+			  const char *name, int sz, int align)
+{
+
+	if (__btf_type_is_struct(t))
+		return btf_find_struct_field(btf, t, name, sz, align);
+	else if (btf_type_is_datasec(t))
+		return btf_find_datasec_var(btf, t, name, sz, align);
+	return -EINVAL;
+}
+
+/* find 'struct bpf_spin_lock' in map value.
+ * return >= 0 offset if found
+ * and < 0 in case of error
+ */
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+{
+	return btf_find_field(btf, t, "bpf_spin_lock",
+			      sizeof(struct bpf_spin_lock),
+			      __alignof__(struct bpf_spin_lock));
+}
+
+int btf_find_timer(const struct btf *btf, const struct btf_type *t)
+{
+	return btf_find_field(btf, t, "bpf_timer",
+			      sizeof(struct bpf_timer),
+			      __alignof__(struct bpf_timer));
+}
+
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
 			      u32 type_id, void *data, u8 bits_offset,
 			      struct btf_show *show)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6f6681b07364..28d66fa74780 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -244,6 +244,17 @@ static void htab_free_elems(struct bpf_htab *htab)
 		cond_resched();
 	}
 free_elems:
+	if (unlikely(map_value_has_timer(&htab->map)))
+		for (i = 0; i < htab->map.max_entries; i++) {
+			struct htab_elem *elem;
+
+			elem = get_htab_elem(htab, i);
+			bpf_timer_cancel_and_free(elem->key +
+						  round_up(htab->map.key_size, 8) +
+						  htab->map.timer_off);
+			cond_resched();
+		}
+
 	bpf_map_area_free(htab->elems);
 }
 
@@ -265,8 +276,11 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
 	struct htab_elem *l;
 
 	if (node) {
+		u32 key_size = htab->map.key_size;
 		l = container_of(node, struct htab_elem, lru_node);
-		memcpy(l->key, key, htab->map.key_size);
+		memcpy(l->key, key, key_size);
+		check_and_init_map_value(&htab->map,
+					 l->key + round_up(key_size, 8));
 		return l;
 	}
 
@@ -785,10 +799,19 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	return -ENOENT;
 }
 
+static void check_and_free_timer(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	if (unlikely(map_value_has_timer(&htab->map)))
+		bpf_timer_cancel_and_free(elem->key +
+					  round_up(htab->map.key_size, 8) +
+					  htab->map.timer_off);
+}
+
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
+	check_and_free_timer(htab, l);
 	kfree(l);
 }
 
@@ -816,6 +839,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	htab_put_fd_value(htab, l);
 
 	if (htab_is_prealloc(htab)) {
+		check_and_free_timer(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		atomic_dec(&htab->count);
@@ -919,8 +943,8 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
 		}
-		check_and_init_map_lock(&htab->map,
-					l_new->key + round_up(key_size, 8));
+		check_and_init_map_value(&htab->map,
+					 l_new->key + round_up(key_size, 8));
 	}
 
 	memcpy(l_new->key, key, key_size);
@@ -1067,6 +1091,12 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	return ret;
 }
 
+static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	check_and_free_timer(htab, elem);
+	bpf_lru_push_free(&htab->lru, &elem->lru_node);
+}
+
 static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 				    u64 map_flags)
 {
@@ -1099,7 +1129,8 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	l_new = prealloc_lru_pop(htab, key, hash);
 	if (!l_new)
 		return -ENOMEM;
-	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
+	copy_map_value(&htab->map,
+		       l_new->key + round_up(map->key_size, 8), value);
 
 	ret = htab_lock_bucket(htab, b, hash, &flags);
 	if (ret)
@@ -1125,9 +1156,9 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	htab_unlock_bucket(htab, b, hash, flags);
 
 	if (ret)
-		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
+		htab_lru_push_free(htab, l_new);
 	else if (l_old)
-		bpf_lru_push_free(&htab->lru, &l_old->lru_node);
+		htab_lru_push_free(htab, l_old);
 
 	return ret;
 }
@@ -1332,7 +1363,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	htab_unlock_bucket(htab, b, hash, flags);
 	if (l)
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 	return ret;
 }
 
@@ -1449,7 +1480,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 			else
 				copy_map_value(map, value, l->key +
 					       roundup_key_size);
-			check_and_init_map_lock(map, value);
+			check_and_init_map_value(map, value);
 		}
 
 		hlist_nulls_del_rcu(&l->hash_node);
@@ -1460,7 +1491,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 	htab_unlock_bucket(htab, b, hash, bflags);
 
 	if (is_lru_map && l)
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 
 	return ret;
 }
@@ -1638,7 +1669,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 						      true);
 			else
 				copy_map_value(map, dst_val, value);
-			check_and_init_map_lock(map, dst_val);
+			check_and_init_map_value(map, dst_val);
 		}
 		if (do_delete) {
 			hlist_nulls_del_rcu(&l->hash_node);
@@ -1665,7 +1696,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	while (node_to_free) {
 		l = node_to_free;
 		node_to_free = node_to_free->batch_flink;
-		bpf_lru_push_free(&htab->lru, &l->lru_node);
+		htab_lru_push_free(htab, l);
 	}
 
 next_batch:
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 6f9620cbe95d..8580b7bfc8bb 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1039,7 +1039,7 @@ BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flag
 	if (!t)
 		return -ENOMEM;
 	t->callback_fn = cb;
-	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
+	t->value = (void *)timer - map->timer_off;
 	t->key = t->value - round_up(map->key_size, 8);
 	t->map = map;
 	t->prog = prog;
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index bd11db9774c3..95d70a08325d 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -173,7 +173,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key,
 		return -ENOMEM;
 
 	memcpy(&new->data[0], value, map->value_size);
-	check_and_init_map_lock(map, new->data);
+	check_and_init_map_value(map, new->data);
 
 	new = xchg(&storage->buf, new);
 	kfree_rcu(new, rcu);
@@ -509,7 +509,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
 						    map->numa_node);
 		if (!storage->buf)
 			goto enomem;
-		check_and_init_map_lock(map, storage->buf->data);
+		check_and_init_map_value(map, storage->buf->data);
 	} else {
 		storage->percpu_buf = bpf_map_alloc_percpu(map, size, 8, gfp);
 		if (!storage->percpu_buf)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 50457019da27..d49ba04d549e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -259,8 +259,8 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 				copy_map_value_locked(map, value, ptr, true);
 			else
 				copy_map_value(map, value, ptr);
-			/* mask lock, since value wasn't zero inited */
-			check_and_init_map_lock(map, value);
+			/* mask lock and timer, since value wasn't zero inited */
+			check_and_init_map_value(map, value);
 		}
 		rcu_read_unlock();
 	}
@@ -792,6 +792,21 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		}
 	}
 
+	map->timer_off = btf_find_timer(btf, value_type);
+	if (map_value_has_timer(map)) {
+		if (map->map_flags & BPF_F_RDONLY_PROG)
+			return -EACCES;
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY)
+			return -ENOTSUPP;
+		if (map_value_has_spin_lock(map))
+			/* map values with bpf_spin_lock and bpf_timer
+			 * are not supported yet.
+			 */
+			return -EOPNOTSUPP;
+	}
+
 	if (map->ops->map_check_btf)
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
 
@@ -843,6 +858,7 @@ static int map_create(union bpf_attr *attr)
 	mutex_init(&map->freeze_mutex);
 
 	map->spin_lock_off = -EINVAL;
+	map->timer_off = -EINVAL;
 	if (attr->btf_key_type_id || attr->btf_value_type_id ||
 	    /* Even the map's value is a kernel's struct,
 	     * the bpf_prog.o must have BTF to begin with
@@ -1590,7 +1606,8 @@ static int map_freeze(const union bpf_attr *attr)
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
-	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
+	    map_value_has_timer(map)) {
 		fdput(f);
 		return -ENOTSUPP;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f386f85aee5c..0a828dc4968e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 	}
+	if (map_value_has_timer(map)) {
+		u32 t = map->timer_off;
+
+		if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
+		     t < reg->umax_value + off + size) {
+			verbose(env, "bpf_timer cannot be accessed directly by load/store\n");
+			return -EACCES;
+		}
+	}
 	return err;
 }
 
@@ -4675,9 +4684,24 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 			map->name);
 		return -EINVAL;
 	}
-	if (val) {
-		/* todo: relax this requirement */
-		verbose(env, "bpf_timer field can only be first in the map value element\n");
+	if (!map_value_has_timer(map)) {
+		if (map->timer_off == -E2BIG)
+			verbose(env,
+				"map '%s' has more than one 'struct bpf_timer'\n",
+				map->name);
+		else if (map->timer_off == -ENOENT)
+			verbose(env,
+				"map '%s' doesn't have 'struct bpf_timer'\n",
+				map->name);
+		else
+			verbose(env,
+				"map '%s' is not a struct type or bpf_timer is mangled\n",
+				map->name);
+		return -EINVAL;
+	}
+	if (map->timer_off != val + reg->off) {
+		verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
+			val + reg->off, map->timer_off);
 		return -EINVAL;
 	}
 	WARN_ON(meta->map_ptr);
-- 
2.30.2



* [PATCH bpf-next 3/3] selftests/bpf: Add bpf_timer test.
  2021-05-27  4:02 [PATCH bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
  2021-05-27  4:02 ` [PATCH bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
  2021-05-27  4:02 ` [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
@ 2021-05-27  4:02 ` Alexei Starovoitov
  2 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2021-05-27  4:02 UTC (permalink / raw)
  To: davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add a bpf_timer test that creates two timers: one in a hash map and another
as a global timer in bss. It lets the global timer expire once and then
re-arms it for ~35 seconds. It then arms and re-arms the hash map timer 10
times and on the last invocation cancels the global timer.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/timer.c  | 47 ++++++++++
 tools/testing/selftests/bpf/progs/timer.c     | 85 +++++++++++++++++++
 2 files changed, 132 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
new file mode 100644
index 000000000000..7be2aeba2dad
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "timer.skel.h"
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	__u32 duration = 0, retval;
+
+	err = timer__attach(timer_skel);
+	if (!ASSERT_OK(err, "timer_attach"))
+		return err;
+
+	ASSERT_EQ(timer_skel->data->callback_check, 52, "callback_check1");
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test1);
+	err = bpf_prog_test_run(prog_fd, 1, NULL, 0,
+				NULL, NULL, &retval, &duration);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(retval, 0, "test_run");
+	timer__detach(timer_skel);
+
+	usleep(50 * 1000); /* 10 msecs should be enough, but give it extra */
+	/* check that timer_cb2() was executed 10 times */
+	ASSERT_EQ(timer_skel->data->callback_check, 42, "callback_check2");
+
+	/* check that timer_cb1() was executed once */
+	ASSERT_EQ(timer_skel->bss->bss_data, 15, "bss_data");
+
+	return 0;
+}
+
+void test_timer(void)
+{
+	struct timer *timer_skel = NULL;
+	int err;
+
+	timer_skel = timer__open_and_load();
+	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
+		goto cleanup;
+
+	err = timer(timer_skel);
+	ASSERT_OK(err, "timer");
+cleanup:
+	timer__destroy(timer_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
new file mode 100644
index 000000000000..d20672cf61d6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+struct map_elem {
+	int counter;
+	struct bpf_timer timer;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, struct map_elem);
+} hmap SEC(".maps");
+
+__u64 bss_data;
+struct bpf_timer global_timer;
+
+__u64 callback_check = 52;
+
+static int timer_cb1(void *map, int *key, __u64 *data)
+{
+	/* increment the same bss variable twice */
+	bss_data += 5;
+	data[0] += 10; /* &data[1] == &bss_data */
+	/* note data[1] access will be rejected by the verifier,
+	 * since &data[1] points to the &global_timer.
+	 */
+
+	/* rearm self to be called again in ~35 seconds */
+	bpf_timer_start(&global_timer, 1ull << 35);
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(test1, int a)
+{
+	bpf_timer_init(&global_timer, timer_cb1, 0);
+	bpf_timer_start(&global_timer, 0 /* call timer_cb1 asap */);
+	return 0;
+}
+
+static int timer_cb2(void *map, int *key, struct map_elem *val)
+{
+	callback_check--;
+	if (--val->counter)
+		/* re-arm the timer again to execute after 1 msec */
+		bpf_timer_start(&val->timer, 1000);
+	else {
+		/* cancel global_timer otherwise bpf_fentry_test1 prog
+		 * will stay alive forever.
+		 */
+		bpf_timer_cancel(&global_timer);
+		bpf_timer_cancel(&val->timer);
+	}
+	return 0;
+}
+
+int bpf_timer_test(void)
+{
+	struct map_elem *val;
+	int key = 0;
+
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (val) {
+		bpf_timer_init(&val->timer, timer_cb2, 0);
+		bpf_timer_start(&val->timer, 1000);
+	}
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test2")
+int BPF_PROG(test2, int a, int b)
+{
+	struct map_elem val = {};
+	int key = 0;
+
+	val.counter = 10; /* number of times to trigger timer_cb2 */
+	bpf_map_update_elem(&hmap, &key, &val, 0);
+	return bpf_timer_test();
+}
-- 
2.30.2



* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-05-27  4:02 ` [PATCH bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
@ 2021-05-27 16:57   ` Toke Høiland-Jørgensen
  2021-06-02  1:46     ` Alexei Starovoitov
  2021-06-02 22:08   ` Andrii Nakryiko
  1 sibling, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-05-27 16:57 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, andrii, netdev, bpf, kernel-team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce 'struct bpf_timer { __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
> long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags);
> long bpf_timer_start(struct bpf_timer *timer, u64 nsecs);
> long bpf_timer_cancel(struct bpf_timer *timer);
>
> Here is how BPF program might look like:
> struct map_elem {
>     int counter;
>     struct bpf_timer timer;
> };
>
> struct {
>     __uint(type, BPF_MAP_TYPE_HASH);
>     __uint(max_entries, 1000);
>     __type(key, int);
>     __type(value, struct map_elem);
> } hmap SEC(".maps");
>
> struct bpf_timer global_timer;
>
> static int timer_cb1(void *map, int *key, __u64 *data);
> /* global_timer is in bss which is special bpf array of one element.
>  * data points to beginning of bss.
>  */
>
> static int timer_cb2(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
>
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>     struct map_elem *val;
>     int key = 0;
>     bpf_timer_init(&global_timer, timer_cb1, 0);
>     bpf_timer_start(&global_timer, 0 /* call timer_cb1 asap */);
>
>     val = bpf_map_lookup_elem(&hmap, &key);
>     if (val) {
>         bpf_timer_init(&val->timer, timer_cb2, 0);
>         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);

nit: there are 1M nanoseconds in a millisecond :)
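
i.e. something like (untested):

    bpf_timer_start(&val->timer, 1000 * 1000 /* 1 msec in nsecs */);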

>     }
> }
>
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
>
> Only programs with CAP_BPF are allowed to use bpf_timer.
>
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
>
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Overall this LGTM, and I believe it will be usable for my intended use
case. One question:

With this, it will basically be possible to create a BPF daemon, won't
it? I.e., if a program includes a timer and the callback keeps re-arming
itself this will continue indefinitely even if userspace closes all refs
to maps and programs? Not saying this is a problem, just wanted to check
my understanding (i.e., that there's not some hidden requirement on
userspace keeping a ref open that I'm missing)...
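
I.e., the pattern I have in mind is roughly (just a sketch):

static int keepalive_cb(void *map, int *key, __u64 *data)
{
    /* re-arm so the callback keeps running with no userspace refs held */
    bpf_timer_start(&global_timer, 1000000000 /* 1 sec */);
    return 0;
}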

A few other nits below:

> ---
>  include/linux/bpf.h            |   1 +
>  include/uapi/linux/bpf.h       |  26 ++++++
>  kernel/bpf/helpers.c           | 160 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          | 110 +++++++++++++++++++++++
>  kernel/trace/bpf_trace.c       |   2 +-
>  scripts/bpf_doc.py             |   2 +
>  tools/include/uapi/linux/bpf.h |  26 ++++++
>  7 files changed, 326 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1e9a0ff3217b..925b8416ea0a 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -314,6 +314,7 @@ enum bpf_arg_type {
>  	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
>  	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
>  	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
> +	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
>  	__BPF_ARG_TYPE_MAX,
>  };
>  
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 562adeac1d67..3da901d5076b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4774,6 +4774,25 @@ union bpf_attr {
>   * 		Execute close syscall for given FD.
>   * 	Return
>   * 		A syscall result.
> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *	Description
> + *		Initialize the timer to call given static function.
> + *	Return
> + *		zero
> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *	Description
> + *		Start the timer and set its expiration N nanoseconds from
> + *		the current time.
> + *	Return
> + *		zero
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *	Description
> + *		Deactivate the timer.
> + *	Return
> + *		zero
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -4945,6 +4964,9 @@ union bpf_attr {
>  	FN(sys_bpf),			\
>  	FN(btf_find_by_name_kind),	\
>  	FN(sys_close),			\
> +	FN(timer_init),			\
> +	FN(timer_start),		\
> +	FN(timer_cancel),		\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6051,6 +6073,10 @@ struct bpf_spin_lock {
>  	__u32	val;
>  };
>  
> +struct bpf_timer {
> +	__u64 :64;
> +};
> +
>  struct bpf_sysctl {
>  	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
>  				 * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 544773970dbc..6f9620cbe95d 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -985,6 +985,160 @@ const struct bpf_func_proto bpf_snprintf_proto = {
>  	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
>  };
>  
> +struct bpf_hrtimer {
> +	struct hrtimer timer;
> +	spinlock_t lock;
> +	struct bpf_map *map;
> +	struct bpf_prog *prog;
> +	void *callback_fn;
> +	void *key;
> +	void *value;
> +};
> +
> +/* the actual struct hidden inside uapi struct bpf_timer */
> +struct bpf_timer_kern {
> +	struct bpf_hrtimer *timer;
> +};
> +
> +static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> +
> +static enum hrtimer_restart timer_cb(struct hrtimer *timer)
> +{
> +	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> +	unsigned long flags;
> +	int ret;
> +
> +	/* timer_cb() runs in hrtimer_run_softirq and doesn't migrate.
> +	 * Remember the timer this callback is servicing to prevent
> +	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +	 */
> +	this_cpu_write(hrtimer_running, t);
> +	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)t->map,
> +					    (u64)(long)t->key,
> +					    (u64)(long)t->value, 0, 0);
> +	WARN_ON(ret != 0); /* todo: define 0 vs 1 or disallow 1 in the verifier */
> +	spin_lock_irqsave(&t->lock, flags);
> +	if (!hrtimer_is_queued(timer))
> +		bpf_prog_put(t->prog);
> +	spin_unlock_irqrestore(&t->lock, flags);
> +	this_cpu_write(hrtimer_running, NULL);
> +	return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> +	   struct bpf_map *, map, struct bpf_prog *, prog)
> +{
> +	struct bpf_hrtimer *t;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (READ_ONCE(timer->timer))
> +		return -EBUSY;
> +	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> +	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> +	if (!t)
> +		return -ENOMEM;
> +	t->callback_fn = cb;
> +	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
> +	t->key = t->value - round_up(map->key_size, 8);

For array-maps won't this just point to somewhere inside the previous value?
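
I.e., for an array map something like this rough sketch is what the above
computes (the u32 index is never stored next to the value):

    value = array->value + array->elem_size * idx;
    key   = value - round_up(map->key_size, 8); /* 8 bytes before value */

so for idx > 0 'key' lands inside the previous element's value.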

> +	t->map = map;
> +	t->prog = prog;
> +	spin_lock_init(&t->lock);
> +	hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
> +	t->timer.function = timer_cb;
> +	if (cmpxchg(&timer->timer, NULL, t)) {
> +		/* Parallel bpf_timer_init() calls raced. */
> +		kfree(t);
> +		return -EBUSY;
> +	}
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_init_proto = {
> +	.func		= bpf_timer_init,
> +	.gpl_only	= false,

hrtimer_init() is EXPORT_SYMBOL_GPL, should this be as well? Same with
the others below.

> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_PTR_TO_FUNC,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> +{
> +	struct bpf_hrtimer *t;
> +	unsigned long flags;
> +
> +	t = READ_ONCE(timer->timer);
> +	if (!t)
> +		return -EINVAL;
> +	spin_lock_irqsave(&t->lock, flags);
> +	/* Keep the prog alive until callback is invoked */
> +	if (!hrtimer_active(&t->timer))
> +		bpf_prog_inc(t->prog);
> +	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> +	spin_unlock_irqrestore(&t->lock, flags);
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_start_proto = {
> +	.func		= bpf_timer_start,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> +{
> +	struct bpf_hrtimer *t;
> +	unsigned long flags;
> +
> +	t = READ_ONCE(timer->timer);
> +	if (!t)
> +		return -EINVAL;
> +	if (this_cpu_read(hrtimer_running) == t)
> +		/* If bpf callback_fn is trying to bpf_timer_cancel()
> +		 * its own timer the hrtimer_cancel() will deadlock
> +		 * since it waits for callback_fn to finish
> +		 */
> +		return -EBUSY;
> +	spin_lock_irqsave(&t->lock, flags);
> +	/* Cancel the timer and wait for associated callback to finish
> +	 * if it was running.
> +	 */
> +	if (hrtimer_cancel(&t->timer) == 1)
> +		/* If the timer was active then drop the prog refcnt,
> +		 * since callback will not be invoked.
> +		 */
> +		bpf_prog_put(t->prog);
> +	spin_unlock_irqrestore(&t->lock, flags);
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_cancel_proto = {
> +	.func		= bpf_timer_cancel,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +};
> +
> +void bpf_timer_cancel_and_free(void *val)
> +{
> +	struct bpf_timer_kern *timer = val;
> +	struct bpf_hrtimer *t;
> +
> +	t = READ_ONCE(timer->timer);
> +	if (!t)
> +		return;
> +	/* Cancel the timer and wait for callback to complete
> +	 * if it was running
> +	 */
> +	if (hrtimer_cancel(&t->timer) == 1)
> +		bpf_prog_put(t->prog);
> +	kfree(t);
> +	WRITE_ONCE(timer->timer, NULL);
> +}
> +
>  const struct bpf_func_proto bpf_get_current_task_proto __weak;
>  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
>  const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
> @@ -1051,6 +1205,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>  		return &bpf_per_cpu_ptr_proto;
>  	case BPF_FUNC_this_cpu_ptr:
>  		return &bpf_this_cpu_ptr_proto;
> +	case BPF_FUNC_timer_init:
> +		return &bpf_timer_init_proto;
> +	case BPF_FUNC_timer_start:
> +		return &bpf_timer_start_proto;
> +	case BPF_FUNC_timer_cancel:
> +		return &bpf_timer_cancel_proto;
>  	default:
>  		break;
>  	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1de4b8c6ee42..f386f85aee5c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4656,6 +4656,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
>  	return 0;
>  }
>  
> +static int process_timer_func(struct bpf_verifier_env *env, int regno,
> +			      struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +	bool is_const = tnum_is_const(reg->var_off);
> +	struct bpf_map *map = reg->map_ptr;
> +	u64 val = reg->var_off.value;
> +
> +	if (!is_const) {
> +		verbose(env,
> +			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
> +			regno);
> +		return -EINVAL;
> +	}
> +	if (!map->btf) {
> +		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
> +			map->name);
> +		return -EINVAL;
> +	}
> +	if (val) {
> +		/* todo: relax this requirement */
> +		verbose(env, "bpf_timer field can only be first in the map value element\n");
> +		return -EINVAL;
> +	}
> +	WARN_ON(meta->map_ptr);
> +	meta->map_ptr = map;
> +	return 0;
> +}
> +
>  static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
>  {
>  	return type == ARG_PTR_TO_MEM ||
> @@ -4788,6 +4817,7 @@ static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
>  static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
>  static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
>  static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
> +static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
>  
>  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>  	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
> @@ -4819,6 +4849,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>  	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
>  	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
>  	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
> +	[ARG_PTR_TO_TIMER]		= &timer_types,
>  };
>  
>  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> @@ -5000,6 +5031,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  			verbose(env, "verifier internal error\n");
>  			return -EFAULT;
>  		}
> +	} else if (arg_type == ARG_PTR_TO_TIMER) {
> +		if (process_timer_func(env, regno, meta))
> +			return -EACCES;
>  	} else if (arg_type == ARG_PTR_TO_FUNC) {
>  		meta->subprogno = reg->subprogno;
>  	} else if (arg_type_is_mem_ptr(arg_type)) {
> @@ -5742,6 +5776,43 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env,
>  	return 0;
>  }
>  
> +static int set_timer_init_callback_state(struct bpf_verifier_env *env,
> +					 struct bpf_func_state *caller,
> +					 struct bpf_func_state *callee,
> +					 int insn_idx)
> +{
> +	struct bpf_insn_aux_data *insn_aux = &env->insn_aux_data[insn_idx];
> +	struct bpf_map *map_ptr;
> +
> +	if (bpf_map_ptr_poisoned(insn_aux)) {
> +		verbose(env, "bpf_timer_init abusing map_ptr\n");
> +		return -EINVAL;
> +	}
> +
> +	map_ptr = BPF_MAP_PTR(insn_aux->map_ptr_state);
> +
> +	/* bpf_timer_init(struct bpf_timer *timer, void *callback_fn, u64 flags);
> +	 * callback_fn(struct bpf_map *map, void *key, void *value);
> +	 */
> +	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
> +	callee->regs[BPF_REG_1].map_ptr = map_ptr;
> +
> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
> +	callee->regs[BPF_REG_2].map_ptr = map_ptr;
> +
> +	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
> +	callee->regs[BPF_REG_3].map_ptr = map_ptr;
> +
> +	/* unused */
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
> +	callee->in_callback_fn = true;
> +	return 0;
> +}
> +
>  static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>  {
>  	struct bpf_verifier_state *state = env->cur_state;
> @@ -5837,6 +5908,7 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>  	    func_id != BPF_FUNC_map_pop_elem &&
>  	    func_id != BPF_FUNC_map_peek_elem &&
>  	    func_id != BPF_FUNC_for_each_map_elem &&
> +	    func_id != BPF_FUNC_timer_init &&
>  	    func_id != BPF_FUNC_redirect_map)
>  		return 0;
>  
> @@ -6069,6 +6141,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  			return -EINVAL;
>  	}
>  
> +	if (func_id == BPF_FUNC_timer_init) {
> +		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> +					set_timer_init_callback_state);
> +		if (err < 0)
> +			return -EINVAL;
> +	}
> +
>  	if (func_id == BPF_FUNC_snprintf) {
>  		err = check_bpf_snprintf_call(env, regs);
>  		if (err < 0)
> @@ -12526,6 +12605,37 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>  			insn      = new_prog->insnsi + i + delta;
>  			continue;
>  		}
> +		if (insn->imm == BPF_FUNC_timer_init) {
> +
> +			aux = &env->insn_aux_data[i + delta];
> +			if (bpf_map_ptr_poisoned(aux)) {
> +				verbose(env, "bpf_timer_init abusing map_ptr\n");
> +				return -EINVAL;
> +			}
> +			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> +			{
> +				struct bpf_insn ld_addrs[4] = {
> +					BPF_LD_IMM64(BPF_REG_4, (long)map_ptr),
> +					BPF_LD_IMM64(BPF_REG_5, (long)prog),
> +				};
> +
> +				insn_buf[0] = ld_addrs[0];
> +				insn_buf[1] = ld_addrs[1];
> +				insn_buf[2] = ld_addrs[2];
> +				insn_buf[3] = ld_addrs[3];
> +			}
> +			insn_buf[4] = *insn;
> +			cnt = 5;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta    += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn      = new_prog->insnsi + i + delta;
> +			goto patch_call_imm;
> +		}
>  
>  		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
>  		 * and other inlining handlers are currently limited to 64 bit
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d2d7cf6cfe83..453a46c2d732 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1065,7 +1065,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  	case BPF_FUNC_snprintf:
>  		return &bpf_snprintf_proto;
>  	default:
> -		return NULL;
> +		return bpf_base_func_proto(func_id);
>  	}
>  }
>  
> diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> index 2d94025b38e9..00ac7b79cddb 100755
> --- a/scripts/bpf_doc.py
> +++ b/scripts/bpf_doc.py
> @@ -547,6 +547,7 @@ COMMANDS
>              'struct inode',
>              'struct socket',
>              'struct file',
> +            'struct bpf_timer',
>      ]
>      known_types = {
>              '...',
> @@ -594,6 +595,7 @@ COMMANDS
>              'struct inode',
>              'struct socket',
>              'struct file',
> +            'struct bpf_timer',
>      }
>      mapped_types = {
>              'u8': '__u8',
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 562adeac1d67..3da901d5076b 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4774,6 +4774,25 @@ union bpf_attr {
>   * 		Execute close syscall for given FD.
>   * 	Return
>   * 		A syscall result.
> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *	Description
> + *		Initialize the timer to call given static function.
> + *	Return
> + *		zero
> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *	Description
> + *		Start the timer and set its expiration N nanoseconds from
> + *		the current time.
> + *	Return
> + *		zero
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *	Description
> + *		Deactivate the timer.
> + *	Return
> + *		zero
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -4945,6 +4964,9 @@ union bpf_attr {
>  	FN(sys_bpf),			\
>  	FN(btf_find_by_name_kind),	\
>  	FN(sys_close),			\
> +	FN(timer_init),			\
> +	FN(timer_start),		\
> +	FN(timer_cancel),		\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6051,6 +6073,10 @@ struct bpf_spin_lock {
>  	__u32	val;
>  };
>  
> +struct bpf_timer {
> +	__u64 :64;
> +};
> +
>  struct bpf_sysctl {
>  	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
>  				 * Allows 1,2,4-byte read, but no write.
> -- 
> 2.30.2


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-05-27 16:57   ` Toke Høiland-Jørgensen
@ 2021-06-02  1:46     ` Alexei Starovoitov
  2021-06-02 22:21       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2021-06-02  1:46 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: davem, daniel, andrii, netdev, bpf, kernel-team

On Thu, May 27, 2021 at 06:57:17PM +0200, Toke Høiland-Jørgensen wrote:
> >     if (val) {
> >         bpf_timer_init(&val->timer, timer_cb2, 0);
> >         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
> 
> nit: there are 1M nanoseconds in a millisecond :)

oops :)

> >     }
> > }
> >
> > This patch adds helper implementations that rely on hrtimers
> > to call bpf functions as timers expire.
> > The following patch adds necessary safety checks.
> >
> > Only programs with CAP_BPF are allowed to use bpf_timer.
> >
> > The amount of timers used by the program is constrained by
> > the memcg recorded at map creation time.
> >
> > The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> > supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> > program to make sure that program doesn't get freed while timer is armed.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> 
> Overall this LGTM, and I believe it will be usable for my intended use
> case. One question:
> 
> With this, it will basically be possible to create a BPF daemon, won't
> it? I.e., if a program includes a timer and the callback keeps re-arming
> itself this will continue indefinitely even if userspace closes all refs
> to maps and programs? Not saying this is a problem, just wanted to check
> my understanding (i.e., that there's not some hidden requirement on
> userspace keeping a ref open that I'm missing)...

That is correct.
Another option would be to auto-cancel the timer when the last reference
to the prog drops. That may feel safer, since a forever-running
bpf daemon is certainly a new concept.
The main benefits of doing prog_refcnt++ from bpf_timer_start are ease
of use and no surprises.
A timer callback that disappears when the prog unloads is outside of the bpf prog's control.
For example the tracing bpf prog might collect some data and periodically
flush it to user space. The prog would arm the timer and when callback
is invoked it would send the data via ring buffer and start another
data collection cycle.
When the user space part of the service exits it has no way
to force a flush of the last chunk of data.
It could issue the prog_run cmd to call the timer callback,
but that's racy.
The solution to this problem could be __init and __fini
sections that will be invoked when the prog is loaded
and when the last refcnt drops.
It's a complementary feature though.
The prog_refcnt++ from bpf_timer_start combined with a prog
explicitly doing bpf_timer_cancel from __fini
would be the most flexible combination.
This way the prog can choose to be a daemon or it can choose
to cancel its timers and do data flushing when the last prog
reference drops.
The prog refcnt would be split (similar to uref). The __fini callback
will be invoked when refcnt reaches zero, but all increments
done by bpf_timer_start will be counted separately.
The user space wouldn't need to do the prog_run command.
It would detach the prog and close(prog_fd).
That will trigger __fini callback that will cancel the timers
and the prog will be fully unloaded.
That would make bpf progs resemble kernel modules even more.
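To make that concrete, here is a rough sketch of what such a prog could look
like. Note that the SEC("__fini") section name, its callback signature and its
invocation semantics are purely hypothetical at this point; only the
bpf_timer_* helpers below exist in this series:

  struct map_elem {
      int counter;
      struct bpf_timer timer;
  };

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 1);
      __type(key, int);
      __type(value, struct map_elem);
  } hmap SEC(".maps");

  /* hypothetical section, invoked when the last prog refcnt
   * (excluding the refs taken by bpf_timer_start) drops to zero
   */
  SEC("__fini")
  int prog_fini(void *ctx)
  {
      struct map_elem *val;
      int key = 0;

      val = bpf_map_lookup_elem(&hmap, &key);
      if (val) {
          bpf_timer_cancel(&val->timer);
          /* flush the last chunk of collected data via ring buffer here */
      }
      return 0;
  }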

> > +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> > +	   struct bpf_map *, map, struct bpf_prog *, prog)
> > +{
> > +	struct bpf_hrtimer *t;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +	if (READ_ONCE(timer->timer))
> > +		return -EBUSY;
> > +	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> > +	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> > +	if (!t)
> > +		return -ENOMEM;
> > +	t->callback_fn = cb;
> > +	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
> > +	t->key = t->value - round_up(map->key_size, 8);
> 
> For array-maps won't this just point to somewhere inside the previous value?

Excellent catch. Thank you. Will fix.
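For the record, one possible way to handle the array case (not necessarily the
actual fix) would be to remember the element index and pass a pointer to it as
the key. The sketch below assumes an extra 'u32 idx' field added to struct
bpf_hrtimer and relies on the current bpf_timer-at-offset-0 restriction:

  /* sketch only: inside bpf_timer_init() */
  if (map->map_type == BPF_MAP_TYPE_ARRAY) {
      struct bpf_array *array = container_of(map, struct bpf_array, map);

      /* let the callback see the element index as the 'key' argument */
      t->idx = (t->value - (void *)array->value) / array->elem_size;
      t->key = &t->idx;
  } else {
      t->key = t->value - round_up(map->key_size, 8);
  }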

> > +	t->map = map;
> > +	t->prog = prog;
> > +	spin_lock_init(&t->lock);
> > +	hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
> > +	t->timer.function = timer_cb;
> > +	if (cmpxchg(&timer->timer, NULL, t)) {
> > +		/* Parallel bpf_timer_init() calls raced. */
> > +		kfree(t);
> > +		return -EBUSY;
> > +	}
> > +	return 0;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_timer_init_proto = {
> > +	.func		= bpf_timer_init,
> > +	.gpl_only	= false,
> 
> hrtimer_init() is EXPORT_SYMBOL_GPL, should this be as well? Same with
> the others below.

Excellent catch as well! Will fix.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-05-27  4:02 ` [PATCH bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
  2021-05-27 16:57   ` Toke Høiland-Jørgensen
@ 2021-06-02 22:08   ` Andrii Nakryiko
  2021-06-03  1:53     ` Alexei Starovoitov
  1 sibling, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-02 22:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, May 26, 2021 at 9:03 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce 'struct bpf_timer { __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
> long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags);
> long bpf_timer_start(struct bpf_timer *timer, u64 nsecs);
> long bpf_timer_cancel(struct bpf_timer *timer);
>
> Here is how BPF program might look like:
> struct map_elem {
>     int counter;
>     struct bpf_timer timer;
> };
>
> struct {
>     __uint(type, BPF_MAP_TYPE_HASH);
>     __uint(max_entries, 1000);
>     __type(key, int);
>     __type(value, struct map_elem);
> } hmap SEC(".maps");
>
> struct bpf_timer global_timer;

Using bpf_timer as a global variable has at least two problems. We
discussed one offline but I realized another one reading the code in
this patch:
  1. this memory can be (and is) memory-mapped as read-write, so user-space
can just write over it (intentionally or accidentally), which makes it
quite unsafe
  2. with current restriction of having offset 0 for struct bpf_timer,
you have to use global variable for it, because clang will reorder
static variables after global variables.

I think it's better to disallow use of struct bpf_timer in
memory-mapped maps. Keep in mind that user can create memory-mapped
BPF_MAP_TYPE_ARRAY manually (without using global variables) and put
struct bpf_timer in it, so that should be disallowed as well.
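To illustrate point 2, assuming the reordering described above (statics emitted
after globals in the .bss datasec):

  struct bpf_timer global_timer;        /* global: can end up at offset 0 */
  static struct bpf_timer static_timer; /* static: pushed after all globals,
                                         * so it can never be at offset 0
                                         */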

>
> static int timer_cb1(void *map, int *key, __u64 *data);
> /* global_timer is in bss which is special bpf array of one element.
>  * data points to beginning of bss.
>  */
>
> static int timer_cb2(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
>
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>     struct map_elem *val;
>     int key = 0;
>     bpf_timer_init(&global_timer, timer_cb1, 0);
>     bpf_timer_start(&global_timer, 0 /* call timer_cb1 asap */);
>
>     val = bpf_map_lookup_elem(&hmap, &key);
>     if (val) {
>         bpf_timer_init(&val->timer, timer_cb2, 0);
>         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
>     }
> }
>
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
>
> Only programs with CAP_BPF are allowed to use bpf_timer.
>
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
>
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf.h            |   1 +
>  include/uapi/linux/bpf.h       |  26 ++++++
>  kernel/bpf/helpers.c           | 160 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          | 110 +++++++++++++++++++++++
>  kernel/trace/bpf_trace.c       |   2 +-
>  scripts/bpf_doc.py             |   2 +
>  tools/include/uapi/linux/bpf.h |  26 ++++++
>  7 files changed, 326 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1e9a0ff3217b..925b8416ea0a 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -314,6 +314,7 @@ enum bpf_arg_type {
>         ARG_PTR_TO_FUNC,        /* pointer to a bpf program function */
>         ARG_PTR_TO_STACK_OR_NULL,       /* pointer to stack or NULL */
>         ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only string */
> +       ARG_PTR_TO_TIMER,       /* pointer to bpf_timer */
>         __BPF_ARG_TYPE_MAX,
>  };
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 562adeac1d67..3da901d5076b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4774,6 +4774,25 @@ union bpf_attr {
>   *             Execute close syscall for given FD.
>   *     Return
>   *             A syscall result.
> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *     Description
> + *             Initialize the timer to call given static function.
> + *     Return
> + *             zero

-EBUSY is probably the most important to mention here, but generally,
the way it's described right now, it seems like it can't fail, which is
not true. Similar for bpf_timer_start() and bpf_timer_cancel().

> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *     Description
> + *             Start the timer and set its expiration N nanoseconds from
> + *             the current time.

The case of nsecs == 0 is a bit special and interesting; it's useful
to explain what will happen in that case. I'm actually curious as
well, in the code you say "call ASAP", but does it mean after the BPF
program exits? Or can it start immediately on another CPU? Or will it
interrupt the currently running BPF program to run the callback
(unlikely, but that might be someone's expectation).

> + *     Return
> + *             zero
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *     Description
> + *             Deactivate the timer.
> + *     Return
> + *             zero
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -4945,6 +4964,9 @@ union bpf_attr {
>         FN(sys_bpf),                    \
>         FN(btf_find_by_name_kind),      \
>         FN(sys_close),                  \
> +       FN(timer_init),                 \
> +       FN(timer_start),                \
> +       FN(timer_cancel),               \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6051,6 +6073,10 @@ struct bpf_spin_lock {
>         __u32   val;
>  };
>
> +struct bpf_timer {
> +       __u64 :64;
> +};
> +
>  struct bpf_sysctl {
>         __u32   write;          /* Sysctl is being read (= 0) or written (= 1).
>                                  * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 544773970dbc..6f9620cbe95d 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -985,6 +985,160 @@ const struct bpf_func_proto bpf_snprintf_proto = {
>         .arg5_type      = ARG_CONST_SIZE_OR_ZERO,
>  };
>
> +struct bpf_hrtimer {
> +       struct hrtimer timer;
> +       spinlock_t lock;
> +       struct bpf_map *map;
> +       struct bpf_prog *prog;
> +       void *callback_fn;
> +       void *key;
> +       void *value;
> +};
> +
> +/* the actual struct hidden inside uapi struct bpf_timer */
> +struct bpf_timer_kern {
> +       struct bpf_hrtimer *timer;
> +};
> +
> +static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> +
> +static enum hrtimer_restart timer_cb(struct hrtimer *timer)

nit: can you please call it bpf_timer_cb, so it might be possible to
trace it a bit easier due to bpf_ prefix?

> +{
> +       struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> +       unsigned long flags;
> +       int ret;
> +
> +       /* timer_cb() runs in hrtimer_run_softirq and doesn't migrate.
> +        * Remember the timer this callback is servicing to prevent
> +        * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +        */
> +       this_cpu_write(hrtimer_running, t);
> +       ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)t->map,
> +                                           (u64)(long)t->key,
> +                                           (u64)(long)t->value, 0, 0);
> +       WARN_ON(ret != 0); /* todo: define 0 vs 1 or disallow 1 in the verifier */

if we define 0 vs 1, what would their meaning be?

> +       spin_lock_irqsave(&t->lock, flags);
> +       if (!hrtimer_is_queued(timer))
> +               bpf_prog_put(t->prog);
> +       spin_unlock_irqrestore(&t->lock, flags);
> +       this_cpu_write(hrtimer_running, NULL);

Don't know if it's a problem. Above you say that timer_cb doesn't
migrate, but can it be preempted? If yes, and timer_cb is called in
the meantime for another timer, setting hrtimer_running to NULL will
clobber the previous value, right? So no nesting is possible. Is this
a problem?

Also is there a chance for timer callback to be a sleepable BPF (sub-)program?

What if we add a field to struct bpf_hrtimer that is incremented/decremented
to show whether it's active or not? That should bypass the per-CPU
assumptions, but I haven't thought through the races; worst case we might
need to take t->lock.

> +       return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> +          struct bpf_map *, map, struct bpf_prog *, prog)
> +{
> +       struct bpf_hrtimer *t;
> +
> +       if (flags)
> +               return -EINVAL;
> +       if (READ_ONCE(timer->timer))
> +               return -EBUSY;
> +       /* allocate hrtimer via map_kmalloc to use memcg accounting */
> +       t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> +       if (!t)
> +               return -ENOMEM;
> +       t->callback_fn = cb;
> +       t->value = (void *)timer /* - offset of bpf_timer inside elem */;
> +       t->key = t->value - round_up(map->key_size, 8);
> +       t->map = map;
> +       t->prog = prog;
> +       spin_lock_init(&t->lock);
> +       hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
> +       t->timer.function = timer_cb;
> +       if (cmpxchg(&timer->timer, NULL, t)) {
> +               /* Parallel bpf_timer_init() calls raced. */
> +               kfree(t);
> +               return -EBUSY;
> +       }
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_init_proto = {
> +       .func           = bpf_timer_init,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_TIMER,
> +       .arg2_type      = ARG_PTR_TO_FUNC,
> +       .arg3_type      = ARG_ANYTHING,
> +};
> +
> +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)

Not entirely sure, but it feels like adding flags would be good here as well?

> +{
> +       struct bpf_hrtimer *t;
> +       unsigned long flags;
> +
> +       t = READ_ONCE(timer->timer);
> +       if (!t)
> +               return -EINVAL;
> +       spin_lock_irqsave(&t->lock, flags);
> +       /* Keep the prog alive until callback is invoked */
> +       if (!hrtimer_active(&t->timer))
> +               bpf_prog_inc(t->prog);
> +       hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> +       spin_unlock_irqrestore(&t->lock, flags);
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_start_proto = {
> +       .func           = bpf_timer_start,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_TIMER,
> +       .arg2_type      = ARG_ANYTHING,
> +};
> +
> +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> +{
> +       struct bpf_hrtimer *t;
> +       unsigned long flags;
> +
> +       t = READ_ONCE(timer->timer);
> +       if (!t)
> +               return -EINVAL;
> +       if (this_cpu_read(hrtimer_running) == t)
> +               /* If bpf callback_fn is trying to bpf_timer_cancel()
> +                * its own timer the hrtimer_cancel() will deadlock
> +                * since it waits for callback_fn to finish
> +                */
> +               return -EBUSY;
> +       spin_lock_irqsave(&t->lock, flags);
> +       /* Cancel the timer and wait for associated callback to finish
> +        * if it was running.
> +        */
> +       if (hrtimer_cancel(&t->timer) == 1)
> +               /* If the timer was active then drop the prog refcnt,
> +                * since callback will not be invoked.
> +                */

So whether the timer was actually cancelled or it was active/had already
fired seems useful to know in a BPF program (sometimes). I can't recall
an exact example, but in the past dealing with some timers (in
user-space, but the point stands) I remember it was important to know
this, so maybe we can communicate that as 0 or 1 returned from
bpf_timer_cancel?


> +               bpf_prog_put(t->prog);
> +       spin_unlock_irqrestore(&t->lock, flags);
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_cancel_proto = {
> +       .func           = bpf_timer_cancel,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_TIMER,
> +};
> +
> +void bpf_timer_cancel_and_free(void *val)
> +{
> +       struct bpf_timer_kern *timer = val;
> +       struct bpf_hrtimer *t;
> +
> +       t = READ_ONCE(timer->timer);
> +       if (!t)
> +               return;
> +       /* Cancel the timer and wait for callback to complete
> +        * if it was running
> +        */
> +       if (hrtimer_cancel(&t->timer) == 1)
> +               bpf_prog_put(t->prog);
> +       kfree(t);
> +       WRITE_ONCE(timer->timer, NULL);

this seems to race with bpf_timer_start, no? Doing WRITE_ONCE and then
kfree() timer would be a bit safer (we won't have dangling pointer at
any point in time), but I think that still is racy, because
bpf_timer_start() can read timer->timer before WRITE_ONCE(NULL) here,
then we kfree(t), and then bpf_timer_start() proceeds to take t->lock
which might explode or might do whatever.

If there is a non-obvious mechanism preventing this (I haven't read
second patch thoroughly yet), it's probably a good idea to mention
that here.

A small nit, you added bpf_timer_cancel_and_free() prototype in the
next patch, but it should probably be done in this patch to avoid
unnecessary compiler warnings during bisecting.


> +}
> +
>  const struct bpf_func_proto bpf_get_current_task_proto __weak;
>  const struct bpf_func_proto bpf_probe_read_user_proto __weak;
>  const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
> @@ -1051,6 +1205,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>                 return &bpf_per_cpu_ptr_proto;
>         case BPF_FUNC_this_cpu_ptr:
>                 return &bpf_this_cpu_ptr_proto;
> +       case BPF_FUNC_timer_init:
> +               return &bpf_timer_init_proto;
> +       case BPF_FUNC_timer_start:
> +               return &bpf_timer_start_proto;
> +       case BPF_FUNC_timer_cancel:
> +               return &bpf_timer_cancel_proto;
>         default:
>                 break;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1de4b8c6ee42..f386f85aee5c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4656,6 +4656,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
>         return 0;
>  }
>
> +static int process_timer_func(struct bpf_verifier_env *env, int regno,
> +                             struct bpf_call_arg_meta *meta)
> +{
> +       struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +       bool is_const = tnum_is_const(reg->var_off);
> +       struct bpf_map *map = reg->map_ptr;
> +       u64 val = reg->var_off.value;
> +
> +       if (!is_const) {
> +               verbose(env,
> +                       "R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
> +                       regno);
> +               return -EINVAL;
> +       }
> +       if (!map->btf) {
> +               verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
> +                       map->name);
> +               return -EINVAL;
> +       }
> +       if (val) {
> +               /* todo: relax this requirement */

Yeah, that's quite a non-obvious requirement. How hard is it to get
rid of it? If not, we should probably leave a comment near the new struct
bpf_timer definition in the UAPI header.

> +               verbose(env, "bpf_timer field can only be first in the map value element\n");
> +               return -EINVAL;
> +       }
> +       WARN_ON(meta->map_ptr);
> +       meta->map_ptr = map;
> +       return 0;
> +}
> +

[...]

>         if (func_id == BPF_FUNC_snprintf) {
>                 err = check_bpf_snprintf_call(env, regs);
>                 if (err < 0)
> @@ -12526,6 +12605,37 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         insn      = new_prog->insnsi + i + delta;
>                         continue;
>                 }
> +               if (insn->imm == BPF_FUNC_timer_init) {
> +

nit: why empty line here?

> +                       aux = &env->insn_aux_data[i + delta];
> +                       if (bpf_map_ptr_poisoned(aux)) {
> +                               verbose(env, "bpf_timer_init abusing map_ptr\n");
> +                               return -EINVAL;
> +                       }
> +                       map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
> +                       {
> +                               struct bpf_insn ld_addrs[4] = {
> +                                       BPF_LD_IMM64(BPF_REG_4, (long)map_ptr),
> +                                       BPF_LD_IMM64(BPF_REG_5, (long)prog),
> +                               };
> +
> +                               insn_buf[0] = ld_addrs[0];
> +                               insn_buf[1] = ld_addrs[1];
> +                               insn_buf[2] = ld_addrs[2];
> +                               insn_buf[3] = ld_addrs[3];
> +                       }
> +                       insn_buf[4] = *insn;
> +                       cnt = 5;
> +
> +                       new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +                       if (!new_prog)
> +                               return -ENOMEM;
> +
> +                       delta    += cnt - 1;
> +                       env->prog = prog = new_prog;
> +                       insn      = new_prog->insnsi + i + delta;
> +                       goto patch_call_imm;
> +               }
>
>                 /* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
>                  * and other inlining handlers are currently limited to 64 bit
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d2d7cf6cfe83..453a46c2d732 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1065,7 +1065,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>         case BPF_FUNC_snprintf:
>                 return &bpf_snprintf_proto;
>         default:
> -               return NULL;
> +               return bpf_base_func_proto(func_id);
>         }
>  }
>

[...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-02  1:46     ` Alexei Starovoitov
@ 2021-06-02 22:21       ` Toke Høiland-Jørgensen
  2021-06-03  1:58         ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-06-02 22:21 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, daniel, andrii, netdev, bpf, kernel-team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, May 27, 2021 at 06:57:17PM +0200, Toke Høiland-Jørgensen wrote:
>> >     if (val) {
>> >         bpf_timer_init(&val->timer, timer_cb2, 0);
>> >         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
>> 
>> nit: there are 1M nanoseconds in a millisecond :)
>
> oops :)
>
>> >     }
>> > }
>> >
>> > This patch adds helper implementations that rely on hrtimers
>> > to call bpf functions as timers expire.
>> > The following patch adds necessary safety checks.
>> >
>> > Only programs with CAP_BPF are allowed to use bpf_timer.
>> >
>> > The amount of timers used by the program is constrained by
>> > the memcg recorded at map creation time.
>> >
>> > The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
>> > supplied by the verifier. The prog pointer is needed to do refcnting of bpf
>> > program to make sure that program doesn't get freed while timer is armed.
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> 
>> Overall this LGTM, and I believe it will be usable for my intended use
>> case. One question:
>> 
>> With this, it will basically be possible to create a BPF daemon, won't
>> it? I.e., if a program includes a timer and the callback keeps re-arming
>> itself this will continue indefinitely even if userspace closes all refs
>> to maps and programs? Not saying this is a problem, just wanted to check
>> my understanding (i.e., that there's not some hidden requirement on
>> userspace keeping a ref open that I'm missing)...
>
> That is correct.
> Another option would be to auto-cancel the timer when the last reference
> to the prog drops. That may feel safer, since forever
> running bpf daemon is a certainly new concept.
> The main benefits of doing prog_refcnt++ from bpf_timer_start are ease
> of use and no surprises.
> Disappearing timer callback when prog unloads is outside of bpf prog control.
> For example the tracing bpf prog might collect some data and periodically
> flush it to user space. The prog would arm the timer and when callback
> is invoked it would send the data via ring buffer and start another
> data collection cycle.
> When the user space part of the service exits it doesn't have
> an ability to enforce the flush of the last chunk of data.
> It could do prog_run cmd that will call the timer callback,
> but it's racy.
> The solution to this problem could be __init and __fini
> sections that will be invoked when the prog is loaded
> and when the last refcnt drops.
> It's a complementary feature though.
> The prog_refcnt++ from bpf_timer_start combined with a prog
> explicitly doing bpf_timer_cancel from __fini
> would be the most flexible combination.
> This way the prog can choose to be a daemon or it can choose
> to cancel its timers and do data flushing when the last prog
> reference drops.
> The prog refcnt would be split (similar to uref). The __fini callback
> will be invoked when refcnt reaches zero, but all increments
> done by bpf_timer_start will be counted separately.
> The user space wouldn't need to do the prog_run command.
> It would detach the prog and close(prog_fd).
> That will trigger __fini callback that will cancel the timers
> and the prog will be fully unloaded.
> That would make bpf progs resemble kernel modules even more.

I like the idea of a "destructor" that will trigger on refcnt drop to
zero. And I do think a "bpf daemon" is potentially a useful, if novel,
concept.

The __fini thing kinda supposes a well-behaved program, though, right?
I.e., it would be fairly trivial to write a program that spins forever
by repeatedly scheduling the timer with a very short interval (whether
by malice or bugginess). So do we need a 'bpfkill' type utility to nuke
buggy programs, or how would resource constraints be enforced?

-Toke


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer.
  2021-05-27  4:02 ` [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
@ 2021-06-02 22:34   ` Andrii Nakryiko
  2021-06-02 22:38     ` Andrii Nakryiko
  2021-06-03  2:04     ` Alexei Starovoitov
  0 siblings, 2 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-02 22:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, May 26, 2021 at 9:03 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Add appropriate safety checks for bpf_timer:
> - restrict to array, hash, lru. per-cpu maps cannot be supported.
> - kfree bpf_timer during map_delete_elem and map_free.
> - verifier btf checks.
> - safe interaction with lookup/update/delete operations and iterator.
> - relax the first field only requirement of the previous patch.
> - allow bpf_timer in global data and search for it in datasec.
> - check prog_rdonly, frozen flags.
> - mmap is allowed. otherwise global timer is not possible.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf.h        | 36 +++++++++++++-----
>  include/linux/btf.h        |  1 +
>  kernel/bpf/arraymap.c      |  7 ++++
>  kernel/bpf/btf.c           | 77 +++++++++++++++++++++++++++++++-------
>  kernel/bpf/hashtab.c       | 53 ++++++++++++++++++++------
>  kernel/bpf/helpers.c       |  2 +-
>  kernel/bpf/local_storage.c |  4 +-
>  kernel/bpf/syscall.c       | 23 ++++++++++--
>  kernel/bpf/verifier.c      | 30 +++++++++++++--
>  9 files changed, 190 insertions(+), 43 deletions(-)
>

[...]

>  /* copy everything but bpf_spin_lock */
>  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>  {
> +       u32 off = 0, size = 0;
> +
>         if (unlikely(map_value_has_spin_lock(map))) {
> -               u32 off = map->spin_lock_off;
> +               off = map->spin_lock_off;
> +               size = sizeof(struct bpf_spin_lock);
> +       } else if (unlikely(map_value_has_timer(map))) {
> +               off = map->timer_off;
> +               size = sizeof(struct bpf_timer);
> +       }

so the need to handle 0, 1, or 2 gaps seems to be the only reason to
disallow both bpf_spin_lock and bpf_timer in one map element, right?
Isn't it worth addressing it from the very beginning to lift the
artificial restriction? E.g., for speed, you'd do:

if (likely(neither spinlock nor timer)) {
 /* fastest pass */
} else if (only one of spinlock or timer) {
  /* do what you do here */
} else {
  int off1, off2, sz1, sz2;

  if (spinlock_off < timer_off) {
    off1 = spinlock_off;
    sz1 = spinlock_sz;
    off2 = timer_off;
    sz2 = timer_sz;
  } else {
    ... you get the idea
  }

  memcpy(0, off1);
  memcpy(off1+sz1, off2);
  memcpy(off2+sz2, total_sz);
}

It's not that bad, right?
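Spelled out fully, a sketch of that (using the spin_lock_off/timer_off fields
this series adds; the actual implementation may of course differ) could look
like:

  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
  {
      bool has_lock = map_value_has_spin_lock(map);
      bool has_timer = map_value_has_timer(map);
      u32 off1, sz1, off2, sz2;

      if (likely(!has_lock && !has_timer)) {
          memcpy(dst, src, map->value_size);
          return;
      }
      if (has_lock && has_timer) {
          /* order the two holes by offset */
          if (map->spin_lock_off < map->timer_off) {
              off1 = map->spin_lock_off;
              sz1 = sizeof(struct bpf_spin_lock);
              off2 = map->timer_off;
              sz2 = sizeof(struct bpf_timer);
          } else {
              off1 = map->timer_off;
              sz1 = sizeof(struct bpf_timer);
              off2 = map->spin_lock_off;
              sz2 = sizeof(struct bpf_spin_lock);
          }
          /* copy the three regions around the two holes */
          memcpy(dst, src, off1);
          memcpy(dst + off1 + sz1, src + off1 + sz1, off2 - off1 - sz1);
          memcpy(dst + off2 + sz2, src + off2 + sz2, map->value_size - off2 - sz2);
          return;
      }
      /* exactly one special field */
      off1 = has_lock ? map->spin_lock_off : map->timer_off;
      sz1 = has_lock ? sizeof(struct bpf_spin_lock) : sizeof(struct bpf_timer);
      memcpy(dst, src, off1);
      memcpy(dst + off1 + sz1, src + off1 + sz1, map->value_size - off1 - sz1);
  }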

>
> +       if (unlikely(size)) {
>                 memcpy(dst, src, off);
> -               memcpy(dst + off + sizeof(struct bpf_spin_lock),
> -                      src + off + sizeof(struct bpf_spin_lock),
> -                      map->value_size - off - sizeof(struct bpf_spin_lock));
> +               memcpy(dst + off + size,
> +                      src + off + size,
> +                      map->value_size - off - size);
>         } else {
>                 memcpy(dst, src, map->value_size);
>         }

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f386f85aee5c..0a828dc4968e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>                         return -EACCES;
>                 }
>         }
> +       if (map_value_has_timer(map)) {
> +               u32 t = map->timer_off;
> +
> +               if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&

<= ? Otherwise we allow accessing the first byte, unless I'm mistaken.

> +                    t < reg->umax_value + off + size) {
> +                       verbose(env, "bpf_timer cannot be accessed directly by load/store\n");
> +                       return -EACCES;
> +               }
> +       }
>         return err;
>  }
>
> @@ -4675,9 +4684,24 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
>                         map->name);
>                 return -EINVAL;
>         }
> -       if (val) {
> -               /* todo: relax this requirement */
> -               verbose(env, "bpf_timer field can only be first in the map value element\n");

ok, this was confusing, but now I see why you did that...

> +       if (!map_value_has_timer(map)) {
> +               if (map->timer_off == -E2BIG)
> +                       verbose(env,
> +                               "map '%s' has more than one 'struct bpf_timer'\n",
> +                               map->name);
> +               else if (map->timer_off == -ENOENT)
> +                       verbose(env,
> +                               "map '%s' doesn't have 'struct bpf_timer'\n",
> +                               map->name);
> +               else
> +                       verbose(env,
> +                               "map '%s' is not a struct type or bpf_timer is mangled\n",
> +                               map->name);
> +               return -EINVAL;
> +       }
> +       if (map->timer_off != val + reg->off) {
> +               verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
> +                       val + reg->off, map->timer_off);
>                 return -EINVAL;
>         }
>         WARN_ON(meta->map_ptr);
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer.
  2021-06-02 22:34   ` Andrii Nakryiko
@ 2021-06-02 22:38     ` Andrii Nakryiko
  2021-06-03  2:04     ` Alexei Starovoitov
  1 sibling, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-02 22:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, Jun 2, 2021 at 3:34 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, May 26, 2021 at 9:03 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Add appropriate safety checks for bpf_timer:
> > - restrict to array, hash, lru. per-cpu maps cannot be supported.
> > - kfree bpf_timer during map_delete_elem and map_free.
> > - verifier btf checks.
> > - safe interaction with lookup/update/delete operations and iterator.
> > - relax the first field only requirement of the previous patch.
> > - allow bpf_timer in global data and search for it in datasec.

I'll mention it here for completeness. I don't think the safety
implications are worth it to support a timer or spinlock in
memory-mapped maps. It's way too easy to abuse (or even to
accidentally corrupt kernel state). Sure it's nice, but doing an
explicit single-element map for a "global" timer is just fine. And it
generalizes nicely to having 2, 3, ..., N timers.
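For reference, that explicit single-element map approach would look roughly
like this on the BPF side (a sketch against the helper signatures from patch 1;
the map and callback names are made up):

  struct timer_elem {
      struct bpf_timer timer;
  };

  /* plain (non-mmapable) array, so user space can't scribble over the timer */
  struct {
      __uint(type, BPF_MAP_TYPE_ARRAY);
      __uint(max_entries, 1);
      __type(key, int);
      __type(value, struct timer_elem);
  } timer_map SEC(".maps");

  static int timer_cb(void *map, int *key, struct timer_elem *val)
  {
      return 0;
  }

  SEC("fentry/bpf_fentry_test1")
  int BPF_PROG(arm_timer, int a)
  {
      struct timer_elem *val;
      int key = 0;

      val = bpf_map_lookup_elem(&timer_map, &key);
      if (val) {
          bpf_timer_init(&val->timer, timer_cb, 0);
          bpf_timer_start(&val->timer, 1000000 /* 1 msec */);
      }
      return 0;
  }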

> > - check prog_rdonly, frozen flags.
> > - mmap is allowed. otherwise global timer is not possible.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  include/linux/bpf.h        | 36 +++++++++++++-----
> >  include/linux/btf.h        |  1 +
> >  kernel/bpf/arraymap.c      |  7 ++++
> >  kernel/bpf/btf.c           | 77 +++++++++++++++++++++++++++++++-------
> >  kernel/bpf/hashtab.c       | 53 ++++++++++++++++++++------
> >  kernel/bpf/helpers.c       |  2 +-
> >  kernel/bpf/local_storage.c |  4 +-
> >  kernel/bpf/syscall.c       | 23 ++++++++++--
> >  kernel/bpf/verifier.c      | 30 +++++++++++++--
> >  9 files changed, 190 insertions(+), 43 deletions(-)
> >
>
> [...]
>
> >  /* copy everything but bpf_spin_lock */
> >  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> >  {
> > +       u32 off = 0, size = 0;
> > +
> >         if (unlikely(map_value_has_spin_lock(map))) {
> > -               u32 off = map->spin_lock_off;
> > +               off = map->spin_lock_off;
> > +               size = sizeof(struct bpf_spin_lock);
> > +       } else if (unlikely(map_value_has_timer(map))) {
> > +               off = map->timer_off;
> > +               size = sizeof(struct bpf_timer);
> > +       }
>
> so the need to handle 0, 1, or 2 gaps seems to be the only reason to
> disallow both bpf_spinlock and bpf_timer in one map element, right?
> Isn't it worth addressing it from the very beginning to lift the
> artificial restriction? E.g., for speed, you'd do:
>
> if (likely(neither spinlock nor timer)) {
>  /* fastest pass */
> } else if (only one of spinlock or timer) {
>   /* do what you do here */
> } else {
>   int off1, off2, sz1, sz2;
>
>   if (spinlock_off < timer_off) {
>     off1 = spinlock_off;
>     sz1 = spinlock_sz;
>     off2 = timer_off;
>     sz2 = timer_sz;
>   } else {
>     ... you get the idea
>   }
>
>   memcpy(0, off1);
>   memcpy(off1+sz1, off2);
>   memcpy(off2+sz2, total_sz);
> }
>
> It's not that bad, right?
>
> >
> > +       if (unlikely(size)) {
> >                 memcpy(dst, src, off);
> > -               memcpy(dst + off + sizeof(struct bpf_spin_lock),
> > -                      src + off + sizeof(struct bpf_spin_lock),
> > -                      map->value_size - off - sizeof(struct bpf_spin_lock));
> > +               memcpy(dst + off + size,
> > +                      src + off + size,
> > +                      map->value_size - off - size);
> >         } else {
> >                 memcpy(dst, src, map->value_size);
> >         }
>
> [...]
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index f386f85aee5c..0a828dc4968e 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >                         return -EACCES;
> >                 }
> >         }
> > +       if (map_value_has_timer(map)) {
> > +               u32 t = map->timer_off;
> > +
> > +               if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
>
> <= ? Otherwise we allow accessing the first byte, unless I'm mistaken.
>
> > +                    t < reg->umax_value + off + size) {
> > +                       verbose(env, "bpf_timer cannot be accessed directly by load/store\n");
> > +                       return -EACCES;
> > +               }
> > +       }
> >         return err;
> >  }
> >
> > @@ -4675,9 +4684,24 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
> >                         map->name);
> >                 return -EINVAL;
> >         }
> > -       if (val) {
> > -               /* todo: relax this requirement */
> > -               verbose(env, "bpf_timer field can only be first in the map value element\n");
>
> ok, this was confusing, but now I see why you did that...
>
> > +       if (!map_value_has_timer(map)) {
> > +               if (map->timer_off == -E2BIG)
> > +                       verbose(env,
> > +                               "map '%s' has more than one 'struct bpf_timer'\n",
> > +                               map->name);
> > +               else if (map->timer_off == -ENOENT)
> > +                       verbose(env,
> > +                               "map '%s' doesn't have 'struct bpf_timer'\n",
> > +                               map->name);
> > +               else
> > +                       verbose(env,
> > +                               "map '%s' is not a struct type or bpf_timer is mangled\n",
> > +                               map->name);
> > +               return -EINVAL;
> > +       }
> > +       if (map->timer_off != val + reg->off) {
> > +               verbose(env, "off %lld doesn't point to 'struct bpf_timer' that is at %d\n",
> > +                       val + reg->off, map->timer_off);
> >                 return -EINVAL;
> >         }
> >         WARN_ON(meta->map_ptr);
> > --
> > 2.30.2
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-02 22:08   ` Andrii Nakryiko
@ 2021-06-03  1:53     ` Alexei Starovoitov
  2021-06-03 17:10       ` Andrii Nakryiko
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2021-06-03  1:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, Jun 02, 2021 at 03:08:08PM -0700, Andrii Nakryiko wrote:
> On Wed, May 26, 2021 at 9:03 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce 'struct bpf_timer { __u64 :64; };' that can be embedded
> > in hash/array/lru maps as regular field and helpers to operate on it:
> > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags);
> > long bpf_timer_start(struct bpf_timer *timer, u64 nsecs);
> > long bpf_timer_cancel(struct bpf_timer *timer);
> >
> > Here is how BPF program might look like:
> > struct map_elem {
> >     int counter;
> >     struct bpf_timer timer;
> > };
> >
> > struct {
> >     __uint(type, BPF_MAP_TYPE_HASH);
> >     __uint(max_entries, 1000);
> >     __type(key, int);
> >     __type(value, struct map_elem);
> > } hmap SEC(".maps");
> >
> > struct bpf_timer global_timer;
> 
> Using bpf_timer as a global variable has at least two problems. We
> discussed one offline but I realized another one reading the code in
> this patch:
>   1. this memory can and is memory-mapped as read-write, so user-space
> can just write over this (intentionally or accidentally), so it's
> quite unsafe

yep.

>   2. with current restriction of having offset 0 for struct bpf_timer,
> you have to use global variable for it, because clang will reorder
> static variables after global variables.

that is addressed in the 2nd patch.

> > + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> > + *     Description
> > + *             Initialize the timer to call given static function.
> > + *     Return
> > + *             zero
> 
> -EBUSY is probably the most important to mention here, but generally
> the way it's described right now it seems like it can't fail, which is
> not true. Similar for bpf_timer_start() and bpf_timer_cancel().

good point.

> > + *
> > + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> > + *     Description
> > + *             Start the timer and set its expiration N nanoseconds from
> > + *             the current time.
> 
> The case of nsecs == 0 is a bit special and interesting, it's useful
> to explain what will happen in that case. I'm actually curious as
> well, in the code you say "call ASAP", but does it mean after the BPF
> program exits? Or can it start immediately on another CPU? Or will it
> interrupt the currently running BPF program to run the callback
> (unlikely, but that might be someone's expectation).

nsecs == 0 is not special. nsecs is a relative expiry time.
1 nanosecond is not much different from zero :)
Most likely this timer will be the first one to run once it's added
to per-cpu rb-tree.

I think going too much into implementation details in the helper
description is unnecessary.
" Start the timer and set its expiration N nanoseconds from
  the current time. "
is probably the right amount of detail.
I can add that the clock is monotonic
and that the callback is called in softirq.

> > +static enum hrtimer_restart timer_cb(struct hrtimer *timer)
> 
> nit: can you please call it bpf_timer_cb, so it might be possible to
> trace it a bit easier due to bpf_ prefix?

sure.

> > +{
> > +       struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> > +       unsigned long flags;
> > +       int ret;
> > +
> > +       /* timer_cb() runs in hrtimer_run_softirq and doesn't migrate.
> > +        * Remember the timer this callback is servicing to prevent
> > +        * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> > +        */
> > +       this_cpu_write(hrtimer_running, t);
> > +       ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)t->map,
> > +                                           (u64)(long)t->key,
> > +                                           (u64)(long)t->value, 0, 0);
> > +       WARN_ON(ret != 0); /* todo: define 0 vs 1 or disallow 1 in the verifier */
> 
> if we define 0 vs 1, what would their meaning be?

How would you define it?
The hrtimer has support for HRTIMER_NORESTART vs HRTIMER_RESTART.
This feature of hrtimer is already exposed to user space via timerfd,
so no concerns there.
But to use HRTIMER_RESTART in a meaningful way the bpf prog
would need to be able to call to hrtimer_forward_now() to set
the new expiry.
That function has an interesting caution comment in hrtimer.h:
 * Can be safely called from the callback function of @timer. If
 * called from other contexts @timer must neither be enqueued nor
 * running the callback and the caller needs to take care of
 * serialization.
I'm not sure how to teach the verifier to enforce that.. yet...

As an alternative we can interpret bpf timer callback return value
inside bpf_timer_cb() kernel function as:
0 - return HRTIMER_NORESTART
N - hrtimer_forward_now(,N); return HRTIMER_RESTART.

but the same can be achieved by calling bpf_timer_start()
from the bpf prog. The speed of re-arming is roughly the same
in both cases.
Providing the same functionality in two different ways doesn't seem
necessary.

I couldn't come up with other ways to use the return value
and currently thinking to allow 0 only for now.
Other ideas?
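For comparison, the bpf_timer_start() based re-arming mentioned above is just a
few lines in the prog itself (a sketch against the callback signature and
helpers in this patch):

  struct map_elem {
      int counter;
      struct bpf_timer timer;
  };

  /* periodic callback: do the work, then re-arm itself,
   * instead of returning a HRTIMER_RESTART-style value
   */
  static int timer_cb(void *map, int *key, struct map_elem *val)
  {
      val->counter++;
      /* flush / sample / etc. */
      bpf_timer_start(&val->timer, 100000000 /* re-fire in 100 msec */);
      return 0;
  }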

> > +       spin_lock_irqsave(&t->lock, flags);
> > +       if (!hrtimer_is_queued(timer))
> > +               bpf_prog_put(t->prog);
> > +       spin_unlock_irqrestore(&t->lock, flags);
> > +       this_cpu_write(hrtimer_running, NULL);
> 
> Don't know if it's a problem. Above you say that timer_cb doesn't
> migrate, but can it be preempted? 

My understanding is that no, it cannot be. Even in RT.

> If yes, and timer_cb is called in
> the meantime for another timer, setting hrtimer_running to NULL will
> clobber the previous value, right? So no nesting is possible. Is this
> a problem?

Based on my current understanding of the hrtimer implementation we're safe here.

> Also is there a chance for timer callback to be a sleepable BPF (sub-)program?

Eventually yes. The whole bpf program can be sleepable, but the
timer callback cannot be. So the verifier would need to
treat different functions of the bpf prog as sleepable and non-sleepable.
Easy enough to implement. Eventually.

> What if we add a field to struct bpf_hrtimer that will be inc/dec to
> show whether it's active or not? That should bypass per-CPU
> assumptions, but I haven't thought through races, worst case we might
> need to take t->lock.

You mean get rid of the per-cpu hrtimer_running and replace it
with a bool flag inside bpf_hrtimer, like is_callback_fn_running?
That won't work as-is, since bpf_timer_cancel() will sooner or later
be called while another cpu is inside the callback.
Something like:
bpf_timer_cb()
{
  bpf_hrtimer->running_on_cpu = smp_processor_id();
  BPF_CAST_CALL(t->callback_fn)
  bpf_hrtimer->running_on_cpu = -1;
}

bpf_timer_cancel()
{ 
  if (bpf_hrtimer->running_on_cpu == smp_processor_id())
    return -EBUSY;
}

will work, but it potentially wastes more memory than a single
per-cpu hrtimer_running if there are millions of timers,
and seems less clean.

> > +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> 
> Not entirely sure, but it feels like adding flags would be good here as well?

Same here. Not entirely sure, but I think it's good without it.
Here is why:
hrtimer_start() has a 3rd argument which allows some flexibility
to change the mode of the timer after hrtimer_init().
But I think that's an overly flexible part of the hrtimer api.
io_uring, for example, exposes hrtimers, but remembers the mode specified
during the hrtimer_init() phase and uses it during hrtimer_start().
I think eventually bpf_timer_init() might support
not only clock_monotonic and relative expiry, but other
features of hrtimer as well, but I think it's better to
follow what io_uring did and remember the mode during bpf_timer_init()
and use the same mode in bpf_timer_start().
Because of that there is nothing extra to pass into hrtimer_start()
and hence no use for a 3rd argument in bpf_timer_start().

The flags argument in bpf_timer_init() will eventually
be able to specify monotonic vs boottime and
relative vs absolute expiry.

> > +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> > +{
> > +       struct bpf_hrtimer *t;
> > +       unsigned long flags;
> > +
> > +       t = READ_ONCE(timer->timer);
> > +       if (!t)
> > +               return -EINVAL;
> > +       if (this_cpu_read(hrtimer_running) == t)
> > +               /* If bpf callback_fn is trying to bpf_timer_cancel()
> > +                * its own timer the hrtimer_cancel() will deadlock
> > +                * since it waits for callback_fn to finish
> > +                */
> > +               return -EBUSY;
> > +       spin_lock_irqsave(&t->lock, flags);
> > +       /* Cancel the timer and wait for associated callback to finish
> > +        * if it was running.
> > +        */
> > +       if (hrtimer_cancel(&t->timer) == 1)
> > +               /* If the timer was active then drop the prog refcnt,
> > +                * since callback will not be invoked.
> > +                */
> 
> So the fact whether the timer was cancelled or it's active/already
> fired seems useful to know in BPF program (sometimes). I can't recall
> an exact example, but in the past dealing with some timers (in
> user-space, but the point stands) I remember it was important to know
> this, so maybe we can communicate that as 0 or 1 returned from
> bpf_timer_cancel?

Good idea!

> > +void bpf_timer_cancel_and_free(void *val)
> > +{
> > +       struct bpf_timer_kern *timer = val;
> > +       struct bpf_hrtimer *t;
> > +
> > +       t = READ_ONCE(timer->timer);
> > +       if (!t)
> > +               return;
> > +       /* Cancel the timer and wait for callback to complete
> > +        * if it was running
> > +        */
> > +       if (hrtimer_cancel(&t->timer) == 1)
> > +               bpf_prog_put(t->prog);
> > +       kfree(t);
> > +       WRITE_ONCE(timer->timer, NULL);
> 
> this seems to race with bpf_timer_start, no? Doing WRITE_ONCE and then
> kfree() timer would be a bit safer (we won't have dangling pointer at
> any point in time), but I think that still is racy, because
> bpf_start_timer can read timer->timer before WRITE_ONCE(NULL) here,
> then we kfree(t), and then bpf_timer_start() proceeds to take t->lock
> which might explode or might do whatever.

This race is not possible with bpf_timer inside an array
or inside a non-prealloc htab.
bpf_timer_cancel_and_free() is called when an element is being
deleted from the htab or when the whole array/htab is destroyed.
Once an element is deleted the bpf prog cannot look it up.
Hence it cannot reach the bpf_timer pointer and call bpf_timer_start() on it.
In the case of a preallocated htab there is a race.
The bpf prog can do a lookup and then delete an element
while still using the earlier value pointer. Since all elements
are preallocated, the element could be reused for another value at that time.
I need to think more about ways to address it.
Currently thinking to do cmpxchg in bpf_timer_start() and
bpf_timer_cancel*() similar to bpf_timer_init() to address it.
Kinda sucks to use atomic ops to address a race caused by a deliberately
malicious prog. bpf prog writers cannot just stumble on such a race.
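Roughly like this (untested sketch; BPF_TIMER_BUSY is a made-up
sentinel pointer value, error codes to be decided):

    /* in bpf_timer_start()/bpf_timer_cancel() */
    t = READ_ONCE(timer->timer);
    if (!t || cmpxchg(&timer->timer, t, BPF_TIMER_BUSY) != t)
        return -EINVAL; /* freed or busy on another cpu */
    /* safe to use t here, cancel_and_free cannot kfree it */
    cmpxchg(&timer->timer, BPF_TIMER_BUSY, t);

and bpf_timer_cancel_and_free() would kfree(t) only after successfully
cmpxchg()ing the pointer from t to NULL.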

> A small nit, you added bpf_timer_cancel_and_free() prototype in the
> next patch, but it should probably be done in this patch to avoid
> unnecessary compiler warnings during bisecting.

There is no warning in the default build.
It's just an unused (yet) global function. That's why I sent
the patch this way, but since you asked I tried a few other builds
and found that "make W=1" indeed warns.
I'll move the proto to silence it.

> > +               if (insn->imm == BPF_FUNC_timer_init) {
> > +
> 
> nit: why empty line here?

no clue. will fix.

Thank you for the review!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-02 22:21       ` Toke Høiland-Jørgensen
@ 2021-06-03  1:58         ` Alexei Starovoitov
  2021-06-03 10:59           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2021-06-03  1:58 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: davem, daniel, andrii, netdev, bpf, kernel-team

On Thu, Jun 03, 2021 at 12:21:05AM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, May 27, 2021 at 06:57:17PM +0200, Toke Høiland-Jørgensen wrote:
> >> >     if (val) {
> >> >         bpf_timer_init(&val->timer, timer_cb2, 0);
> >> >         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
> >> 
> >> nit: there are 1M nanoseconds in a millisecond :)
> >
> > oops :)
> >
> >> >     }
> >> > }
> >> >
> >> > This patch adds helper implementations that rely on hrtimers
> >> > to call bpf functions as timers expire.
> >> > The following patch adds necessary safety checks.
> >> >
> >> > Only programs with CAP_BPF are allowed to use bpf_timer.
> >> >
> >> > The amount of timers used by the program is constrained by
> >> > the memcg recorded at map creation time.
> >> >
> >> > The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> >> > supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> >> > program to make sure that program doesn't get freed while timer is armed.
> >> >
> >> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >> 
> >> Overall this LGTM, and I believe it will be usable for my intended use
> >> case. One question:
> >> 
> >> With this, it will basically be possible to create a BPF daemon, won't
> >> it? I.e., if a program includes a timer and the callback keeps re-arming
> >> itself this will continue indefinitely even if userspace closes all refs
> >> to maps and programs? Not saying this is a problem, just wanted to check
> >> my understanding (i.e., that there's not some hidden requirement on
> >> userspace keeping a ref open that I'm missing)...
> >
> > That is correct.
> > Another option would be to auto-cancel the timer when the last reference
> > to the prog drops. That may feel safer, since forever
> > running bpf daemon is a certainly new concept.
> > The main benefits of doing prog_refcnt++ from bpf_timer_start are ease
> > of use and no surprises.
> > Disappearing timer callback when prog unloads is outside of bpf prog control.
> > For example the tracing bpf prog might collect some data and periodically
> > flush it to user space. The prog would arm the timer and when callback
> > is invoked it would send the data via ring buffer and start another
> > data collection cycle.
> > When the user space part of the service exits it doesn't have
> > an ability to enforce the flush of the last chunk of data.
> > It could do prog_run cmd that will call the timer callback,
> > but it's racy.
> > The solution to this problem could be __init and __fini
> > sections that will be invoked when the prog is loaded
> > and when the last refcnt drops.
> > It's a complementary feature though.
> > The prog_refcnt++ from bpf_timer_start combined with a prog
> > explicitly doing bpf_timer_cancel from __fini
> > would be the most flexible combination.
> > This way the prog can choose to be a daemon or it can choose
> > to cancel its timers and do data flushing when the last prog
> > reference drops.
> > The prog refcnt would be split (similar to uref). The __fini callback
> > will be invoked when refcnt reaches zero, but all increments
> > done by bpf_timer_start will be counted separately.
> > The user space wouldn't need to do the prog_run command.
> > It would detach the prog and close(prog_fd).
> > That will trigger __fini callback that will cancel the timers
> > and the prog will be fully unloaded.
> > That would make bpf progs resemble kernel modules even more.
> 
> I like the idea of a "destructor" that will trigger on refcnt drop to
> zero. And I do think a "bpf daemon" is potentially a useful, if novel,
> concept.

I think so too. Long ago folks requested periodic bpf progs to
do sampling in tracing. All these years attaching a bpf prog
to a perf_event has been a workaround for that feature request.
A perf_event bpf prog can be pinned in a perf_event array,
so a "bpf daemon" kinda exists today. Just more convoluted.
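The self-rearming pattern from the prog side is simply (sketch;
struct/field names are made up for illustration):

static int flush_cb(void *map, int *key, struct elem *val)
{
    /* push the collected batch to user space via a ringbuf, then re-arm */
    bpf_timer_start(&val->timer, val->period_ns);
    return 0;
}

where struct elem holds a 'struct bpf_timer timer' plus a '__u64
period_ns', and the prog arms the timer once from its attach point.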

> The __fini thing kinda supposes a well-behaved program, though, right?
> I.e., it would be fairly trivial to write a program that spins forever
> by repeatedly scheduling the timer with a very short interval (whether
> by malice or bugginess).

It's already possible without bpf_timer.

> So do we need a 'bpfkill' type utility to nuke
> buggy programs, or how would resource constraints be enforced?

That is possible without 'bpfkill'.
bpftool can delete the map element that contains the bpf_timer and
that will cancel it. I'll add tests to make sure that's the case.
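From user space that is just an ordinary element delete, e.g. with
libbpf (sketch; map_fd/key are whatever identifies the element):

    /* deleting the element cancels and frees its embedded timer */
    int key = 0;

    bpf_map_delete_elem(map_fd, &key);

bpftool map delete goes through the same syscall.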

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer.
  2021-06-02 22:34   ` Andrii Nakryiko
  2021-06-02 22:38     ` Andrii Nakryiko
@ 2021-06-03  2:04     ` Alexei Starovoitov
  2021-06-03 17:35       ` Andrii Nakryiko
  1 sibling, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2021-06-03  2:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, Jun 02, 2021 at 03:34:29PM -0700, Andrii Nakryiko wrote:
> 
> >  /* copy everything but bpf_spin_lock */
> >  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> >  {
> > +       u32 off = 0, size = 0;
> > +
> >         if (unlikely(map_value_has_spin_lock(map))) {
> > -               u32 off = map->spin_lock_off;
> > +               off = map->spin_lock_off;
> > +               size = sizeof(struct bpf_spin_lock);
> > +       } else if (unlikely(map_value_has_timer(map))) {
> > +               off = map->timer_off;
> > +               size = sizeof(struct bpf_timer);
> > +       }
> 
> so the need to handle 0, 1, or 2 gaps seems to be the only reason to
> disallow both bpf_spinlock and bpf_timer in one map element, right?

exactly.

> Isn't it worth addressing it from the very beginning to lift the
> artificial restriction? E.g., for speed, you'd do:
> 
> if (likely(neither spinlock nor timer)) {
>  /* fastest pass */
> } else if (only one of spinlock or timer) {
>   /* do what you do here */
> } else {
>   int off1, off2, sz1, sz2;
> 
>   if (spinlock_off < timer_off) {
>     off1 = spinlock_off;
>     sz1 = spinlock_sz;
>     off2 = timer_off;
>     sz2 = timer_sz;
>   } else {
>     ... you get the idea

Not really :)
Are you suggesting to support one bpf_spin_lock and one
bpf_timer inside a single map element, but not two spin_locks
and/or two bpf_timers?
To me it's either support one or support any.
Anything in-between doesn't seem worth the extra code.

> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index f386f85aee5c..0a828dc4968e 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >                         return -EACCES;
> >                 }
> >         }
> > +       if (map_value_has_timer(map)) {
> > +               u32 t = map->timer_off;
> > +
> > +               if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
> 
> <= ? Otherwise we allow accessing the first byte, unless I'm mistaken.

I don't think so. See the comment above in the if (map_value_has_spin_lock(map))
branch. I didn't copy-paste it, because it's the same logic.

> > -       if (val) {
> > -               /* todo: relax this requirement */
> > -               verbose(env, "bpf_timer field can only be first in the map value element\n");
> 
> ok, this was confusing, but now I see why you did that...

I'll clarify the comment to say that the next patch fixes it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-03  1:58         ` Alexei Starovoitov
@ 2021-06-03 10:59           ` Toke Høiland-Jørgensen
  2021-06-03 17:21             ` Andrii Nakryiko
  0 siblings, 1 reply; 20+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-06-03 10:59 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, daniel, andrii, netdev, bpf, kernel-team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Jun 03, 2021 at 12:21:05AM +0200, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> 
>> > On Thu, May 27, 2021 at 06:57:17PM +0200, Toke Høiland-Jørgensen wrote:
>> >> >     if (val) {
>> >> >         bpf_timer_init(&val->timer, timer_cb2, 0);
>> >> >         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
>> >> 
>> >> nit: there are 1M nanoseconds in a millisecond :)
>> >
>> > oops :)
>> >
>> >> >     }
>> >> > }
>> >> >
>> >> > This patch adds helper implementations that rely on hrtimers
>> >> > to call bpf functions as timers expire.
>> >> > The following patch adds necessary safety checks.
>> >> >
>> >> > Only programs with CAP_BPF are allowed to use bpf_timer.
>> >> >
>> >> > The amount of timers used by the program is constrained by
>> >> > the memcg recorded at map creation time.
>> >> >
>> >> > The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
>> >> > supplied by the verifier. The prog pointer is needed to do refcnting of bpf
>> >> > program to make sure that program doesn't get freed while timer is armed.
>> >> >
>> >> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> >> 
>> >> Overall this LGTM, and I believe it will be usable for my intended use
>> >> case. One question:
>> >> 
>> >> With this, it will basically be possible to create a BPF daemon, won't
>> >> it? I.e., if a program includes a timer and the callback keeps re-arming
>> >> itself this will continue indefinitely even if userspace closes all refs
>> >> to maps and programs? Not saying this is a problem, just wanted to check
>> >> my understanding (i.e., that there's not some hidden requirement on
>> >> userspace keeping a ref open that I'm missing)...
>> >
>> > That is correct.
>> > Another option would be to auto-cancel the timer when the last reference
>> > to the prog drops. That may feel safer, since forever
>> > running bpf daemon is a certainly new concept.
>> > The main benefits of doing prog_refcnt++ from bpf_timer_start are ease
>> > of use and no surprises.
>> > Disappearing timer callback when prog unloads is outside of bpf prog control.
>> > For example the tracing bpf prog might collect some data and periodically
>> > flush it to user space. The prog would arm the timer and when callback
>> > is invoked it would send the data via ring buffer and start another
>> > data collection cycle.
>> > When the user space part of the service exits it doesn't have
>> > an ability to enforce the flush of the last chunk of data.
>> > It could do prog_run cmd that will call the timer callback,
>> > but it's racy.
>> > The solution to this problem could be __init and __fini
>> > sections that will be invoked when the prog is loaded
>> > and when the last refcnt drops.
>> > It's a complementary feature though.
>> > The prog_refcnt++ from bpf_timer_start combined with a prog
>> > explicitly doing bpf_timer_cancel from __fini
>> > would be the most flexible combination.
>> > This way the prog can choose to be a daemon or it can choose
>> > to cancel its timers and do data flushing when the last prog
>> > reference drops.
>> > The prog refcnt would be split (similar to uref). The __fini callback
>> > will be invoked when refcnt reaches zero, but all increments
>> > done by bpf_timer_start will be counted separately.
>> > The user space wouldn't need to do the prog_run command.
>> > It would detach the prog and close(prog_fd).
>> > That will trigger __fini callback that will cancel the timers
>> > and the prog will be fully unloaded.
>> > That would make bpf progs resemble kernel modules even more.
>> 
>> I like the idea of a "destructor" that will trigger on refcnt drop to
>> zero. And I do think a "bpf daemon" is potentially a useful, if novel,
>> concept.
>
> I think so too. Long ago folks requested periodic bpf progs to
> do sampling in tracing. All these years attaching bpf prog
> to a perf_event was a workaround for such feature request.
> perf_event bpf prog can be pinned in perf_event array,
> so "bpf daemon" kinda exist today. Just more convoluted.

Right, agreed - triggering periodic sampling directly from BPF does seem
like the right direction.

>> The __fini thing kinda supposes a well-behaved program, though, right?
>> I.e., it would be fairly trivial to write a program that spins forever
>> by repeatedly scheduling the timer with a very short interval (whether
>> by malice or bugginess).
>
> It's already possible without bpf_timer.

Hmm, fair point.

>> So do we need a 'bpfkill' type utility to nuke
>> buggy programs, or how would resource constraints be enforced?
>
> That is possible without 'bpfkill'.
> bpftool can delete map element that contains bpf_timer and
> that will cancel it. I'll add tests to make sure it's the case.

Ah, right, of course! Thanks, LGTM then :)

-Toke


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-03  1:53     ` Alexei Starovoitov
@ 2021-06-03 17:10       ` Andrii Nakryiko
  2021-06-04  1:12         ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-03 17:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, Jun 2, 2021 at 6:53 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 02, 2021 at 03:08:08PM -0700, Andrii Nakryiko wrote:
> > On Wed, May 26, 2021 at 9:03 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Introduce 'struct bpf_timer { __u64 :64; };' that can be embedded
> > > in hash/array/lru maps as regular field and helpers to operate on it:
> > > long bpf_timer_init(struct bpf_timer *timer, void *callback, int flags);
> > > long bpf_timer_start(struct bpf_timer *timer, u64 nsecs);
> > > long bpf_timer_cancel(struct bpf_timer *timer);
> > >
> > > Here is how BPF program might look like:
> > > struct map_elem {
> > >     int counter;
> > >     struct bpf_timer timer;
> > > };
> > >
> > > struct {
> > >     __uint(type, BPF_MAP_TYPE_HASH);
> > >     __uint(max_entries, 1000);
> > >     __type(key, int);
> > >     __type(value, struct map_elem);
> > > } hmap SEC(".maps");
> > >
> > > struct bpf_timer global_timer;
> >
> > Using bpf_timer as a global variable has at least two problems. We
> > discussed one offline but I realized another one reading the code in
> > this patch:
> >   1. this memory can and is memory-mapped as read-write, so user-space
> > can just write over this (intentionally or accidentally), so it's
> > quite unsafe
>
> yep.
>
> >   2. with current restriction of having offset 0 for struct bpf_timer,
> > you have to use global variable for it, because clang will reorder
> > static variables after global variables.
>
> that is addressed in 2nd patch.
>
> > > + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> > > + *     Description
> > > + *             Initialize the timer to call given static function.
> > > + *     Return
> > > + *             zero
> >
> > -EBUSY is probably the most important to mention here, but generally
> > the way it's described right now it seems like it can't fail, which is
> > not true. Similar for bpf_timer_start() and bpf_timer_cancel().
>
> good point.
>
> > > + *
> > > + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> > > + *     Description
> > > + *             Start the timer and set its expiration N nanoseconds from
> > > + *             the current time.
> >
> > The case of nsecs == 0 is a bit special and interesting, it's useful
> > to explain what will happen in that case. I'm actually curious as
> > well, in the code you say "call ASAP", but does it mean after the BPF
> > program exits? Or can it start immediately on another CPU? Or will it
> > interrupt the currently running BPF program to run the callback
> > (unlikely, but that might be someone's expectation).
>
> nsecs == 0 is not special. nsecs is a relative expiry time.
> 1 nanosecond is not much different from zero :)
> Most likely this timer will be the first one to run once it's added
> to per-cpu rb-tree.
>
> I think going too much into implementation details in the helper
> description is unnecessary.
> " Start the timer and set its expiration N nanoseconds from
>   the current time. "
> is probably about right amount of details.
> I can add that the time clock is monotonic
> and callback is called in softirq.

I think mentioning whether it's going to be run on the same CPU or
another CPU is the most important part. I'm honestly still not sure
which one is the case, because I don't completely understand all the
implications of what "called in softirq" implies.

>
> > > +static enum hrtimer_restart timer_cb(struct hrtimer *timer)
> >
> > nit: can you please call it bpf_timer_cb, so it might be possible to
> > trace it a bit easier due to bpf_ prefix?
>
> sure.
>
> > > +{
> > > +       struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> > > +       unsigned long flags;
> > > +       int ret;
> > > +
> > > +       /* timer_cb() runs in hrtimer_run_softirq and doesn't migrate.
> > > +        * Remember the timer this callback is servicing to prevent
> > > +        * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> > > +        */
> > > +       this_cpu_write(hrtimer_running, t);
> > > +       ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)t->map,
> > > +                                           (u64)(long)t->key,
> > > +                                           (u64)(long)t->value, 0, 0);
> > > +       WARN_ON(ret != 0); /* todo: define 0 vs 1 or disallow 1 in the verifier */
> >
> > if we define 0 vs 1, what would their meaning be?
>
> How would you define it?
> The hrtimer has support for HRTIMER_NORESTART vs HRTIMER_RESTART.
> This feature of hrtimer is already exposed into user space via timerfd,
> so no concerns there.
> But to use HRTIMER_RESTART in a meaningful way the bpf prog
> would need to be able to call to hrtimer_forward_now() to set
> the new expiry.
> That function has an interesting caution comment in hrtimer.h:
>  * Can be safely called from the callback function of @timer. If
>  * called from other contexts @timer must neither be enqueued nor
>  * running the callback and the caller needs to take care of
>  * serialization.
> I'm not sure how to teach the verifier to enforce that.. yet...
>
> As an alternative we can interpret bpf timer callback return value
> inside bpf_timer_cb() kernel function as:
> 0 - return HRTIMER_NORESTART
> N - hrtimer_forward_now(,N); return HRTIMER_RESTART.
>
> but the same can be achieved by calling bpf_timer_start()
> from the bpf prog. The speed of re-arming is roughly the same
> in both cases.
> Doing the same functionality two different ways doesn't seem
> necessary.
>
> I couldn't come up with other ways to use the return value
> and currently thinking to allow 0 only for now.
> Other ideas?

Yeah, I was thinking of enforcing 0 always, unless we have a strong case
for HRTIMER_RESTART. But if you can achieve the same with
bpf_timer_start() from the callback, I don't think that's necessary.
So I'd force 0.

>
> > > +       spin_lock_irqsave(&t->lock, flags);
> > > +       if (!hrtimer_is_queued(timer))
> > > +               bpf_prog_put(t->prog);
> > > +       spin_unlock_irqrestore(&t->lock, flags);
> > > +       this_cpu_write(hrtimer_running, NULL);
> >
> > Don't know if it's a problem. Above you say that timer_cb doesn't
> > migrate, but can it be preempted?
>
> My understanding that no, it cannot be. Even in RT.
>
> > If yes, and timer_cb is called in
> > the meantime for another timer, setting hrtimer_running to NULL will
> > clobber the previous value, right? So no nesting is possible. Is this
> > a problem?
>
> Based on my current understanding of hrtimer implemention we're safe here.
>
> > Also is there a chance for timer callback to be a sleepable BPF (sub-)program?
>
> Eventually yes. The whole bpf program can be sleepable, but the
> timer callback cannot be. So the verifier would need to
> treat different functions of the bpf prog as sleepable and non-sleepable.
> Easy enough to implement. Eventually.

Ok, so non-sleepable callback is hrtimer's implementation restriction
due to softirq, right? Too bad, of course, I can imagine sleepable
callbacks being useful, but it's not a deal breaker.

>
> > What if we add a field to struct bpf_hrtimer that will be inc/dec to
> > show whether it's active or not? That should bypass per-CPU
> > assumptions, but I haven't thought through races, worst case we might
> > need to take t->lock.
>
> You mean to get rid of per-cpu hrtimer_running and replace
> with bool flag inside bpf_hrtimer. Like is_callback_fn_running ?
> That won't work as-is, since bpf_timer_cancel will be called
> sooner or later when another cpu is inside the callback.
> Something like:
> bpf_timer_cb()
> {
>   bpf_hrtimer->running_on_cpu = smp_processor_id();
>   BPF_CAST_CALL(t->callback_fn)
>   bpf_hrtimer->running_on_cpu = -1;
> }
>
> bpf_timer_cancel()
> {
>   if (bpf_hrtimer->running_on_cpu == smp_processor_id())
>     return -EBUSY;
> }
>
> will work, but it's potentially wasting more memory
> if there are millions of timers
> than single per-cpu hrtimer_running and seems less clean.

I disagree about it being less clean, but given there is no way we can
have sleepable timer callbacks it doesn't matter. Per-cpu will work.

>
> > > +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> >
> > Not entirely sure, but it feels like adding flags would be good here as well?
>
> Same here. Not entirely sure, but I think it's good without it.
> Here is why:
> hrtimer_start() has 3rd argument which allows some flexibility
> to change the mode of the timer after hrtimer_init.
> But I think it's too flexible part of hrtimer api.
> Like io_uring is exposing hrtimers, but remembers the mode specified
> during hrtimer_init phase and uses it during hrtimer_start.
> I think eventually the bpf_hrtimer_init() might support
> not only clock_monotonic and relative expiry, but other
> features of hrtimer as well, but I think it's better to
> follow what io_uring did and remember the mode during bpf_hrtimer_init()
> and use the same mode in bpf_hrtimer_start().
> Because of that there is nothing extra to pass into hrtimer_start()
> and hence no use for 3rd argument in bpf_timer_start.
>
> The flags argument in bpf_timer_init() will eventually
> be able to specify monotonic vs boottime and
> relative vs absolute expiry.

Yeah, I was thinking about the relative vs absolute expiry case, but we
never know what kind of extensibility we'll need, so there might be
something that we don't see as a possibility yet. Seems simple enough
to reserve a flags argument here (we usually do not regret adding a flags
argument for extensibility), but I'm fine either way.

>
> > > +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> > > +{
> > > +       struct bpf_hrtimer *t;
> > > +       unsigned long flags;
> > > +
> > > +       t = READ_ONCE(timer->timer);
> > > +       if (!t)
> > > +               return -EINVAL;
> > > +       if (this_cpu_read(hrtimer_running) == t)
> > > +               /* If bpf callback_fn is trying to bpf_timer_cancel()
> > > +                * its own timer the hrtimer_cancel() will deadlock
> > > +                * since it waits for callback_fn to finish
> > > +                */
> > > +               return -EBUSY;
> > > +       spin_lock_irqsave(&t->lock, flags);
> > > +       /* Cancel the timer and wait for associated callback to finish
> > > +        * if it was running.
> > > +        */
> > > +       if (hrtimer_cancel(&t->timer) == 1)
> > > +               /* If the timer was active then drop the prog refcnt,
> > > +                * since callback will not be invoked.
> > > +                */
> >
> > So the fact whether the timer was cancelled or it's active/already
> > fired seems useful to know in BPF program (sometimes). I can't recall
> > an exact example, but in the past dealing with some timers (in
> > user-space, but the point stands) I remember it was important to know
> > this, so maybe we can communicate that as 0 or 1 returned from
> > bpf_timer_cancel?
>
> Good idea!
>
> > > +void bpf_timer_cancel_and_free(void *val)
> > > +{
> > > +       struct bpf_timer_kern *timer = val;
> > > +       struct bpf_hrtimer *t;
> > > +
> > > +       t = READ_ONCE(timer->timer);
> > > +       if (!t)
> > > +               return;
> > > +       /* Cancel the timer and wait for callback to complete
> > > +        * if it was running
> > > +        */
> > > +       if (hrtimer_cancel(&t->timer) == 1)
> > > +               bpf_prog_put(t->prog);
> > > +       kfree(t);
> > > +       WRITE_ONCE(timer->timer, NULL);
> >
> > this seems to race with bpf_timer_start, no? Doing WRITE_ONCE and then
> > kfree() timer would be a bit safer (we won't have dangling pointer at
> > any point in time), but I think that still is racy, because
> > bpf_start_timer can read timer->timer before WRITE_ONCE(NULL) here,
> > then we kfree(t), and then bpf_timer_start() proceeds to take t->lock
> > which might explode or might do whatever.
>
> This race is not possible with bpf_timer inside array
> and inside non-prealloc htab.
> The bpf_timer_cancel_and_free() is called when element being
> deleted in htab or the whole array/htab is destroyed.
> When element is deleted the bpf prog cannot look it up.
> Hence it cannot reach bpf_timer pointer and call bpf_timer_start() on it.
> In case of preallocated htab there is race.
> The bpf prog can do a lookup then delete an element
> while still using earlier value pointer. Since all elements
> are preallocated the elem could be reused for another value at that time.
> I need to think more about ways to address it.

Yep, this is the case I was suspicious about. We can have multiple BPF
programs (or the same program on multiple CPUs) working with the same
value pointer while some other CPU is attempting to delete it.

> Currently thinking to do cmpxchg in bpf_timer_start() and
> bpf_timer_cancel*() similar to bpf_timer_init() to address it.
> Kinda sucks to use atomic ops to address a race by deliberately
> malicious prog. bpf prog writers cannot just stumble on such race.

Why deliberately malicious? Just a sufficiently sloppy or even just a
clever concurrent BPF program. I suspect BPF map iterators can make it
even more probable, no?

So your idea is to cmpxchg() to NULL while bpf_timer_start() or
bpf_timer_cancel() works with the timer? Wouldn't that cause
bpf_timer_init() to believe that the timer is not yet initialized and
not return -EBUSY? Granted that's a corner-case race, but still.

What if the spinlock was moved out of struct bpf_hrtimer into struct
bpf_timer instead? Then all that cmpxchg and READ_ONCE/WRITE_ONCE
could go away and everything would be more "obviously correct" and
race-free. We'd just need to make sure that the size of that spinlock
struct doesn't change depending on the kernel config (so no diagnostics
should be embedded).
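Something like this is what I have in mind (sketch; it would mean
growing the uapi struct bpf_timer to make room for the lock):

struct bpf_timer {
    __u64 :64;
    __u64 :64;
};

/* kernel-side view of the same bytes */
struct bpf_timer_kern {
    struct bpf_hrtimer *timer;
    struct bpf_spin_lock lock;
} __attribute__((aligned(8)));

so the lock is always valid memory inside the map value, even after the
bpf_hrtimer allocation has been freed.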

>
> > A small nit, you added bpf_timer_cancel_and_free() prototype in the
> > next patch, but it should probably be done in this patch to avoid
> > unnecessary compiler warnings during bisecting.
>
> There is no warning in default build.
> It's just an unused (yet) global function. That's why I've sent
> the patch this way, but since you asked I've tried few other ways
> and found that "make W=1" indeed warns.
> I'll move the proto to shut it up.

Yeah, I didn't see it, I just remember from experience running into this
warning sometimes.

>
> > > +               if (insn->imm == BPF_FUNC_timer_init) {
> > > +
> >
> > nit: why empty line here?
>
> no clue. will fix.
>
> Thank you for the review!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-03 10:59           ` Toke Høiland-Jørgensen
@ 2021-06-03 17:21             ` Andrii Nakryiko
  0 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-03 17:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Andrii Nakryiko, Networking, bpf, Kernel Team

On Thu, Jun 3, 2021 at 3:59 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Thu, Jun 03, 2021 at 12:21:05AM +0200, Toke Høiland-Jørgensen wrote:
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >>
> >> > On Thu, May 27, 2021 at 06:57:17PM +0200, Toke Høiland-Jørgensen wrote:
> >> >> >     if (val) {
> >> >> >         bpf_timer_init(&val->timer, timer_cb2, 0);
> >> >> >         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 msec */);
> >> >>
> >> >> nit: there are 1M nanoseconds in a millisecond :)
> >> >
> >> > oops :)
> >> >
> >> >> >     }
> >> >> > }
> >> >> >
> >> >> > This patch adds helper implementations that rely on hrtimers
> >> >> > to call bpf functions as timers expire.
> >> >> > The following patch adds necessary safety checks.
> >> >> >
> >> >> > Only programs with CAP_BPF are allowed to use bpf_timer.
> >> >> >
> >> >> > The amount of timers used by the program is constrained by
> >> >> > the memcg recorded at map creation time.
> >> >> >
> >> >> > The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> >> >> > supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> >> >> > program to make sure that program doesn't get freed while timer is armed.
> >> >> >
> >> >> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >> >>
> >> >> Overall this LGTM, and I believe it will be usable for my intended use
> >> >> case. One question:
> >> >>
> >> >> With this, it will basically be possible to create a BPF daemon, won't
> >> >> it? I.e., if a program includes a timer and the callback keeps re-arming
> >> >> itself this will continue indefinitely even if userspace closes all refs
> >> >> to maps and programs? Not saying this is a problem, just wanted to check
> >> >> my understanding (i.e., that there's not some hidden requirement on
> >> >> userspace keeping a ref open that I'm missing)...
> >> >
> >> > That is correct.
> >> > Another option would be to auto-cancel the timer when the last reference
> >> > to the prog drops. That may feel safer, since forever
> >> > running bpf daemon is a certainly new concept.
> >> > The main benefits of doing prog_refcnt++ from bpf_timer_start are ease
> >> > of use and no surprises.
> >> > Disappearing timer callback when prog unloads is outside of bpf prog control.
> >> > For example the tracing bpf prog might collect some data and periodically
> >> > flush it to user space. The prog would arm the timer and when callback
> >> > is invoked it would send the data via ring buffer and start another
> >> > data collection cycle.
> >> > When the user space part of the service exits it doesn't have
> >> > an ability to enforce the flush of the last chunk of data.
> >> > It could do prog_run cmd that will call the timer callback,
> >> > but it's racy.
> >> > The solution to this problem could be __init and __fini
> >> > sections that will be invoked when the prog is loaded
> >> > and when the last refcnt drops.
> >> > It's a complementary feature though.
> >> > The prog_refcnt++ from bpf_timer_start combined with a prog
> >> > explicitly doing bpf_timer_cancel from __fini
> >> > would be the most flexible combination.
> >> > This way the prog can choose to be a daemon or it can choose
> >> > to cancel its timers and do data flushing when the last prog
> >> > reference drops.
> >> > The prog refcnt would be split (similar to uref). The __fini callback
> >> > will be invoked when refcnt reaches zero, but all increments
> >> > done by bpf_timer_start will be counted separately.
> >> > The user space wouldn't need to do the prog_run command.
> >> > It would detach the prog and close(prog_fd).
> >> > That will trigger __fini callback that will cancel the timers
> >> > and the prog will be fully unloaded.
> >> > That would make bpf progs resemble kernel modules even more.
> >>
> >> I like the idea of a "destructor" that will trigger on refcnt drop to
> >> zero. And I do think a "bpf daemon" is potentially a useful, if novel,
> >> concept.
> >
> > I think so too. Long ago folks requested periodic bpf progs to
> > do sampling in tracing. All these years attaching bpf prog
> > to a perf_event was a workaround for such feature request.
> > perf_event bpf prog can be pinned in perf_event array,
> > so "bpf daemon" kinda exist today. Just more convoluted.
>
> Right, agreed - triggering periodic sampling directly from BPF does seem
> like the right direction.

This is one of the cases where knowing (or being able to specify) the
CPU to run on is very important, btw.

Which made me think about scheduling a timer on a specific CPU from
another CPU. Is that possible with hrtimer? If yes, should it be done
through bpf_timer_init() or bpf_timer_start()?

>
> >> The __fini thing kinda supposes a well-behaved program, though, right?
> >> I.e., it would be fairly trivial to write a program that spins forever
> >> by repeatedly scheduling the timer with a very short interval (whether
> >> by malice or bugginess).
> >
> > It's already possible without bpf_timer.
>
> Hmm, fair point.
>
> >> So do we need a 'bpfkill' type utility to nuke
> >> buggy programs, or how would resource constraints be enforced?
> >
> > That is possible without 'bpfkill'.
> > bpftool can delete map element that contains bpf_timer and
> > that will cancel it. I'll add tests to make sure it's the case.
>
> Ah, right, of course! Thanks, LGTM then :)
>
> -Toke
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer.
  2021-06-03  2:04     ` Alexei Starovoitov
@ 2021-06-03 17:35       ` Andrii Nakryiko
  0 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-03 17:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Wed, Jun 2, 2021 at 7:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 02, 2021 at 03:34:29PM -0700, Andrii Nakryiko wrote:
> >
> > >  /* copy everything but bpf_spin_lock */
> > >  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
> > >  {
> > > +       u32 off = 0, size = 0;
> > > +
> > >         if (unlikely(map_value_has_spin_lock(map))) {
> > > -               u32 off = map->spin_lock_off;
> > > +               off = map->spin_lock_off;
> > > +               size = sizeof(struct bpf_spin_lock);
> > > +       } else if (unlikely(map_value_has_timer(map))) {
> > > +               off = map->timer_off;
> > > +               size = sizeof(struct bpf_timer);
> > > +       }
> >
> > so the need to handle 0, 1, or 2 gaps seems to be the only reason to
> > disallow both bpf_spinlock and bpf_timer in one map element, right?
>
> exactly.
>
> > Isn't it worth addressing it from the very beginning to lift the
> > artificial restriction? E.g., for speed, you'd do:
> >
> > if (likely(neither spinlock nor timer)) {
> >  /* fastest pass */
> > } else if (only one of spinlock or timer) {
> >   /* do what you do here */
> > } else {
> >   int off1, off2, sz1, sz2;
> >
> >   if (spinlock_off < timer_off) {
> >     off1 = spinlock_off;
> >     sz1 = spinlock_sz;
> >     off2 = timer_off;
> >     sz2 = timer_sz;
> >   } else {
> >     ... you get the idea
>
> Not really :)

Hm, really? I meant that the else branch would be:

     off1 = timer_off;
     sz1 = timer_sz;
     off2 = spinlock_off;
     sz2 = spinlock_sz;

just making sure that off1 < off2 always and that sz1 and sz2 match.
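i.e. the copy itself becomes three memcpy()s (sketch):

    memcpy(dst, src, off1);
    memcpy(dst + off1 + sz1, src + off1 + sz1, off2 - off1 - sz1);
    memcpy(dst + off2 + sz2, src + off2 + sz2,
           map->value_size - off2 - sz2);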

> Are you suggesting to support one bpf_spin_lock and one
> bpf_timer inside single map element, but not two spin_locks
> and/or not two bpf_timers?

Yes, exactly. I see bpf_spinlock and bpf_timer as two independent
orthogonal features and I don't understand why we restrict a given map
element to using just one of them. I think those 20 lines of code
that decouple them and remove an artificial restriction that users need
to remember (or discover with surprise) are totally worth it.

> Two me it's either one or support any.

I think it's fine to start with supporting one. But one of each. They
are independent of each other.

> Anything in-between doesn't seem worth extra code.

Up to you, but I disagree, obviously. It's possible to work around
that limitation with extra maps/complexity, so if I ever need to both
lock an element and schedule a timer with it, it's not going to stop
me. :)

>
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index f386f85aee5c..0a828dc4968e 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -3241,6 +3241,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> > >                         return -EACCES;
> > >                 }
> > >         }
> > > +       if (map_value_has_timer(map)) {
> > > +               u32 t = map->timer_off;
> > > +
> > > +               if (reg->smin_value + off < t + sizeof(struct bpf_timer) &&
> >
> > <= ? Otherwise we allow accessing the first byte, unless I'm mistaken.
>
> I don't think so. See the comment above in if (map_value_has_spin_lock(map))
> I didn't copy-paste it, because it's the same logic.

Oh, I didn't realize that this is the interval intersection check I
suggested a long time ago :) yeah, that still looks correct

>
> > > -       if (val) {
> > > -               /* todo: relax this requirement */
> > > -               verbose(env, "bpf_timer field can only be first in the map value element\n");
> >
> > ok, this was confusing, but now I see why you did that...
>
> I'll clarify the comment to say that the next patch fixes it.

ok, thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-03 17:10       ` Andrii Nakryiko
@ 2021-06-04  1:12         ` Alexei Starovoitov
  2021-06-04  4:17           ` Andrii Nakryiko
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2021-06-04  1:12 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Thu, Jun 03, 2021 at 10:10:38AM -0700, Andrii Nakryiko wrote:
> >
> > I think going too much into implementation details in the helper
> > description is unnecessary.
> > " Start the timer and set its expiration N nanoseconds from
> >   the current time. "
> > is probably about right amount of details.
> > I can add that the time clock is monotonic
> > and callback is called in softirq.
> 
> I think mentioning whether it's going to be run on the same CPU or
> another CPU is the most important part. I'm honestly still not sure
> which one is the case, because I don't completely understand all the
> implications of what "called in softirq" implies.

"called in softirq" means that timer callback will be executing
in softirq context on some cpu. That's all.
The proposed API doesn't provide a way to fire a timer on a given cpu
or to pin it to a cpu.
There are a few places in the kernel that use the ABS_PINNED and
REL_PINNED variants of hrtimer.
One such example is the napi timer.
The initial cpu is picked during hrtimer_init and it's always the
current cpu. napi is bound to a cpu, so when it calls
hrtimer_init(.., _PINNED) it wants the callback to stay on the same cpu.
The hrtimer doc says that _PINNED is ignored during hrtimer_init :)
It is ignored, kinda, since the initial target cpu is picked as the
current one. Then during hrtimer_start the actual cpu is selected.
If it's called as hrtimer_start(.., _PINNED) then the cpu stays
as it was during hrtimer_init.
If hrtimer_start() is called without _PINNED the hrtimer algorithm can
migrate the timer to a more appropriate cpu depending on idle and no_hz.
See get_nohz_timer_target.
In the case of napi it's necessary to stay on the same cpu,
so it uses _PINNED in both hrtimer_init and hrtimer_start.
TCP uses pinned timers for compressed acks and pacing.
I'm guessing that's done to improve performance; I suspect TCP doesn't
strictly need the timers pinned.
All other _PINNED cases of hrtimer are similar to napi.
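For reference, the napi-style pinned usage is just (kernel-side sketch
from memory):

    hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
    ...
    hrtimer_start(&napi->timer, ns_to_ktime(timeout_ns),
                  HRTIMER_MODE_REL_PINNED);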

In the bpf world we don't have a way to deterministically
execute on a given cpu, and the hrtimer infra won't give us that
ability.

We could potentially hack it on top. Something like:
bpf_timer_init(..., int cpu, ...)
{
  smp_call_function_single(cpu, rest_of_bpf_timer_init)
}

rest_of_bpf_timer_init()
{
  hrtimer_init
}

But there are lots of things to consider with such an api.
It feels like two building blocks collapsed into one.
If we do anything like this we should expose smp_call_function_single()
directly as a bpf helper instead of as a side effect of bpf_timer.

> > > Also is there a chance for timer callback to be a sleepable BPF (sub-)program?
> >
> > Eventually yes. The whole bpf program can be sleepable, but the
> > timer callback cannot be. So the verifier would need to
> > treat different functions of the bpf prog as sleepable and non-sleepable.
> > Easy enough to implement. Eventually.
> 
> Ok, so non-sleepable callback is hrtimer's implementation restriction
> due to softirq, right? Too bad, of course, I can imagine sleepable
> callbacks being useful, but it's not a deal breaker.

Sleeping in softirq is a no-go. The timer callback will always be
non-sleepable. If it needed to do extra work that requires sleeping
we'd have to create bpf kthreads similar to io_uring worker threads.
The bpf program would have to ask for such things explicitly.

> > The flags argument in bpf_timer_init() will eventually
> > be able to specify monotonic vs boottime and
> > relative vs absolute expiry.
> 
> Yeah, I was thinking about relative vs absolute expiry case, but we
> never know what kind of extensibility we'll need, so there might be
> something that we don't see as a possibility yet. Seems simple enough
> to reserve flags argument here (we usually do not regret adding flags
> argument for extensibility), but I'm fine either way.

We cannot predict the future of bpf_timer, but the way it's going
so far it's a certainty that bpf_timer_start will be limited by
what hrtimer_start can do.
If we ever decide to use jiffy based timers as well, they will most
likely be represented as a new set of helpers.
Maybe both could hide behind the same 'struct bpf_timer',
but the helper names would have to differ so as not to confuse users.

> > Currently thinking to do cmpxchg in bpf_timer_start() and
> > bpf_timer_cancel*() similar to bpf_timer_init() to address it.
> > Kinda sucks to use atomic ops to address a race by deliberately
> > malicious prog. bpf prog writers cannot just stumble on such race.
> 
> Why deliberately malicious? Just sufficiently sloppy or even just a
> clever concurrent BPF program. 

Because hrtimer internals don't have protection for concurrent access.
The kernel apis assume that the same hrtimer is not concurrently
accessed on multiple cpus.
For example, doing hrtimer_init in parallel will certainly crash the kernel.
That's why bpf_timer_init() has the extra cmpxchg safety built in.
bpf_timer_start and bpf_timer_cancel don't have extra safety,
because the first thing hrtimer_start and hrtimer_cancel do
is grab the lock, so everything will be safe even if bpf
programs try to access the same timer in parallel.

The malicious part comes when a bpf prog does bpf_timer_start
on a pointer from a deleted map element. That's clearly a bug.
Concurrent bpf_timer_start and bpf_timer_cancel are ok to do
and are safe.
The malicious situation is a concurrent lookup+bpf_timer_start
with a parallel delete of that element.
That's a wrong thing to do with a map element regardless of timers.

> So your idea is to cmpxchg() to NULL while bpf_timer_start() or
> bpf_timer_cancel() works with the timer? Wouldn't that cause
> bpf_timer_init() believe that that timer is not yet initialized and
> not return -EBUSY. Granted that's a corner-case race, but still.

Not following.
A bpf prog should do bpf_timer_init only once.
bpf_timer_init after bpf_timer_cancel is wrong usage.
The hrtimer api doesn't have any protection for such use,
while bpf_timer_init returns -EBUSY.
A 2nd bpf_timer_init is just a misuse of the bpf_timer api.

> What if the spinlock was moved out of struct bpf_hrtimer into struct
> bpf_timer instead? Then all that cmpxchg and READ_ONCE/WRITE_ONCE
> could go away and be more "obviously correct" and race-free? We'd just
> need to make sure that the size of that spinlock struct doesn't change
> depending on kernel config (so no diagnostics should be embedded).

We already tackled that concern with bpf_spin_lock.
So moving the lock into 'struct bpf_timer' is easy and it's a great idea.
I suspect it should simplify the code. Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-04  1:12         ` Alexei Starovoitov
@ 2021-06-04  4:17           ` Andrii Nakryiko
  2021-06-04 18:16             ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2021-06-04  4:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Thu, Jun 3, 2021 at 6:12 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 03, 2021 at 10:10:38AM -0700, Andrii Nakryiko wrote:
> > >
> > > I think going too much into implementation details in the helper
> > > description is unnecessary.
> > > " Start the timer and set its expiration N nanoseconds from
> > >   the current time. "
> > > is probably about right amount of details.
> > > I can add that the time clock is monotonic
> > > and callback is called in softirq.
> >
> > I think mentioning whether it's going to be run on the same CPU or
> > another CPU is the most important part. I'm honestly still not sure
> > which one is the case, because I don't completely understand all the
> > implications of what "called in softirq" implies.
>
> "called in softirq" means that timer callback will be executing
> in softirq context on some cpu. That's all.
> The proposed API doesn't provide a way to call a timer on a given cpu
> or to pin it to a cpu.
> There are few places in the kernel that use ABS_PINNED and REL_PINNED
> variants of hrtimer.
> One such example is napi timer.
> The initial cpu is picked during hrtimer_init and it's always
> current cpu. napi is bound to a cpu. So when it calls hrtimer_init(.., _PINNED);
> it wants the callback to stay on the same cpu.
> The hrtimer doc says that _PINNED is ignored during hrtimer_init :)
> It is ignored, kinda, since initial target cpu is picked as current.
> Then during hrtimer_start the actual cpu will be selected.
> If it's called as hrtimer_start(,_PINNED); then the cpu will stay
> as it was during hrtimer_init.
> If hrtimer_start() is called without _PINNED the hrtimer algorithm can
> migrate the timer to a more appropriate cpu depending on idle and no_hz.
> See get_nohz_timer_target.
> In case of napi it's necessary to stay on the same cpu,
> so it's using _PINNED in hrtimer_init and in hrtimer_start.
> TCP is using pinned timers for compressed acks and pacing.
> I'm guessing it's done to improve performance. I suspect TCP doesn't
> need the timers pinned.
> All other _PINNED cases of hrtimer are similar to napi.
>
> In bpf world we don't have a way to deterministically
> execute on a given cpu and the hrtimer infra won't give us such
> possibility.
>
> We can potentailly hack it on top of it. Like
> bpf_timer_init(..., int cpu, ...)
> {
>   smp_call_function_single(cpu, rest_of_bpf_timer_init)
> }
>
> rest_of_bpf_timer_init()
> {
>   hrtimer_init
> }
>
> But there are lots of things to consider with such api.
> It feels like two building blocks collapsed into one already.
> If we do anything like this we should expose smp_call_function_single()
> directly as bpf helper instead of side effect of bpf_timer.

Yeah, all makes sense. And I agree that an orthogonal API is better.
Never mind then.

>
> > > > Also is there a chance for timer callback to be a sleepable BPF (sub-)program?
> > >
> > > Eventually yes. The whole bpf program can be sleepable, but the
> > > timer callback cannot be. So the verifier would need to
> > > treat different functions of the bpf prog as sleepable and non-sleepable.
> > > Easy enough to implement. Eventually.
> >
> > Ok, so non-sleepable callback is hrtimer's implementation restriction
> > due to softirq, right? Too bad, of course, I can imagine sleepable
> > callbacks being useful, but it's not a deal breaker.
>
> Sleeping in softirq is no-go. The timer callback will be always
> non-sleepable. If it would need to do extra work that requires sleeping
> we'd need to create bpf kthreads similar to io_uring worker threads.
> bpf program would have to ask for such things explicitly.

yep

>
> > > The flags argument in bpf_timer_init() will eventually
> > > be able to specify monotonic vs boottime and
> > > relative vs absolute expiry.
> >
> > Yeah, I was thinking about relative vs absolute expiry case, but we
> > never know what kind of extensibility we'll need, so there might be
> > something that we don't see as a possibility yet. Seems simple enough
> > to reserve flags argument here (we usually do not regret adding flags
> > argument for extensibility), but I'm fine either way.
>
> We cannot predict the future of bpf_timer, but the way it's going
> so far there is a certainty that bpf_timer_start will be limited by
> what hrtimer_start can do.
> If we ever decide to use jiffy based timer as well they most likely
> going to be represented as new set of helpers.
> May be both hidden in the same 'struct bpf_timer',
> but helper names would have to be different not to confuse users.
>

ok

> > > Currently thinking to do cmpxchg in bpf_timer_start() and
> > > bpf_timer_cancel*() similar to bpf_timer_init() to address it.
> > > Kinda sucks to use atomic ops to address a race by deliberately
> > > malicious prog. bpf prog writers cannot just stumble on such race.
> >
> > Why deliberately malicious? Just sufficiently sloppy or even just a
> > clever concurrent BPF program.
>
> because hrtimers internals don't have protection for concurrent access.
> It's assumed by kernel apis that the same hrtimer is not concurrently
> accessed on multiple cpus.
> Like doing hrtimer_init in parallel will certainly crash the kernel.
> That's why bpf_timer_init() has extra cmpxchg safety builtin.
> bpf_timer_start and bpf_timer_cancel don't have extra safety,
> because the first thing hrtimer_start and hrtimer_cancel do
> is they grab the lock, so everything will be safe even in bpf
> programs try to access the same timer in parallel.
>
> The malicicous part comes when bpf prog does bpf_timer_start
> on the pointer from a deleted map element. It's clearly a bug.
> Concurrent bpf_timer_start and bpf_timer_cancel is ok to do
> and it's safe.
> The malicious situation is concurrent lookup+bpf_timer_start
> with parallel delete of that element.
> It's wrong thing to do with map element regardless of timers.

I don't know if I agree with calling a buggy BPF program "malicious",
but I think that won't matter if we embed the spinlock as suggested
below, right? Because then we can free the pointer, but the spinlock is
always there and always "usable", so it can be taken first thing before
doing anything with that extra pointer.

>
> > So your idea is to cmpxchg() to NULL while bpf_timer_start() or
> > bpf_timer_cancel() works with the timer? Wouldn't that cause
> > bpf_timer_init() believe that that timer is not yet initialized and
> > not return -EBUSY. Granted that's a corner-case race, but still.
>
> Not following.
> bpf prog should do bpf_timer_init only once.
> bpf_timer_init after bpf_timer_cancel is a wrong usage.
> hrtimer api doesn't have any protection for such use.
> while bpf_timer_init returns EBUSY.
> 2nd bpf_timer_init is just a misuse of bpf_timer api.

Yes, clearly a bad use of the API, but it's not prevented by the
verifier. So maybe I misunderstood what you meant by

> > > Currently thinking to do cmpxchg in bpf_timer_start() and
> > > bpf_timer_cancel*() similar to bpf_timer_init() to address it.

because that seemed like you were going to exchange (temporarily) a
pointer to NULL while doing bpf_timer_start() or bpf_timer_cancel(),
and then set NULL -> valid ptr back again (this sequence would
open up a window where bpf_timer_init() can be used twice on the same
element). But again, with the spinlock embedded it doesn't matter anymore.

>
> > What if the spinlock was moved out of struct bpf_hrtimer into struct
> > bpf_timer instead? Then all that cmpxchg and READ_ONCE/WRITE_ONCE
> > could go away and be more "obviously correct" and race-free? We'd just
> > need to make sure that the size of that spinlock struct doesn't change
> > depending on kernel config (so no diagnostics should be embedded).
>
> We already tackled that concern with bpf_spin_lock.
> So moving the lock into 'struct bpf_timer' is easy and it's a great idea.
> I suspect it should simplify the code. Thanks!

sure, glad it helped

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Introduce bpf_timer
  2021-06-04  4:17           ` Andrii Nakryiko
@ 2021-06-04 18:16             ` Alexei Starovoitov
  0 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2021-06-04 18:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: David S. Miller, Daniel Borkmann, Andrii Nakryiko, Networking,
	bpf, Kernel Team

On Thu, Jun 3, 2021 at 9:17 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> >
> > > So your idea is to cmpxchg() to NULL while bpf_timer_start() or
> > > bpf_timer_cancel() works with the timer? Wouldn't that cause
> > > bpf_timer_init() believe that that timer is not yet initialized and
> > > not return -EBUSY. Granted that's a corner-case race, but still.
> >
> > Not following.
> > bpf prog should do bpf_timer_init only once.
> > bpf_timer_init after bpf_timer_cancel is a wrong usage.
> > hrtimer api doesn't have any protection for such use.
> > while bpf_timer_init returns EBUSY.
> > 2nd bpf_timer_init is just a misuse of bpf_timer api.
>
> Yes, clearly a bad use of API, but it's not prevented by verifier.

not prevented only because it's hard to do in the verifier.

> > > > Currently thinking to do cmpxchg in bpf_timer_start() and
> > > > bpf_timer_cancel*() similar to bpf_timer_init() to address it.
>
> because that seemed like you were going to exchange (temporarily) a
> pointer to NULL while doing bpf_timer_start() or bpf_timer_cancel(),
> and then setting NULL -> valid ptr back again (this sequence would
> open up a window when bpf_timer_init() can be used twice on the same
> element). But again, with spinlock embedded doesn't matter anymore.

Right, except bpf_timer_start and bpf_timer_cancel would xchg with -1
or a similar sentinel, so bpf_timer_init wouldn't get confused.
If two bpf_timer_start()s race on the same timer one would receive
-EMISUSEOFAPI right away.
Whereas with spin_lock inside bpf_timer both will be serialized and
both will succeed.
One can argue that bpf_timer_start and bpf_timer_cancel on different cpus
is a realistic scenario. So the xchg approach would need two special
pointer values, -1 and -2, to distinguish a start/start bad race from a
start/cancel good race.
And everything gets too clever. spin_lock is "obviously correct",
though it doesn't have the advantage of informing users of api misuse.
I coded it up and it's surviving the tests so far :)
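The lock-based version is roughly (untested sketch of what I have now;
the lock/unlock primitive, the hrtimer mode flag and the prog refcnt
handling are placeholders/omitted, so don't take it literally):

BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
{
    struct bpf_hrtimer *t;
    int ret = 0;

    /* the lock lives in the map value itself, so it outlives 't' */
    __bpf_spin_lock(&timer->lock);
    t = timer->timer;
    if (!t) {
        ret = -EINVAL;
        goto out;
    }
    hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
out:
    __bpf_spin_unlock(&timer->lock);
    return ret;
}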

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-06-04 18:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27  4:02 [PATCH bpf-next 0/3] bpf: Introduce BPF timers Alexei Starovoitov
2021-05-27  4:02 ` [PATCH bpf-next 1/3] bpf: Introduce bpf_timer Alexei Starovoitov
2021-05-27 16:57   ` Toke Høiland-Jørgensen
2021-06-02  1:46     ` Alexei Starovoitov
2021-06-02 22:21       ` Toke Høiland-Jørgensen
2021-06-03  1:58         ` Alexei Starovoitov
2021-06-03 10:59           ` Toke Høiland-Jørgensen
2021-06-03 17:21             ` Andrii Nakryiko
2021-06-02 22:08   ` Andrii Nakryiko
2021-06-03  1:53     ` Alexei Starovoitov
2021-06-03 17:10       ` Andrii Nakryiko
2021-06-04  1:12         ` Alexei Starovoitov
2021-06-04  4:17           ` Andrii Nakryiko
2021-06-04 18:16             ` Alexei Starovoitov
2021-05-27  4:02 ` [PATCH bpf-next 2/3] bpf: Add verifier checks for bpf_timer Alexei Starovoitov
2021-06-02 22:34   ` Andrii Nakryiko
2021-06-02 22:38     ` Andrii Nakryiko
2021-06-03  2:04     ` Alexei Starovoitov
2021-06-03 17:35       ` Andrii Nakryiko
2021-05-27  4:02 ` [PATCH bpf-next 3/3] selftests/bpf: Add bpf_timer test Alexei Starovoitov
