On Tue, Jun 15, 2021 at 10:54:40PM -0700, Andrii Nakryiko wrote:
>
> It could be the case, of course. But let's try to think this through
> to the end before giving up. I think it's mostly because we are trying
> to be too clever with lockless synchronization.

imo your proposed code fits "too clever" too ;)
Just a reminder that a few emails ago you argued for an "obviously correct"
approach, but now...

> I had a feeling that bpf_timer_cb needs to take the lock as well. But once
> we add that, the refcounting becomes simpler and more deterministic, IMO.
> Here's what I have in mind. I keep only the important parts of the code,
> so it's not a complete implementation. Please take a look below, I
> left a few comments here and there.
>
> struct bpf_hrtimer {
>         struct hrtimer timer;
>         struct bpf_map *map;
>         void *value;
>
>         struct bpf_prog *prog;
>         void *callback_fn;
>
>         /* pointer to the lock in struct bpf_timer_kern
>          * so that we can access it from bpf_timer_cb()
>          */
>         struct bpf_spin_lock *lock;
> };
>
> BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
>            struct bpf_map *, map)
> {
>         struct bpf_hrtimer *t;
>         int ret = 0;
>
>         ____bpf_spin_lock(&timer->lock);
>         t = timer->timer;
>         if (t) {
>                 ret = -EBUSY;
>                 goto out;
>         }
>         /* allocate hrtimer via map_kmalloc to use memcg accounting */
>         t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
>         if (!t) {
>                 ret = -ENOMEM;
>                 goto out;
>         }
>         t->value = (void *)timer /* - offset of bpf_timer inside elem */;
>         t->map = map;
>         t->lock = &timer->lock; /* so bpf_timer_cb() can take the lock */
>
>         /* we'll init them in bpf_timer_start */
>         t->prog = NULL;
>         t->callback_fn = NULL;
>
>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>         /* after hrtimer_init(), since it zeroes the hrtimer */
>         t->timer.function = bpf_timer_cb;
>         timer->timer = t;
> out:
>         ____bpf_spin_unlock(&timer->lock);
>         return ret;
> }
>
> BPF_CALL_4(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
>            void *, cb, struct bpf_prog *, prog)
> {
>         struct bpf_hrtimer *t;
>         int ret = 0;
>
>         ____bpf_spin_lock(&timer->lock);
>         t = timer->timer;
>         if (!t) {
>                 ret = -EINVAL;
>                 goto out;
>         }
>
>         /* doesn't matter what it returns, we just request cancellation */
>         hrtimer_try_to_cancel(&t->timer);
>
>         /* t->prog might not be the same as prog (!) */
>         if (prog != t->prog) {
>                 /* the callback hasn't dropped the refcnt yet */
>                 if (t->prog) /* if it's NULL, bpf_timer_cb() is running
>                               * and will put it later
>                               */
>                         bpf_prog_put(t->prog);
>
>                 if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
>                         /* this will only happen if prog is still running
>                          * (and it's actually us), but its refcnt has
>                          * already dropped to zero, e.g., by closing the
>                          * last FD, so there is no point in scheduling a
>                          * new run
>                          */

I have a bit of a mind explosion here... everything will be alright.

>                         t->prog = NULL;
>                         t->callback_fn = NULL;
>                         ret = -E_WE_ARE_SHUTTING_DOWN;
>                         goto out;
>                 }
>         } /* otherwise we keep the existing refcnt on t->prog == prog */
>
>         /* potentially new combination of prog and cb */
>         t->prog = prog;
>         t->callback_fn = cb;
>
>         hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> out:
>         ____bpf_spin_unlock(&timer->lock);
>         return ret;
> }
>
> BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> {
>         struct bpf_hrtimer *t;
>         int ret = 0;
>
>         ____bpf_spin_lock(&timer->lock);
>         t = timer->timer;
>         if (!t) {
>                 ret = -EINVAL;
>                 goto out;
>         }
>
>         /* this part I still worry about due to the possibility of cpu
>          * migration; we need to think about whether we should
>          * migrate_disable() in bpf_timer_cb() and the bpf_timer_*()
>          * helpers, but that's a separate topic
>          */
>         if (this_cpu_read(hrtimer_running) == t) {
>                 ret = -EDEADLK;
>                 goto out;
>         }
>
>         ret = hrtimer_cancel(&t->timer);
>
>         if (t->prog) {
>                 /* bpf_timer_cb hasn't put it yet (and now won't) */
>                 bpf_prog_put(t->prog);
>                 t->prog = NULL;
>                 t->callback_fn = NULL;
>         }
> out:
>         ____bpf_spin_unlock(&timer->lock);
>         return ret;
> }
>
> static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> {
>         struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
>         struct bpf_map *map = t->map;
>         struct bpf_prog *prog;
>         void *key, *callback_fn;
>         u32 idx;
>         int ret;
>
>         /* this is very IMPORTANT */
>         ____bpf_spin_lock(t->lock);
>
>         prog = t->prog;
>         if (!prog) {
>                 /* we were cancelled, prog is put already, exit early */
>                 ____bpf_spin_unlock(t->lock);
>                 return HRTIMER_NORESTART;
>         }
>         callback_fn = t->callback_fn;
>
>         /* make sure bpf_timer_cancel/bpf_timer_start won't
>          * bpf_prog_put our prog
>          */
>         t->prog = NULL;
>         t->callback_fn = NULL;
>
>         ____bpf_spin_unlock(t->lock);
>
>         /* at this point we "own" prog's refcnt decrement */
>
>         this_cpu_write(hrtimer_running, t);
>
>         ...
>
>         ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
>                                          (u64)(long)key,
>                                          (u64)(long)value, 0, 0);
>         WARN_ON(ret != 0); /* the next patch disallows 1 in the verifier */
>
>         bpf_prog_put(prog); /* always correct and non-racy */
>
>         this_cpu_write(hrtimer_running, NULL);
>
>         return HRTIMER_NORESTART;
> }
>
> bpf_timer_cancel_and_free() is mostly the same, with the t->prog NULL
> check as everywhere else.

I haven't started a detailed analysis of the above proposal, but it looks
overly complicated at first glance. Not saying it's bad or good. Just the
complexity and the races are striking.

>
> > There is no need to complicate bpf_timer with crazy refcnting schemes.
> > The user space can simply pin the program in bpffs. In the future we
> > might introduce a self-pinning helper that would pin the program and
> > create a file. Sort-of like the syscall prog type would pin itself.
> > That would be an explicit and clean api instead of obscure hacks inside
> > bpf_timer*().
>
> Do I understand correctly that the alternative you are proposing is to
> keep some linked list of all map_values across all maps in the system
> that have an initialized bpf_hrtimer with that particular bpf_prog in
> them? And when the bpf_prog is put to zero you'll go and destruct them
> all in a race-free way?
>
> I have a bit of a hard time imagining how that will be implemented
> exactly, so I might be overcomplicating it in my mind. Will be happy
> to see the working code.

Here is the working code...
Note how patch 1 is much simpler without the complicated refcnting.
And how patch 2 removes the for_each_map_elem that was necessary earlier.

Also note that the linked-list approach is an optimization. Instead of
keeping a linked list, bpf_free_used_timers() could call a map-specific op
to iterate all elems and free the timers with timer->prog == prog_going_away.
That was my initial proposal a couple of months ago.
The linked list is purely an optimization over for_each_map_elem.