On Tue, Jun 15, 2021 at 10:54:40PM -0700, Andrii Nakryiko wrote:
>
> It could be the case, of course. But let's try to think this through
> to the end before giving up. I think it's mostly because we are trying
> to be too clever with lockless synchronization.

imo your proposed code fits "too clever" too ;)
Just a reminder that a few emails ago you argued for an "obviously correct"
approach, but now...

> I had a feeling that bpf_timer_cb needs to take the lock as well. But once
> we add that, the refcounting becomes simpler and more deterministic, IMO.
> Here's what I have in mind. I keep only the important parts of the code,
> so it's not a complete implementation. Please take a look below, I
> left a few comments here and there.
>
> struct bpf_hrtimer {
>         struct hrtimer timer;
>         struct bpf_map *map;
>         void *value;
>
>         struct bpf_prog *prog;
>         void *callback_fn;
>
>         /* pointer to the lock in struct bpf_timer_kern
>          * so that we can access it from bpf_timer_cb()
>          */
>         struct bpf_spin_lock *lock;
> };
>
> BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
>            struct bpf_map *, map)
> {
>         struct bpf_hrtimer *t;
>         int ret = 0;
>
>         ____bpf_spin_lock(&timer->lock);
>         t = timer->timer;
>         if (t) {
>                 ret = -EBUSY;
>                 goto out;
>         }
>         /* allocate hrtimer via map_kmalloc to use memcg accounting */
>         t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
>         if (!t) {
>                 ret = -ENOMEM;
>                 goto out;
>         }
>         t->value = (void *)timer /* - offset of bpf_timer inside elem */;
>         t->map = map;
>         t->lock = &timer->lock; /* so bpf_timer_cb() can take the lock */
>
>         /* we'll init them in bpf_timer_start */
>         t->prog = NULL;
>         t->callback_fn = NULL;
>
>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>         /* after hrtimer_init(), since it zeroes the hrtimer */
>         t->timer.function = bpf_timer_cb;
>         timer->timer = t;
> out:
>         ____bpf_spin_unlock(&timer->lock);
>         return ret;
> }
>
> BPF_CALL_4(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
>            void *, cb, struct bpf_prog *, prog)
> {
>         struct bpf_hrtimer *t;
>         int ret = 0;
>
>         ____bpf_spin_lock(&timer->lock);
>         t = timer->timer;
>         if (!t) {
>                 ret = -EINVAL;
>                 goto out;
>         }
>
>         /* doesn't matter what it returns, we just request cancellation */
>         hrtimer_try_to_cancel(&t->timer);
>
>         /* t->prog might not be the same as prog (!) */
>         if (prog != t->prog) {
>                 /* the callback hasn't dropped the refcnt yet */
>                 if (t->prog) /* if it's NULL, bpf_timer_cb() is running
>                               * and will put it later
>                               */
>                         bpf_prog_put(t->prog);
>
>                 if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
>                         /* this will only happen if prog is still running
>                          * (and it's actually us), but its refcnt has
>                          * already dropped to zero, e.g., by closing the
>                          * last FD, so there is no point in scheduling a
>                          * new run
>                          */

I have a bit of a mind explosion here... everything will be alright.

>                         t->prog = NULL;
>                         t->callback_fn = NULL;
>                         ret = -E_WE_ARE_SHUTTING_DOWN;
>                         goto out;
>                 }
>         } /* otherwise we keep the existing refcnt on t->prog == prog */
>
>         /* potentially new combination of prog and cb */
>         t->prog = prog;
>         t->callback_fn = cb;
>
>         hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> out:
>         ____bpf_spin_unlock(&timer->lock);
>         return ret;
> }
>
> BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> {
>         struct bpf_hrtimer *t;
>         int ret = 0;
>
>         ____bpf_spin_lock(&timer->lock);
>         t = timer->timer;
>         if (!t) {
>                 ret = -EINVAL;
>                 goto out;
>         }
>
>         /* this part I still worry about due to the possibility of cpu
>          * migration; we need to think about whether we should
>          * migrate_disable() in bpf_timer_cb() and the bpf_timer_*()
>          * helpers, but that's a separate topic
>          */
>         if (this_cpu_read(hrtimer_running) == t) {
>                 ret = -EDEADLK;
>                 goto out;
>         }
>
>         ret = hrtimer_cancel(&t->timer);
>
>         if (t->prog) {
>                 /* bpf_timer_cb hasn't put it yet (and now won't) */
>                 bpf_prog_put(t->prog);
>                 t->prog = NULL;
>                 t->callback_fn = NULL;
>         }
> out:
>         ____bpf_spin_unlock(&timer->lock);
>         return ret;
> }
>
> static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> {
>         struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
>         struct bpf_map *map = t->map;
>         struct bpf_prog *prog;
>         void *key, *callback_fn;
>         u32 idx;
>         int ret;
>
>         /* this is very IMPORTANT */
>         ____bpf_spin_lock(t->lock);
>
>         prog = t->prog;
>         if (!prog) {
>                 /* we were cancelled, prog is put already, exit early */
>                 ____bpf_spin_unlock(t->lock);
>                 return HRTIMER_NORESTART;
>         }
>         callback_fn = t->callback_fn;
>
>         /* make sure bpf_timer_cancel/bpf_timer_start won't
>          * bpf_prog_put our prog
>          */
>         t->prog = NULL;
>         t->callback_fn = NULL;
>
>         ____bpf_spin_unlock(t->lock);
>
>         /* at this point we "own" prog's refcnt decrement */
>
>         this_cpu_write(hrtimer_running, t);
>
>         ...
>
>         ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
>                                          (u64)(long)key,
>                                          (u64)(long)value, 0, 0);
>         WARN_ON(ret != 0); /* the next patch disallows 1 in the verifier */
>
>         bpf_prog_put(prog); /* always correct and non-racy */
>
>         this_cpu_write(hrtimer_running, NULL);
>
>         return HRTIMER_NORESTART;
> }
>
> bpf_timer_cancel_and_free() is mostly the same, with the t->prog NULL
> check as everywhere else.

I haven't started a detailed analysis of the above proposal, but it looks
overly complicated at first glance. Not saying it's bad or good. Just the
complexity and the races are striking.

>
> > There is no need to complicate bpf_timer with crazy refcnting schemes.
> > The user space can simply pin the program in bpffs. In the future we
> > might introduce a self-pinning helper that would pin the program and
> > create a file. Sort-of like the syscall prog type would pin itself.
> > That would be an explicit and clean api instead of obscure hacks inside
> > bpf_timer*().
>
> Do I understand correctly that the alternative you are proposing is to
> keep some linked list of all map_values across all maps in the system
> that have an initialized bpf_hrtimer with that particular bpf_prog in
> them? And when the bpf_prog is put to zero you'll go and destruct them
> all in a race-free way?
>
> I have a bit of a hard time imagining how that will be implemented
> exactly, so I might be overcomplicating it in my mind. Will be happy
> to see the working code.

Here is the working code...
Note how patch 1 is much simpler without the complicated refcnting.
And how patch 2 removes the for_each_map_elem that was necessary earlier.

Also note that the linked-list approach is an optimization. Instead of
keeping a linked list, bpf_free_used_timers() could call a map-specific op
to iterate all elems and free the timers with timer->prog == prog_going_away.
That was my initial proposal a couple of months ago.
The linked list is purely an optimization over for_each_map_elem.