Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Yonghong Song <yhs@fb.com>,
	"David S. Miller" <davem@davemloft.net>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
Date: Tue, 15 Jun 2021 08:24:13 -0700	[thread overview]
Message-ID: <CAEf4BzZWr7HhKn3opxHeaZqkgo4gsYYhDQ4d4HuNhx-i8XgjCg@mail.gmail.com> (raw)
In-Reply-To: <CAADnVQJ3CQ=WnsantyEy6GB58rdsd7q=aJv93WPsZZJmXdJGzQ@mail.gmail.com>

On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > +                                         (u64)(long)key,
> > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > >
> > > > I didn't find that next patch disallows callback return value 1 in the
> > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > don't need WARN_ON here. Did I miss anything?
> > >
> > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > >
> > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > +             /* If the timer wasn't active or callback already executing
> > > > > +              * bump the prog refcnt to keep it alive until
> > > > > +              * callback is invoked (again).
> > > > > +              */
> > > > > +             bpf_prog_inc(t->prog);
> > > >
> > > > I am not 100% sure. But could we have race condition here?
> > > >     cpu 1: running bpf_timer_start() helper call
> > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > >
> > > > Is it possible that
> > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > callback function could have dropped the reference count for t->prog,
> > > > so we could already go into the body of the function
> > > > __bpf_prog_put()?
> > >
> > > you're correct. Indeed there is a race.
> > > Circular dependency is a never ending headache.
> > > That's the same design mistake as with tail_calls.
> > > It felt that this case would be simpler than tail_calls and a bpf program
> > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > I'll get rid of this and switch to something 'obviously correct'.
> > > Probably a link list with a lock to keep a set of init-ed timers and
> > > auto-cancel them on prog refcnt going to zero.
> > > To do 'bpf daemon' the prog would need to be pinned.
> >
> > Hm.. wouldn't this eliminate that race:
> >
> > switch (hrtimer_try_to_cancel(&t->timer))
> > {
> > case 0:
> >     /* nothing was queued */
> >     bpf_prog_inc(t->prog);
> >     break;
> > case 1:
> >     /* already have refcnt and it won't be bpf_prog_put by callback */
> >     break;
> > case -1:
> >     /* callback is running and will bpf_prog_put, so we need to take
> > another refcnt */
> >     bpf_prog_inc(t->prog);
> >     break;
> > }
> > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> >
> > So instead of guessing (racily) whether there is a queued callback or
> > not, try to cancel just in case there is. Then rely on the nice
> > guarantees that hrtimer cancellation API provides.
>
> I haven't thought it through yet, but the above approach could
> indeed solve this particular race. Unfortunately there are other races.
> There is an issue with bpf_timer_init. Since it doesn't take refcnt
> another program might do lookup and bpf_timer_start
> while the first prog got to refcnt=0 and got freed.

I think it's because of an API design. bpf_timer_init() takes a
callback (i.e., bpf_prog) but doesn't really do anything with it (so
doesn't take refcnt). It's both problematic, as you point out, and
unnecessarily restricting because it doesn't allow to change the
callback (e.g., when map is shared and bpf_program has to be changed).
If you change API to be:

long bpf_timer_init(struct bpf_timer *timer, int flags);
long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);

You'll avoid this problem because bpf_timer_start will take refcnt
when arming (or re-arming) the timer. bpf_timer_init() will only take
care of initial memory allocation and hrtimer_init, but will leave
timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
the problems and be more flexible/powerful? If necessary, we can teach
bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
pointer, but I haven't thought much about whether that's necessary.

If we wanted to push this to extreme, btw, we don't really need
bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
very first time (having pre-allocated spinlock makes this non-racy and
easy). But I don't know how expensive hrtimer_init() is, so it might
still make sense to split those two operations. Further, merging
bpf_timer_init() and bpf_timer_start() would require 6 input
arguments, which is a bit problematic. I have an idea how to get rid
of the necessity to pass in bpf_prog (so we'll be fine with just 5
explicit arguments), which simplifies other things (like
bpf_cgroup_storage implementation) as well, but I don't have patches
yet.

> Adding refcnt to bpf_timer_init() makes the prog self pinned
> and no callback might ever be executed (if there were no bpf_timer_start),
> so that will cause a high chance of bpf prog stuck in the kernel.
> There could be ref+uref schemes similar to tail_calls to address all that,
> but it gets ugly quickly.
> imo all these issues and races is a sign that such self pinning
> shouldn't be allowed.

I think prog refcounting is actually the saner and less surprising
approach here and we just need to spend a bit more time thinking how
to make everything work reliably. hrtimer API seems to be designed to
handle cases like this, which makes everything much easier.