Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Yonghong Song <yhs@fb.com>,
	"David S. Miller" <davem@davemloft.net>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH v3 bpf-next 1/8] bpf: Introduce bpf timers.
Date: Wed, 30 Jun 2021 10:38:57 -0700	[thread overview]
Message-ID: <20210630173854.ofe2rvbghmkn4w6k@ast-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <CAEf4BzbpF7S2861ueTHC7u4avzFZU7vXkujNX+bLewd4hN5trw@mail.gmail.com>

On Wed, Jun 30, 2021 at 01:08:08PM +0300, Andrii Nakryiko wrote:
> On Tue, Jun 29, 2021 at 4:28 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 28, 2021 at 11:34 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > Have you considered alternatively to implement something like
> > > bpf_ringbuf_query() for BPF ringbuf that will allow to query various
> > > things about the timer (e.g., whether it is active or not, and, of
> > > course, remaining expiry time). That will be more general, easier to
> > > extend, and will cover this use case:
> > >
> > > long exp = bpf_timer_query(&t->timer, BPF_TIMER_EXPIRY);
> > > bpf_timer_start(&t->timer, new_callback, exp);
> >
> > yes, but...
> > hrtimer_get_remaining + timer_start to that value is racy
> > and not accurate.
> 
> yes, but even though we specify expiration in nanosecond precision, no
> one should expect that precision w.r.t. when callback is actually
> fired. So fetching current expiration, adding new one, and re-setting
> it shouldn't be a problem in practice, IMO.

Just because we're dealing with time? Sounds sloppy. I suspect RT
folks take pride to make nsec precision as accurate as possible.

> I just think the most common case is to set a timer once, so ideally
> usability is optimized for that (so taken to extreme it would be just
> bpf_timer_start without any bpf_timer_init, but we've already
> discussed this, no need to do that again here). Needing bpf_timer_init
> + bpf_timer_set_callbcack + bpf_timer_start for a common case feels
> suboptimal usability-wise.

init+set+start could be one helper. See more below.

> 
> There is also a new race with bpf_timer_set_callback +
> bpf_timer_start. Callback can fire inbetween those two operations, so
> we could get new callback at old expiration or old callback with new
> expiration. 

sure, but that's a different issue.
When XDP prog is being replaced some packets might hit old one
even though there is an atomic replace of the pointer to a prog.
Two XDP progs and two timer callbacks running on different cpus
is an inevitable situation.

> To do full update reliably, you'd need to explicitly
> bpf_timer_cancel() first, at which point separate
> bpf_timer_set_callback() doesn't help at all.
> 
> > hrtimer_get_expires_ns + timer_start(MODE_ABS)
> > would be accurate, but that's an unnecessary complication.
> > To live replace old bpf prog with new one
> > bpf_for_each_map_elem() { bpf_timer_set_callback(new_prog); }
> > is much faster, since timers don't need to be dequeue, enqueue.
> > No need to worry about hrtimer machinery internal changes, etc.
> > bpf prog being replaced shouldn't be affecting the rest of the system.
> 
> That's a good property, but if it was done as a
> bpf_timer_set_callback() in addition to current
> bpf_timer_start(callback_fn) it would still allow to have a simple
> typical use.
> 
> Another usability consideration. With mandatory
> bpf_timer_set_callback(), bpf_timer_start() will need to return some
> error code if the callback wasn't set yet, right? I'm afraid that in
> practice it will be the situation similar to bpf_trace_printk() where
> people expect that it always succeeds and will never check the return
> code. It's obviously debuggable, but a friction point nevertheless.

It sucks that you had this printk experience. We screwed up.
It's definitely something to watch out for in the future.
But this analogy doesn't apply here.
bpf_timer_init/set_callback/start/cancel matches one to one to hrtimer api.
In earlier patches the callback setting was part of 'init' and then later
it was part of 'start'. imo that was 'reinvent the wheel' moment.
Not sure why such simple and elegant solution as indepdent
bpf_timer_set_callback wasn't obvious earlier.
There are tons of examples of hrtimer usage in the kernel
and it's safe to assume that bpf usage will be similar.
Typically the kernel does init + set_callback once and then start/cancel a lot.
Including doing 'start' from the callback.
I found one driver where callback is being selected dynamically.
So even without 'bpf prog live replace' use case there could be
use cases for dynamic set_callback for bpf timers too.
In all cases I've examined the 'start' is the most used function.
Coupling it with setting callback looks quite wrong to me from api pov.
Like there are examples in the kernel where 'start' is done in one .c file
while callback is defined in a different .c file.
Doing 'extern .. callback();' just to pass it into hrimter_start()
would have been ugly. Same thing we can expect to happen with bpf timers.

But if you really really want bpf_timer_start with callback I wouldn't
mind to have:
static inline int bpf_timer_start2(timer, callback, nsec)
{
  int err = bpf_timer_set_callback(timer, callback);
  if (err)...
  err = bpf_timer_start(timer, nsec, 0);
  ...
}
to be defined in libbpf's bpf_helpers.h file.
That could be a beginning of new way of defining helpers.
But based on the kernel usage of the hrimter I suspect that this helper
won't be used too much and people will stick to independent
steps to set callback and start it.

There could be a helper that does init+set+start too.

> >
> > > This will keep common timer scenarios to just two steps, init + start,
> > > but won't prevent more complicated ones. Things like extending
> > > expiration by one second relative that what was remaining will be
> > > possible as well.
> >
> > Extending expiration would be more accurate with hrtimer_forward_now().
> >
> > All of the above points are minor compared to the verifier advantage.
> > bpf_timer_set_callback() typically won't be called from the callback.
> > So verifier's insn_procssed will be drastically lower.
> > The combinatorial explosion of states even for this small
> > selftests/bpf/progs/timer.c is significant.
> > With bpf_timer_set_callback() is done outside of callback the verifier
> > behavior will be predictable.
> > To some degree patches 4-6 could have been delayed, but since the
> > the algo is understood and it's working, I'm going to keep them.
> > It's nice to have that flexibility, but the less pressure on the
> > verifier the better.
> 
> I haven't had time to understand those new patches yet, sorry, so not
> sure where the state explosion is coming from. I'll get to it for real
> next week. But improving verifier internals can be done transparently,
> while changing/fixing BPF UAPI is much harder and more disruptive.

It's not about the inadequate implementation of the async callback
verification in patches 4-6 (as you're hinting).
It's path aware property of the verifier that requires more passes
when callback is set from callback. Even with brand new verifier 2.0
the more passes issue will remain the same (unless it's not path
aware and can merge different branches and states).