Re: [RFC Patch bpf-next] bpf: introduce bpf timer

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	duanxiongchun@bytedance.com,
	Dongdong Wang <wangdongdong.6@bytedance.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Cong Wang <cong.wang@bytedance.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>
Subject: Re: [RFC Patch bpf-next] bpf: introduce bpf timer
Date: Fri, 2 Apr 2021 16:45:00 -0700	[thread overview]
Message-ID: <20210402234500.by3wigegeluy5w7j@ast-mbp> (raw)
In-Reply-To: <CAM_iQpUfv7c19zFN1Y5-cSUiVwpk0bmtBMSxZoELgDOFCQ=qAw@mail.gmail.com>

On Fri, Apr 02, 2021 at 02:24:51PM -0700, Cong Wang wrote:
> > > where the key is the timer ID and the value is the timer expire
> > > timer.
> >
> > The timer ID is unnecessary. We cannot introduce new IDR for every new
> > bpf object. It doesn't scale.
> 
> The IDR is per map, not per timer.

Per-map is not acceptable. One IDR for all maps with timers is not acceptable either.
We have 3 IDRs now: for progs, for maps, and for links.
No other objects need IDRs.

> > Here is how more general timers might look like:
> > https://lore.kernel.org/bpf/20210310011905.ozz4xahpkqbfkkvd@ast-mbp.dhcp.thefacebook.com/
> >
> > include/uapi/linux/bpf.h:
> > struct bpf_timer {
> >   u64 opaque;
> > };
> > The 'opaque' field contains a pointer to dynamically allocated struct timer_list and other data.
> 
> This is my initial design as we already discussed, it does not work,
> please see below.

It does work. The perceived "issue" you referred to is a misunderstanding. See below.

> >
> > The prog would do:
> > struct map_elem {
> >     int stuff;
> >     struct bpf_timer timer;
> > };
> >
> > struct {
> >     __uint(type, BPF_MAP_TYPE_HASH);
> >     __uint(max_entries, 1);
> >     __type(key, int);
> >     __type(value, struct map_elem);
> > } hmap SEC(".maps");
> >
> > static int timer_cb(struct map_elem *elem)
> > {
> >     if (whatever && elem->stuff)
> >         bpf_timer_mod(&elem->timer, new_expire);
> > }
> >
> > int bpf_timer_test(...)
> > {
> >     struct map_elem *val;
> >
> >     val = bpf_map_lookup_elem(&hmap, &key);
> >     if (val) {
> >         bpf_timer_init(&val->timer, timer_cb, flags);
> >         val->stuff = 123;
> >         bpf_timer_mod(&val->timer, expires);
> >     }
> > }
> >
> > bpf_map_update_elem() either from bpf prog or from user space
> > allocates map element and zeros 8 byte space for the timer pointer.
> > bpf_timer_init() allocates timer_list and stores it into opaque if opaque == 0.
> > The validation of timer_cb() is done by the verifier.
> > bpf_map_delete_elem() either from bpf prog or from user space
> > does del_timer() if elem->opaque != 0.
> > If prog refers such hmap as above during prog free the kernel does
> > for_each_map_elem {if (elem->opaque) del_timer().}
> > I think that is the simplest way of prevent timers firing past the prog life time.
> > There could be other ways to solve it (like prog_array and ref/uref).
> >
> > Pseudo code:
> > int bpf_timer_init(struct bpf_timer *timer, void *timer_cb, int flags)
> > {
> >   if (timer->opaque)
> >     return -EBUSY;
> >   t = alloc timer_list
> >   t->cb = timer_cb;
> >   t->..
> >   timer->opaque = (long)t;
> > }
> >
> > int bpf_timer_mod(struct bpf_timer *timer, u64 expires)
> > {
> >   if (!time->opaque)
> >     return -EINVAL;
> >   t = (struct timer_list *)timer->opaque;
> >   mod_timer(t,..);
> > }
> >
> > int bpf_timer_del(struct bpf_timer *timer)
> > {
> >   if (!time->opaque)
> >     return -EINVAL;
> >   t = (struct timer_list *)timer->opaque;
> >   del_timer(t);
> > }
> >
> > The verifier would need to check that 8 bytes occupied by bpf_timer and not accessed
> > via load/store by the program. The same way it does it for bpf_spin_lock.
> 
> This does not work, because bpf_timer_del() has to be matched
> with bpf_timer_init(), otherwise we would leak timer resources.
> For example:
> 
> SEC("foo")
> bad_ebpf_code()
> {
>   struct bpf_timer t;
>   bpf_timer_init(&t, ...); // allocate a timer
>   bpf_timer_mod(&t, ..);
>   // end of BPF program
>   // now the timer is leaked, no one will delete it
> }
> 
> We can not enforce the matching in the verifier, because users would
> have to call bpf_timer_del() before exiting, which is not what we want
> either.

```
bad_ebpf_code()
{
  struct bpf_timer t;
```
is not at all what was proposed. This kind of code will be rejected by the verifier.

'struct bpf_timer' has to be part of the map element and the verifier will enforce that
just like it does so for bpf_spin_lock.
Try writing the following program:
```
bad_ebpf_code()
{
  struct bpf_spin_lock t;
  bpf_spin_lock(&t);
}
``
and then follow the code to see why the verifier rejects it.

The implementation of what I'm proposing is straightforward.
I certainly understand that it might look intimidating and "impossible",
but it's really quite simple.