bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* bpf timer design
@ 2021-03-09  0:11 Cong Wang
  2021-03-10  1:19 ` Alexei Starovoitov
  0 siblings, 1 reply; 6+ messages in thread
From: Cong Wang @ 2021-03-09  0:11 UTC (permalink / raw)
  To: bpf, Alexei Starovoitov
  Cc: Dongdong Wang, duanxiongchun, Yonghong Song, Andrii Nakryiko

Hi, all

I have been thinking about eBPF timer APIs, it looks harder than I thought.

The API's themselves are not hard, here is what I have:

int bpf_timer_setup(struct bpf_timer *timer, void *callback_fn,
                    void *callback_ctx, u64 flags);
int bpf_timer_mod(struct bpf_timer *timer, u64 expires);
int bpf_timer_del(struct bpf_timer *timer);

which is pretty much similar to the current kernel timer API's.

The struct bpf_timer is a bit tricky, we still have to save the kernel timer
there but we do not want eBPF programs to touch it. So I simply save a pointer
to kernel timer inside:

struct bpf_timer {
       void *ptr;
};

but we obviously have to prevent eBPF programs from dereferencing it
with the verifier.

The hardest part is on the verifier side, we have to check:

1. Whether a timer is initialized before use. For example:

struct bpf_timer t;
bpf_timer_mod(&t, bpf_jiffies64() + HZ);

2. Whether a timer is still active before exiting. For example:

struct bpf_timer t;
bpf_setup_timer(&t, ....);
bpf_timer_mod(&t, bpf_jiffies64() + HZ);
//end of the eBPF program

I do not see any existing mechanism for checks like these, so potentially need
a lot of work.

And, unlike bpf_for_each_map_elem(), the timer callback is asynchronous, this
makes it harder to check its callback context etc..

Any thoughts and suggestions?

Thanks!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: bpf timer design
  2021-03-09  0:11 bpf timer design Cong Wang
@ 2021-03-10  1:19 ` Alexei Starovoitov
  2021-03-13 19:19   ` Cong Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2021-03-10  1:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Dongdong Wang, duanxiongchun,
	Yonghong Song, Andrii Nakryiko

On Mon, Mar 08, 2021 at 04:11:40PM -0800, Cong Wang wrote:
> Hi, all
> 
> I have been thinking about eBPF timer APIs, it looks harder than I thought.
> 
> The API's themselves are not hard, here is what I have:
> 
> int bpf_timer_setup(struct bpf_timer *timer, void *callback_fn,
>                     void *callback_ctx, u64 flags);
> int bpf_timer_mod(struct bpf_timer *timer, u64 expires);
> int bpf_timer_del(struct bpf_timer *timer);
> 
> which is pretty much similar to the current kernel timer API's.
> 
> The struct bpf_timer is a bit tricky, we still have to save the kernel timer
> there but we do not want eBPF programs to touch it. So I simply save a pointer
> to kernel timer inside:
> 
> struct bpf_timer {
>        void *ptr;
> };
> 
> but we obviously have to prevent eBPF programs from dereferencing it
> with the verifier.
> 
> The hardest part is on the verifier side, we have to check:
> 
> 1. Whether a timer is initialized before use. For example:
> 
> struct bpf_timer t;
> bpf_timer_mod(&t, bpf_jiffies64() + HZ);

relatively easy. see below.

> 2. Whether a timer is still active before exiting. For example:

also easy to do, but "must do bpf_timer_del before exiting" restriction
is probably too limiting to be usable.

> struct bpf_timer t;
> bpf_setup_timer(&t, ....);
> bpf_timer_mod(&t, bpf_jiffies64() + HZ);
> //end of the eBPF program
> 
> I do not see any existing mechanism for checks like these, so potentially need
> a lot of work.

There are two ways to handle the ordering constraints:
- bpf_spin_lock style which is simple.
- may_be_acquire_function/is_release_function which is more advanced.
but the ordering issue is a small one comparing to the issue of life time of
the struct bpf_timer.
I'm assuming that it will be related one to one with struct timer_list.
Ideally the api would be similar to kernel and struct bpf_spin_lock
demonstrated how it can be done. Unlike bpf_spin_lock which is
always unlocked by the time program ends the bpf_timer will be still
enqueued in the timer wheel when prog exits. So its life time should
be separate from the program execution life time.
The only bpf concept with such properties is a bpf map.
Therefore one the ways would be to do each timer as a map element.
Both array of timers and hash of timers would be needed.
The map_lookup_elem would return a pointer to opaque struct bpf_timer.
And then bpf_timer_init/mod/del can operate on that pointer.
The verifier can be taught to check that timer_init() is called
before timer_mod(), but it's simple enough to do in run-time.
The timer_mod() operation is expensive. Few run-time checks
will be negligible. For example bpf_time_mod() can check that
'function' pointer was not-NULL. Otherwise run-time EINVAL.
Since struct bpf_timer is in a map, it's either zero-inited
at element creation time or bpf_timer_init() was called on it.
The bpf_spin_lock has a limitation that it can only be one per map
element which allowed to simplify the verifier code a lot.
I'm suggesting to use the same restriction for bpf_timer.
Implementation wise I hope it won't be as hard coded as process_spin_lock().
High level I'm proposing 'struct bpf_timer { u64 opaque; }'
as part of uapi/bpf.h.
The bpf program can place it inside normal array/hash maps.
The 'opaque' field will contain a pointer to dynamically allocated
'struct timer_list'.
The prog would do:
struct map_elem {
    int stuff;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(struct map_elem *elem)
{
    if (whatever && elem->stuff)
        bpf_timer_mod(&elem->timer, new_expire);
}

int bpf_timer_test(...)
{
    struct map_elem *val;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, timer_cb, flags);
        val->stuff = 123;
        bpf_timer_mod(&val->timer, expires);
    }
}

The map and prog destruction process would need to do del_timer()
on all map elements that have embedded struct bpf_timer before
freeing prog and map. Currently we don't have such constraint
with bpf_spin_lock, but it shouldn't be hard to add.
Similarly bpf_map_delete_elem() would need to do del_timer too.
bpf_map_update_elem can skip touching part of the value
that has struct bpf_timer. Again similar to bpf_spin_lock.
See copy_map_value.
Too really simplify the implementation we can restrict that
either bpf_spin_lock or bpf_timer can be inside the element.
(not both at the same time).

Of course there are other ways to design bpf_timer api.
I think the main advantage of connecting bpf_timer with
a map element is the control of timer life time and
additional data that timer_cb() will receive.
The 'void *callback_ctx' in the beginning of your email has the same
life time issue. It's difficult to make it part of bpf_timer_init().
Instead bpf prog can store additional data into map element.
It's not as flexible as bpf_for_each_map_elem that can
take a pointer to stack, but with timers it's not trivial
to guarantee that the stack is valid at the time callback is fired.
I think it should be flexible enough timer api.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: bpf timer design
  2021-03-10  1:19 ` Alexei Starovoitov
@ 2021-03-13 19:19   ` Cong Wang
  2021-03-17  4:21     ` Cong Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Cong Wang @ 2021-03-13 19:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Dongdong Wang, duanxiongchun,
	Yonghong Song, Andrii Nakryiko

On Tue, Mar 9, 2021 at 5:19 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Mar 08, 2021 at 04:11:40PM -0800, Cong Wang wrote:
> > Hi, all
> >
> > I have been thinking about eBPF timer APIs, it looks harder than I thought.
> >
> > The API's themselves are not hard, here is what I have:
> >
> > int bpf_timer_setup(struct bpf_timer *timer, void *callback_fn,
> >                     void *callback_ctx, u64 flags);
> > int bpf_timer_mod(struct bpf_timer *timer, u64 expires);
> > int bpf_timer_del(struct bpf_timer *timer);
> >
> > which is pretty much similar to the current kernel timer API's.
> >
> > The struct bpf_timer is a bit tricky, we still have to save the kernel timer
> > there but we do not want eBPF programs to touch it. So I simply save a pointer
> > to kernel timer inside:
> >
> > struct bpf_timer {
> >        void *ptr;
> > };
> >
> > but we obviously have to prevent eBPF programs from dereferencing it
> > with the verifier.
> >
> > The hardest part is on the verifier side, we have to check:
> >
> > 1. Whether a timer is initialized before use. For example:
> >
> > struct bpf_timer t;
> > bpf_timer_mod(&t, bpf_jiffies64() + HZ);
>
> relatively easy. see below.
>
> > 2. Whether a timer is still active before exiting. For example:
>
> also easy to do, but "must do bpf_timer_del before exiting" restriction
> is probably too limiting to be usable.

Well, if the timer callback uses, for example, a pointer to a variable
on stack, then I am afraid we have to delete it before returning.


>
> > struct bpf_timer t;
> > bpf_setup_timer(&t, ....);
> > bpf_timer_mod(&t, bpf_jiffies64() + HZ);
> > //end of the eBPF program
> >
> > I do not see any existing mechanism for checks like these, so potentially need
> > a lot of work.
>
> There are two ways to handle the ordering constraints:
> - bpf_spin_lock style which is simple.

This is the first I looked at, unfortunately bpf spinlock is too limited,
it does not even allow nesting, but for timer, "nesting" should be fine:

bpf_timer_setup(&t1...);
bpf_timer_setup(&t2...);
bpf_timer_mod(&t1...);
bpf_timer_mod(&t2...);
bpf_timer_del(&t1...);
bpf_timer_del(&t2...);


> - may_be_acquire_function/is_release_function which is more advanced.
> but the ordering issue is a small one comparing to the issue of life time of
> the struct bpf_timer.

Yeah, this is what I have been looking at. The major difference is we
should be able to init a timer without even using it:

bpf_timer_setup(&t...);
//end of program

With acquire/release syntax, they have to be paired.

> I'm assuming that it will be related one to one with struct timer_list.

Yes, this is what I meant by saving a pointer to struct timer_list
inside struct bpf_timer.

> Ideally the api would be similar to kernel and struct bpf_spin_lock
> demonstrated how it can be done. Unlike bpf_spin_lock which is
> always unlocked by the time program ends the bpf_timer will be still
> enqueued in the timer wheel when prog exits. So its life time should
> be separate from the program execution life time.

Right, I am sure we can take a refcnt to the program itself, however
it looks really weird if we still let timer run after the program exits,
because the timer could run infinitely by arming itself in the callback.


> The only bpf concept with such properties is a bpf map.
> Therefore one the ways would be to do each timer as a map element.
> Both array of timers and hash of timers would be needed.
> The map_lookup_elem would return a pointer to opaque struct bpf_timer.
> And then bpf_timer_init/mod/del can operate on that pointer.
> The verifier can be taught to check that timer_init() is called
> before timer_mod(), but it's simple enough to do in run-time.
> The timer_mod() operation is expensive. Few run-time checks
> will be negligible. For example bpf_time_mod() can check that
> 'function' pointer was not-NULL. Otherwise run-time EINVAL.
> Since struct bpf_timer is in a map, it's either zero-inited
> at element creation time or bpf_timer_init() was called on it.

Interestingly, we discussed this solution at Bytedance before, our
conclusion is using a map is not as flexible as making it independent.


> The bpf_spin_lock has a limitation that it can only be one per map
> element which allowed to simplify the verifier code a lot.
> I'm suggesting to use the same restriction for bpf_timer.

This limit is fine, at least for timeout hashmap, but the limitation of
nesting mentioned above is not.

> Implementation wise I hope it won't be as hard coded as process_spin_lock().
> High level I'm proposing 'struct bpf_timer { u64 opaque; }'
> as part of uapi/bpf.h.
> The bpf program can place it inside normal array/hash maps.
> The 'opaque' field will contain a pointer to dynamically allocated
> 'struct timer_list'.
> The prog would do:
> struct map_elem {
>     int stuff;
>     struct bpf_timer timer;
> };
>
> struct {
>     __uint(type, BPF_MAP_TYPE_HASH);
>     __uint(max_entries, 1);
>     __type(key, int);
>     __type(value, struct map_elem);
> } hmap SEC(".maps");
>
> static int timer_cb(struct map_elem *elem)
> {
>     if (whatever && elem->stuff)
>         bpf_timer_mod(&elem->timer, new_expire);
> }
>
> int bpf_timer_test(...)
> {
>     struct map_elem *val;
>
>     val = bpf_map_lookup_elem(&hmap, &key);
>     if (val) {
>         bpf_timer_init(&val->timer, timer_cb, flags);
>         val->stuff = 123;
>         bpf_timer_mod(&val->timer, expires);
>     }
> }

I see, it looks like you use a map to limit the lifetime of the timers.
Our internal discussion actually went further, we wanted to introduce
a special type of map just for timers, for example, key could be callback,
value could be expires.

>
> The map and prog destruction process would need to do del_timer()
> on all map elements that have embedded struct bpf_timer before
> freeing prog and map. Currently we don't have such constraint
> with bpf_spin_lock, but it shouldn't be hard to add.
> Similarly bpf_map_delete_elem() would need to do del_timer too.
> bpf_map_update_elem can skip touching part of the value
> that has struct bpf_timer. Again similar to bpf_spin_lock.
> See copy_map_value.

If we have a timer map, all of these can be hidden under the map
ops.

> Too really simplify the implementation we can restrict that
> either bpf_spin_lock or bpf_timer can be inside the element.
> (not both at the same time).
>
> Of course there are other ways to design bpf_timer api.
> I think the main advantage of connecting bpf_timer with
> a map element is the control of timer life time and
> additional data that timer_cb() will receive.
> The 'void *callback_ctx' in the beginning of your email has the same
> life time issue. It's difficult to make it part of bpf_timer_init().
> Instead bpf prog can store additional data into map element.
> It's not as flexible as bpf_for_each_map_elem that can
> take a pointer to stack, but with timers it's not trivial
> to guarantee that the stack is valid at the time callback is fired.
> I think it should be flexible enough timer api.

Agreed. If we enforce a map here, users would not be able to
use a standalone timer, but I think this is okay, we have to
make a trade-off anyway.

Please let me know what you think about introducing a timer
map, something like below:

struct {
     __uint(type, BPF_MAP_TYPE_TIMER);
} map SEC(".maps");

struct bpf_timer t;

static int timer_cb(void *arg)
{
  // show how to rearm a timer
  u64 new_expires = ...;
  bpf_map_update_elem(&map, &t, &new_expires, 0);
}

int bpf_timer_test(...)
{
  u64 expires;

  bpf_timer_init(&t, timer_cb, arg);

  // bpf_map_update_elem() rejects it if uninitialized
  bpf_map_update_elem(&map, &t, &expires, 0);

  // wait for timer deletion synchronously
  bpf_map_delete_elem(&map, &t);
}


Thanks a lot!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: bpf timer design
  2021-03-13 19:19   ` Cong Wang
@ 2021-03-17  4:21     ` Cong Wang
  2021-03-17 17:26       ` Cong Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Cong Wang @ 2021-03-17  4:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Dongdong Wang, duanxiongchun,
	Yonghong Song, Andrii Nakryiko

On Sat, Mar 13, 2021 at 11:19 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Please let me know what you think about introducing a timer
> map, something like below:
>
> struct {
>      __uint(type, BPF_MAP_TYPE_TIMER);
> } map SEC(".maps");
>
> struct bpf_timer t;

After some thoughts, I think the following solution is much better.

We still need a timer map:

struct {
     __uint(type, BPF_MAP_TYPE_TIMER);
} map SEC(".maps");

However, its key is not a pointer to timer, it is a timer ID allocated with

u32 bpf_timer_create(void *callback, void *arg, u64 flags);

which returns a globally unique ID. So, we end up having code like
this:

u32 timer_id;

static int timer_cb(void *arg)
{
  // show how to rearm a timer
  u64 new_expires = ...;
  bpf_map_update_elem(&map, &timer_id, &new_expires, 0);
}

int bpf_timer_test(...)
{
  u64 expires = ...;

  timer_id = bpf_timer_create(timer_cb, arg, 0);
  bpf_map_update_elem(&map, &timer_id, &expires, 0);

  // wait for timer deletion synchronously
  bpf_map_delete_elem(&map, &timer_id);
}

In kernel, we can use an IDR to allocate these ID's and save a kernel
timer pointer for each ID there.

With this solution, we don't need to change much in the verifier,
probably only verifying the callback arg pointer for bpf_timer_create().

Any thoughts on this proposal?

Thanks!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: bpf timer design
  2021-03-17  4:21     ` Cong Wang
@ 2021-03-17 17:26       ` Cong Wang
  2021-03-17 18:20         ` Alexei Starovoitov
  0 siblings, 1 reply; 6+ messages in thread
From: Cong Wang @ 2021-03-17 17:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Dongdong Wang, duanxiongchun,
	Yonghong Song, Andrii Nakryiko

On Tue, Mar 16, 2021 at 9:21 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> We still need a timer map:
>
> struct {
>      __uint(type, BPF_MAP_TYPE_TIMER);
> } map SEC(".maps");
>
> However, its key is not a pointer to timer, it is a timer ID allocated with
>
> u32 bpf_timer_create(void *callback, void *arg, u64 flags);

Hmm, we do not need a map at all, because the verifier could check
whether create() and delete() are paired correctly, so we can just
have the following API's:

u32 bpf_timer_create(void *callback, void *arg, u32 flags);
void bpf_timer_settime(u32 id, u64 expires);
u64 bpf_timer_gettime(u32 id);
int bpf_timer_delete(u32 id);

Pretty much similar to Linux user-space timer API's. I will probably
go this direction, unless there is any objection.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: bpf timer design
  2021-03-17 17:26       ` Cong Wang
@ 2021-03-17 18:20         ` Alexei Starovoitov
  0 siblings, 0 replies; 6+ messages in thread
From: Alexei Starovoitov @ 2021-03-17 18:20 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, Alexei Starovoitov, Dongdong Wang, duanxiongchun,
	Yonghong Song, Andrii Nakryiko

On Wed, Mar 17, 2021 at 10:26 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Mar 16, 2021 at 9:21 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > We still need a timer map:
> >
> > struct {
> >      __uint(type, BPF_MAP_TYPE_TIMER);
> > } map SEC(".maps");
> >
> > However, its key is not a pointer to timer, it is a timer ID allocated with
> >
> > u32 bpf_timer_create(void *callback, void *arg, u64 flags);
>
> Hmm, we do not need a map at all, because the verifier could check
> whether create() and delete() are paired correctly, so we can just
> have the following API's:
>
> u32 bpf_timer_create(void *callback, void *arg, u32 flags);
> void bpf_timer_settime(u32 id, u64 expires);
> u64 bpf_timer_gettime(u32 id);
> int bpf_timer_delete(u32 id);
>
> Pretty much similar to Linux user-space timer API's. I will probably
> go this direction, unless there is any objection.

I think a specialized map or hidden map that returns id like above
has plenty of downsides.
Please reconsider what I was proposing.
In the previous email I outlined the reasons why 'struct bpf_timer'
embedded in any normal map is more user friendly and more flexible.
I'd like to discuss those points first. It sounds to me that you disagreed,
but I couldn't find an articulation on why you disagreed.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-03-17 18:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-09  0:11 bpf timer design Cong Wang
2021-03-10  1:19 ` Alexei Starovoitov
2021-03-13 19:19   ` Cong Wang
2021-03-17  4:21     ` Cong Wang
2021-03-17 17:26       ` Cong Wang
2021-03-17 18:20         ` Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).