All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	duanxiongchun@bytedance.com,
	Dongdong Wang <wangdongdong.6@bytedance.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Cong Wang <cong.wang@bytedance.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>
Subject: Re: [RFC Patch bpf-next] bpf: introduce bpf timer
Date: Wed, 14 Apr 2021 21:25:10 -0700	[thread overview]
Message-ID: <CAADnVQLsmULxJYq9rHS4xyg=VAUeexJTh35vTWTVgjeqwX4D6g@mail.gmail.com> (raw)
In-Reply-To: <CAM_iQpWePmmpr0RKqCrQ=NPiGrq2Tx9OU9y3e4CTzFjvh5t47w@mail.gmail.com>

On Wed, Apr 14, 2021 at 9:02 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Mon, Apr 12, 2021 at 4:01 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Apr 05, 2021 at 05:36:27PM -0700, Cong Wang wrote:
> > > On Fri, Apr 2, 2021 at 4:45 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Apr 02, 2021 at 02:24:51PM -0700, Cong Wang wrote:
> > > > > > > where the key is the timer ID and the value is the timer expire
> > > > > > > timer.
> > > > > >
> > > > > > The timer ID is unnecessary. We cannot introduce new IDR for every new
> > > > > > bpf object. It doesn't scale.
> > > > >
> > > > > The IDR is per map, not per timer.
> > > >
> > > > Per-map is not acceptable. One IDR for all maps with timers is not acceptable either.
> > > > We have 3 IDRs now: for progs, for maps, and for links.
> > > > No other objects need IDRs.
> > > >
> > > > > > Here is how more general timers might look like:
> > > > > > https://lore.kernel.org/bpf/20210310011905.ozz4xahpkqbfkkvd@ast-mbp.dhcp.thefacebook.com/
> > > > > >
> > > > > > include/uapi/linux/bpf.h:
> > > > > > struct bpf_timer {
> > > > > >   u64 opaque;
> > > > > > };
> > > > > > The 'opaque' field contains a pointer to dynamically allocated struct timer_list and other data.
> > > > >
> > > > > This is my initial design as we already discussed, it does not work,
> > > > > please see below.
> > > >
> > > > It does work. The perceived "issue" you referred to is a misunderstanding. See below.
> > > >
> > > > > >
> > > > > > The prog would do:
> > > > > > struct map_elem {
> > > > > >     int stuff;
> > > > > >     struct bpf_timer timer;
> > > > > > };
> > > > > >
> > > > > > struct {
> > > > > >     __uint(type, BPF_MAP_TYPE_HASH);
> > > > > >     __uint(max_entries, 1);
> > > > > >     __type(key, int);
> > > > > >     __type(value, struct map_elem);
> > > > > > } hmap SEC(".maps");
> > > > > >
> > > > > > static int timer_cb(struct map_elem *elem)
> > > > > > {
> > > > > >     if (whatever && elem->stuff)
> > > > > >         bpf_timer_mod(&elem->timer, new_expire);
> > > > > > }
> > > > > >
> > > > > > int bpf_timer_test(...)
> > > > > > {
> > > > > >     struct map_elem *val;
> > > > > >
> > > > > >     val = bpf_map_lookup_elem(&hmap, &key);
> > > > > >     if (val) {
> > > > > >         bpf_timer_init(&val->timer, timer_cb, flags);
> > > > > >         val->stuff = 123;
> > > > > >         bpf_timer_mod(&val->timer, expires);
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > bpf_map_update_elem() either from bpf prog or from user space
> > > > > > allocates map element and zeros 8 byte space for the timer pointer.
> > > > > > bpf_timer_init() allocates timer_list and stores it into opaque if opaque == 0.
> > > > > > The validation of timer_cb() is done by the verifier.
> > > > > > bpf_map_delete_elem() either from bpf prog or from user space
> > > > > > does del_timer() if elem->opaque != 0.
> > > > > > If prog refers such hmap as above during prog free the kernel does
> > > > > > for_each_map_elem {if (elem->opaque) del_timer().}
> > > > > > I think that is the simplest way of prevent timers firing past the prog life time.
> > > > > > There could be other ways to solve it (like prog_array and ref/uref).
> > > > > >
> > > > > > Pseudo code:
> > > > > > int bpf_timer_init(struct bpf_timer *timer, void *timer_cb, int flags)
> > > > > > {
> > > > > >   if (timer->opaque)
> > > > > >     return -EBUSY;
> > > > > >   t = alloc timer_list
> > > > > >   t->cb = timer_cb;
> > > > > >   t->..
> > > > > >   timer->opaque = (long)t;
> > > > > > }
> > > > > >
> > > > > > int bpf_timer_mod(struct bpf_timer *timer, u64 expires)
> > > > > > {
> > > > > >   if (!time->opaque)
> > > > > >     return -EINVAL;
> > > > > >   t = (struct timer_list *)timer->opaque;
> > > > > >   mod_timer(t,..);
> > > > > > }
> > > > > >
> > > > > > int bpf_timer_del(struct bpf_timer *timer)
> > > > > > {
> > > > > >   if (!time->opaque)
> > > > > >     return -EINVAL;
> > > > > >   t = (struct timer_list *)timer->opaque;
> > > > > >   del_timer(t);
> > > > > > }
> > > > > >
> > > > > > The verifier would need to check that 8 bytes occupied by bpf_timer and not accessed
> > > > > > via load/store by the program. The same way it does it for bpf_spin_lock.
> > > > >
> > > > > This does not work, because bpf_timer_del() has to be matched
> > > > > with bpf_timer_init(), otherwise we would leak timer resources.
> > > > > For example:
> > > > >
> > > > > SEC("foo")
> > > > > bad_ebpf_code()
> > > > > {
> > > > >   struct bpf_timer t;
> > > > >   bpf_timer_init(&t, ...); // allocate a timer
> > > > >   bpf_timer_mod(&t, ..);
> > > > >   // end of BPF program
> > > > >   // now the timer is leaked, no one will delete it
> > > > > }
> > > > >
> > > > > We can not enforce the matching in the verifier, because users would
> > > > > have to call bpf_timer_del() before exiting, which is not what we want
> > > > > either.
> > > >
> > > > ```
> > > > bad_ebpf_code()
> > > > {
> > > >   struct bpf_timer t;
> > > > ```
> > > > is not at all what was proposed. This kind of code will be rejected by the verifier.
> > > >
> > > > 'struct bpf_timer' has to be part of the map element and the verifier will enforce that
> > > > just like it does so for bpf_spin_lock.
> > > > Try writing the following program:
> > > > ```
> > > > bad_ebpf_code()
> > > > {
> > > >   struct bpf_spin_lock t;
> > > >   bpf_spin_lock(&t);
> > > > }
> > > > ``
> > > > and then follow the code to see why the verifier rejects it.
> > >
> > > Well, embedding a spinlock makes sense as it is used to protect
> > > the value it is associated with, but for a timer, no, it has no value
> > > to associate.
> >
> > The way kernel code is using timers is alwasy by embedding timer_list
> > into another data structure and then using container_of() in a callback.
> > So all existing use cases of timers disagree with your point.
>
> Not always. Data can be passed as a global data structure visible to
> timer callback.

global data is racy. That's not an option at all.

> >
> > > Even if it does, updating it requires a lock as the
> > > callback can run concurrently with value update.
> >
> > No lock is necessary.
> > map_value_update_elem can either return EBUSY if timer_list != NULL
> > or it can del_timer() before updating the whole value.
> > Both choices can be expressed with flags.
>
> This sounds problematic, because the hash map is visible to
> users but not the timers associated, hence in user-space users
> just unexpectedly get EBUSY.

As I said earlier:
"
bpf_map_update_elem() either from bpf prog or from user space
allocates map element and zeros 8 byte space for the timer pointer.
"
and also said that EBUSY could be default or non default behavior
expressed with flags passed into update.

> >
> > > So, they are very
> > > different hence should be treated differently rather than similarly.
> > >
> > > >
> > > > The implementation of what I'm proposing is straightforward.
> > > > I certainly understand that it might look intimidating and "impossible",
> > > > but it's really quite simple.
> > >
> > > How do you refcnt the struct bpf_prog with your approach? Or with
> >
> > you don't. More so prog must not be refcnted otherwise it's a circular
> > dependency between progs and maps.
> > We did that in the past with prog_array and the api became unpleasant
> > and not user friendly. Not going to repeat the same mistake again.
>
> Then how do you prevent prog being unloaded when the timer callback
> is still active?

As I said earlier:
"
If prog refers such hmap as above during prog free the kernel does
for_each_map_elem {if (elem->opaque) del_timer().}
"

>
> >
> > > actually any attempt to create timers in kernel-space. I am not intimidated
> > > but quite happy to hear. If you do it in the verifier, we do not know which
> > > code path is actually executed when running it. If you do it with JIT, I do
> > > not see how JIT can even get the right struct bpf_prog pointer in context.
> >
> > Neither. See pseudo code for bpf_timer_init/bpf_timer_mod in the earlier email.
> >
> > > This is how I concluded it looks impossible.
> >
> > Please explain what 'impossible' or buggy you see in the pseudo code.
>
> Your pseudo code never shows how to refcnt the struct bpf_prog, which
> is critical to the bpf timer design.

As I said earlier: nack to refcnt progs.

  reply	other threads:[~2021-04-15  4:25 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-01  4:26 [RFC Patch bpf-next] bpf: introduce bpf timer Cong Wang
2021-04-01  6:38 ` Song Liu
2021-04-01 17:28   ` Cong Wang
2021-04-01 20:17     ` Song Liu
2021-04-02 17:34       ` Cong Wang
2021-04-02 17:57         ` Song Liu
2021-04-02 19:08           ` Cong Wang
2021-04-02 19:43             ` Song Liu
2021-04-02 20:57               ` Cong Wang
2021-04-02 23:31                 ` Song Liu
2021-04-05 23:49                   ` Cong Wang
2021-04-06  1:07                     ` Song Liu
2021-04-06  1:24                       ` Cong Wang
2021-04-06  6:17                         ` Song Liu
2021-04-06 16:48                           ` Cong Wang
2021-04-06 23:36                             ` Song Liu
2021-04-08 22:45                               ` Cong Wang
2021-04-02 19:28 ` Alexei Starovoitov
2021-04-02 21:24   ` Cong Wang
2021-04-02 23:45     ` Alexei Starovoitov
2021-04-06  0:36       ` Cong Wang
2021-04-12 23:01         ` Alexei Starovoitov
2021-04-15  4:02           ` Cong Wang
2021-04-15  4:25             ` Alexei Starovoitov [this message]
2021-04-15 15:51               ` Cong Wang
2021-04-26 23:00               ` Cong Wang
2021-04-26 23:05                 ` Alexei Starovoitov
2021-04-26 23:37                   ` Cong Wang
2021-04-27  2:01                     ` Alexei Starovoitov
2021-04-27 11:52                       ` Jamal Hadi Salim
2021-04-27 16:36                       ` Cong Wang
2021-04-27 18:33                         ` Alexei Starovoitov
2021-05-09  5:37                           ` Cong Wang
2021-05-10 20:55                             ` Jamal Hadi Salim
2021-05-11 21:29                               ` Cong Wang
2021-05-12 22:56                                 ` Jamal Hadi Salim
2021-05-11  5:05                             ` Joe Stringer
2021-05-11 21:08                               ` Cong Wang
2021-05-12 22:43                               ` Jamal Hadi Salim
2021-05-13 18:45                                 ` Jamal Hadi Salim
2021-05-14  2:53                                   ` Cong Wang
2021-08-11 21:03                                     ` Joe Stringer
2021-05-20 18:55 [RFC PATCH bpf-next] bpf: Introduce bpf_timer Alexei Starovoitov
2021-05-21 14:38 ` Alexei Starovoitov
2021-05-21 21:37 ` Cong Wang
2021-05-23 16:01   ` Alexei Starovoitov
2021-05-24  8:45     ` Lorenz Bauer
2021-05-25  3:16     ` Cong Wang
2021-05-25  4:59       ` Cong Wang
2021-05-25 18:21         ` Alexei Starovoitov
2021-05-25 19:35           ` Jamal Hadi Salim
2021-05-25 19:57             ` Alexei Starovoitov
2021-05-25 21:09               ` Jamal Hadi Salim
2021-05-25 22:08                 ` Alexei Starovoitov
2021-05-26 15:34                   ` Jamal Hadi Salim
2021-05-26 16:58                     ` Alexei Starovoitov
2021-05-26 18:25                       ` Jamal Hadi Salim
2021-05-30  6:36           ` Cong Wang
2021-06-02  2:00             ` Alexei Starovoitov
2021-06-02  8:48               ` Toke Høiland-Jørgensen
2021-06-02 17:54                 ` Martin KaFai Lau
2021-06-02 18:13                   ` Kumar Kartikeya Dwivedi
2021-06-02 18:26                     ` Alexei Starovoitov
2021-06-02 18:30                       ` Kumar Kartikeya Dwivedi
2021-06-02 18:46                     ` John Fastabend
2021-05-22 16:06 ` kernel test robot
2021-05-22 16:41 ` kernel test robot
2021-05-23 11:48 ` Toke Høiland-Jørgensen
2021-05-23 15:58   ` Alexei Starovoitov
2021-05-24  8:42     ` Lorenz Bauer
2021-05-24 14:48       ` Alexei Starovoitov
2021-05-24 17:33     ` Alexei Starovoitov
2021-05-24 18:39       ` Toke Høiland-Jørgensen
2021-05-24 18:38     ` Toke Høiland-Jørgensen
2021-05-24 11:49 ` Lorenz Bauer
2021-05-24 14:56   ` Alexei Starovoitov
2021-05-24 19:13     ` Andrii Nakryiko
2021-05-25  5:22       ` Cong Wang
2021-05-25 19:47         ` Andrii Nakryiko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAADnVQLsmULxJYq9rHS4xyg=VAUeexJTh35vTWTVgjeqwX4D6g@mail.gmail.com' \
    --to=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=daniel@iogearbox.net \
    --cc=duanxiongchun@bytedance.com \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    --cc=songliubraving@fb.com \
    --cc=songmuchun@bytedance.com \
    --cc=wangdongdong.6@bytedance.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.