Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, kernel-team <kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
Date: Tue, 15 Jun 2021 21:53:47 -0700	[thread overview]
Message-ID: <20210616045347.rpxbze7trw7lzrg5@ast-mbp> (raw)
In-Reply-To: <CAM_iQpV2fv=MMhf3w+YpGDXCYaMKVO_hoACL0=oXmn_pDUVexg@mail.gmail.com>

On Mon, Jun 14, 2021 at 11:10:46PM -0700, Cong Wang wrote:
> On Fri, Jun 11, 2021 at 11:45 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> > > On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Please stick to one email thread in the future, ok?
> >
> > I'll consolidate them here:
> >
> > > What is your use case to justify your own code? Asking because
> > > you deny mine, so clearly my use case is not yours.
> >
> > I mentioned several use cases in the prior threads.
> > tldr: any periodic event in tracing, networking, security.
> > Garbage collection falls into this category as well, but please internalize
> > that implementing conntrack as it is today in the kernel is an explicit non-goal.
> 
> You need to read my use case again, it is for the conntrack
> in Cilium, not the kernel one.

Cong,

the toxicity in your reply is a bit too much.
Clearly you're very upset with something.
You cannot make peace that your timer patches were rejected?
You realize that they were 'changes requested' with specific
feedback of what you can do?
You've said that it's not possible and came up with reasons to believe so.
I had no choice, but to implement the suggested changes myself.
Why? Because the requests for bpf timers came up many times over the years.
The first time it was in 2013 when bpf didn't even exist upstream.
It was still in RFC stages. Then during XDP development, etc.
Everytime folks who needed timers found a workaround.
Even today the garbage collection can be implemented in the bpf prog
without PROG_RUN cmd and user space. The prog can be attached
to periodic perf event.
If cilium conntrack needed garbage collection they could have
implemented it long ago without user space daemons and kernel additions.

> > > 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
> >
> > yes.
> 
> Good. So if a program which only initializes the timer and then exits,
> the other program which shares this timer will crash when it calls
> bpf_timer_start(), right?

Correct. There is a discussion about it in a different thread.

> 
> >
> > > If the timer subprog is always in the same prog which installs it, then
> >
> > installs it? I'm not following the quesiton.
> >
> > > this is fine. But then how do multiple programs share a timer?
> >
> > there is only one callback function.
> 
> That's exactly my question. How is one callback function shared
> by multiple eBPF programs which want to share the timer?

callback function is a piece of .text. Inside C program it's written once
and then compiled once into single piece of BPF asm code by llvm.
Later libbpf can copy-paste it into many bpf programs.
The C programmer doesn't see it and doesn't need to worry about it.
From the kernel memory consumption point of view it's a bit inefficient.
In the future we will add support for dynamic linking in the kernel
and in libbpf. The bpf progs will be able to share already loaded subprograms.

> 
> >
> > > In the
> > > case of conntrack, either ingress or egress could install the timer,
> > > it only depends which one gets traffic first. Do they have to copy
> > > the same subprog for the same timer?
> >
> > conntrack is an explicit non-goal.
> 
> I interpret this as you do not want timers to be shared by multiple
> eBPF programs, correct? Weirdly, maps are shared, timers are
> placed in a map, so timers should be shared naturally too.

I only meant that re-implementing kernel conntrack as a bpf program
is an explicit non-goal.
Meaning that people can do it and with this timer api it can be easily done.
But it's explicitly excluded from api requirements.
It doesn't mean that it's hard to do. It means that reimplenting kernel
conntrack as-is with non-scalable garbage collection algorithm
is outside of the scope of this work.

> > > 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> > > again?
> >
> > If it's already armed with the first bpf_timer_start() it won't be freed.
> 
> Why? I see t->prog is released in your timer callback:

That's another race and there is another thread where it's being discussed.
Please read the whole thread to get on the same page with everyone else.

> 
> Not exactly, they share a same C file but still can be loaded/unloaded
> separately. And logically, whether a timer has been initialized once or
> twice makes a huge difference for programers.

I've looked how kernel is using hrtimer apis and didn't find a single
case where timer function is redefined or more than one timer callback is used
with single hrtimer.

> Sounds like you find a great excuse for a failure of documentation.

I got different feedback regarding documentation that is already present in
the patches, but I'll expand the comments and documentation further
to make it more clear.