From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>,
	davem@davemloft.net, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	kernel-team@fb.com, mingo@redhat.com, will.deacon@arm.com,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	jannh@google.com
Subject: Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
Date: Fri, 25 Jan 2019 15:42:43 -0800	[thread overview]
Message-ID: <20190125234241.soomtkrgp2i7m7ul@ast-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <20190125091057.GK17749@hirez.programming.kicks-ass.net>

On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> > > 
> > > Thanks for having kernel/locking people on Cc...
> > > 
> > > On Wed, Jan 23, 2019 at 08:13:55PM -0800, Alexei Starovoitov wrote:
> > > 
> > > > Implementation details:
> > > > - on !SMP bpf_spin_lock() becomes nop
> > > 
> > > Because no BPF program is preemptible? I don't see any assertions or
> > > even a comment that says this code is non-preemptible.
> > > 
> > > AFAICT some of the BPF_PROG_RUN things are under rcu_read_lock() only,
> > > which is not sufficient.
> > 
> > Nope. All bpf prog types disable preemption. That is a must-have for
> > all sorts of things to work properly.
> > If there is a prog type that is doing rcu_read_lock only, it's a serious bug.
> > About a year or so ago we audited everything specifically to make
> > sure everything disables preemption before calling bpf progs.
> > I'm pretty sure nothing crept in in the meantime.
> 
> Do we want something like (the completely untested) below to avoid
> having to manually audit this over and over?
> 
> ---
>  include/linux/filter.h |  2 +-
>  include/linux/kernel.h |  9 +++++++--
>  kernel/sched/core.c    | 28 ++++++++++++++++++++++++++++
>  3 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d531d4250bff..4ab51e78da36 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -513,7 +513,7 @@ struct sk_filter {
>  	struct bpf_prog	*prog;
>  };
>  
> -#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
> +#define BPF_PROG_RUN(filter, ctx)  ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); })

That looks reasonable and I intend to apply this patch to bpf-next after testing.
Can you please reply with a Signed-off-by?
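
To spell out what such an assertion might look like (the
include/linux/kernel.h and kernel/sched/core.c hunks were trimmed from
the quote above; this is a minimal sketch modeled on the existing
might_sleep() machinery, not necessarily your actual hunks):

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
extern void __cant_sleep(const char *file, int line, int preempt_offset);
/* assert that the current context cannot sleep */
# define cant_sleep() \
	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
#else
# define cant_sleep() do { } while (0)
#endif

/* and on the kernel/sched/core.c side, something like: */
void __cant_sleep(const char *file, int line, int preempt_offset)
{
	/* irqs off or elevated preempt_count means we cannot sleep: ok */
	if (irqs_disabled() || preempt_count() > preempt_offset)
		return;
	/* otherwise this context could sleep: complain once */
	WARN_ONCE(1, "BUG: assuming atomic context at %s:%d\n", file, line);
}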

Daniel and I did a bit of git archaeology into how we ended up with this
missing preempt_disable on the send side of socket filters...
The main reason is that classic BPF was preemptible from the beginning
and has had the SKF_AD_CPU feature for 10+ years,
meaning that classic BPF programs can read the current cpu as a hint even
though they are preemptible.
raw_smp_processor_id() was used there to avoid triggering a false warning.
When eBPF came along we kept it as-is during the classic->extended conversion.
Back then there were no per-cpu maps.
When per-cpu maps were introduced we missed the preemptible sendmsg() case.
This cant_sleep() check should help us catch such cases in the future.

The easiest fix is to add preempt_disable/enable for socket filters.
There is a concern that such a fix will make classic BPF non-preemptible,
and classic BPF can be quite cpu expensive.
For example, it's allowed to have 4k classic instructions that do
'load SKF_AD_PAY_OFFSET', which will call the heavy flow_dissector 4k times.

We could do preempt_disable for extended only. Then the BPF_PROG_RUN macro
and cant_sleep() would need to be adjusted accordingly.
My preference would be to do preempt_disable for both classic and extended,
since classic is only a legacy uapi at this point. Nothing on the kernel
side makes it special and I'd like to keep it that way.
Also, on the receive side classic runs in bh, so 4k flow_dissector calls
in classic have to be dealt with anyway.
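
Concretely, the shape of that fix would be something like the following
sketch (the helper name and its placement are illustrative only):

/* wrap the program run for the preemptible socket-filter send path;
 * disabling preemption makes per-cpu maps safe there and satisfies
 * the cant_sleep() assertion above */
static inline u32 bpf_prog_run_nopreempt(const struct bpf_prog *prog,
					 struct sk_buff *skb)
{
	u32 ret;

	preempt_disable();
	ret = BPF_PROG_RUN(prog, skb);
	preempt_enable();
	return ret;
}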

> > nmi checks for bpf_prog_active==0. See bpf_overflow_handler.
> yuck yuck yuck.. That's horrific :-( That means the whole BPF crud is
> unreliable and events can go randomly missing.

bpf_prog_active is the mechanism to work around non-reentrant pieces of the kernel.
Before that we couldn't do this:
sudo funccount.py *spin_lock*
Tracing 5 functions for "*spin_lock*"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
queued_spin_lock_slowpath                 682
_raw_spin_lock_bh                         997
_raw_spin_lock_irq                      10586
_raw_spin_lock_irqsave                  90666
_raw_spin_lock                         318300
Detaching...

Now we can.
This power comes with the 'horrific' caveat that two
tracing progs cannot execute on the same cpu at the same time.
It's not a pretty fix for the reentrancy issue.
If anyone has better ideas, I'm all ears.
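
For reference, the guard looks roughly like this (simplified from
trace_call_bpf() in kernel/trace/bpf_trace.c):

unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
	unsigned int ret;

	if (in_nmi()) /* not supported yet */
		return 1;

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
		/* another tracing prog is already running on this cpu;
		 * drop the event instead of recursing */
		ret = 0;
		goto out;
	}
	ret = BPF_PROG_RUN_ARRAY_CHECK(call->prog_array, ctx, BPF_PROG_RUN);
 out:
	__this_cpu_dec(bpf_prog_active);
	preempt_enable();
	return ret;
}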

> What about the progs that run from SoftIRQ ? Since that bpf_prog_active
> thing isn't inside BPF_PROG_RUN() what is to stop say:
> 
>    reuseport_select_sock()
>      ...
>        BPF_PROG_RUN()
>          bpf_spin_lock()
>         <IRQ>
> 	  ...
> 	  BPF_PROG_RUN()
> 	    bpf_spin_lock() // forever more
> 
> 	</IRQ>
> 
> Unless you stick that bpf_prog_active stuff inside BPF_PROG_RUN itself,
> I don't see how you can fundamentally avoid this happening (now or in
> the future).

The BPF_PROG_RUN macro is used as a template for another macro:
  ret = BPF_PROG_RUN_ARRAY(&bpf_prog_array, ctx, BPF_PROG_RUN);
see kernel/trace/bpf_trace.c.
Doing the bpf_prog_active dance for every prog in the array would be too expensive.
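
Simplified, the array runner expands the passed-in macro once per prog
(this sketch ignores the RCU, preemption and cgroup-storage handling that
the real macro in include/linux/bpf.h does):

#define BPF_PROG_RUN_ARRAY(array, ctx, func)			\
	({							\
		struct bpf_prog_array_item *_item;		\
		u32 _ret = 1;					\
		_item = &(array)->items[0];			\
		while (_item->prog) {				\
			/* a 0 return from any prog wins */	\
			_ret &= func(_item->prog, ctx);		\
			_item++;				\
		}						\
		_ret;						\
	})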

But your issue above is valid.
We don't use bpf_prog_active for networking progs, since we allow
one level of nesting due to the classic SKF_AD_PAY_OFFSET legacy.
We also allow tracing progs to nest with networking progs.
People are using this actively.
Typically it's not an issue, since in networking there is no
arbitrary nesting (unlike kprobe/nmi in tracing),
but for bpf_spin_lock it can be, since the same map can be shared
by networking and tracing progs, and the above deadlock would be possible
(the first BPF_PROG_RUN would be from a networking prog, then a kprobe+bpf
BPF_PROG_RUN accessing the same map with bpf_spin_lock).

So for now I'm going to allow bpf_spin_lock in networking progs only,
since there is no arbitrary nesting there.
And once we figure out the safety concerns for kprobe/tracepoint progs
we can enable bpf_spin_lock there too.
NMI bpf progs will never have bpf_spin_lock.
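
To illustrate the usage this enables on the networking side, here is a
sketch along the lines of the selftests later in this series (map and
section names are illustrative; it assumes a libbpf with BTF-style map
definitions, which the verifier needs to find the lock's offset):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* map value shared between progs; the verifier learns the offset of
 * the embedded lock from BTF */
struct counter {
	struct bpf_spin_lock lock;
	int cnt;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct counter);
} counter_map SEC(".maps");

SEC("cgroup_skb/ingress")
int count_packets(struct __sk_buff *skb)
{
	struct counter *val;
	int key = 0;

	val = bpf_map_lookup_elem(&counter_map, &key);
	if (!val)
		return 1;
	bpf_spin_lock(&val->lock);
	val->cnt++;		/* serialized across progs and cpus */
	bpf_spin_unlock(&val->lock);
	return 1;
}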

Re: the memory model... I will reply in the other thread.


Thread overview: 62+ messages
2019-01-24  4:13 [PATCH v4 bpf-next 0/9] introduce bpf_spin_lock Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 1/9] bpf: " Alexei Starovoitov
2019-01-24 18:01   ` Peter Zijlstra
2019-01-24 18:56     ` Peter Zijlstra
2019-01-24 23:42       ` Paul E. McKenney
2019-01-25  0:05         ` Alexei Starovoitov
2019-01-25  1:22           ` Paul E. McKenney
2019-01-25  1:46             ` Jann Horn
2019-01-25  2:38               ` Alexei Starovoitov
2019-01-25  4:27                 ` Alexei Starovoitov
2019-01-25  4:31                   ` Paul E. McKenney
2019-01-25  4:47                     ` Alexei Starovoitov
2019-01-25 16:02                       ` Paul E. McKenney
2019-01-25  4:11               ` Paul E. McKenney
2019-01-25 16:18                 ` Jann Horn
2019-01-25 22:51                   ` Paul E. McKenney
2019-01-25 23:44                     ` Alexei Starovoitov
2019-01-26  0:43                       ` Jann Horn
2019-01-26  0:59                         ` Jann Horn
2019-01-24 23:58     ` Alexei Starovoitov
2019-01-25  0:18       ` Jann Horn
2019-01-25  2:49         ` Alexei Starovoitov
2019-01-25  2:29       ` Eric Dumazet
2019-01-25  2:34         ` Alexei Starovoitov
2019-01-25  2:44           ` Eric Dumazet
2019-01-25  2:57             ` Alexei Starovoitov
2019-01-25  8:38               ` Peter Zijlstra
2019-01-25  9:10       ` Peter Zijlstra
2019-01-25 23:42         ` Alexei Starovoitov [this message]
2019-01-28  8:24           ` Peter Zijlstra
2019-01-28  8:31           ` Peter Zijlstra
2019-01-28  8:35             ` Peter Zijlstra
2019-01-28 20:49               ` Alexei Starovoitov
2019-01-28  8:43           ` Peter Zijlstra
2019-01-28 21:37             ` Alexei Starovoitov
2019-01-29  8:59               ` Peter Zijlstra
2019-01-30  2:20                 ` Alexei Starovoitov
2019-01-25  9:59       ` Peter Zijlstra
2019-01-25 10:09       ` Peter Zijlstra
2019-01-25 10:23       ` Peter Zijlstra
2019-01-26  0:17         ` bpf memory model. Was: " Alexei Starovoitov
2019-01-28  9:24           ` Peter Zijlstra
2019-01-28 21:56             ` Alexei Starovoitov
2019-01-29  9:16               ` Peter Zijlstra
2019-01-30  2:32                 ` Alexei Starovoitov
2019-01-30  8:58                   ` Peter Zijlstra
2019-01-30 19:36                     ` Alexei Starovoitov
2019-01-30 18:11               ` Will Deacon
2019-01-30 18:36                 ` Paul E. McKenney
2019-01-30 19:51                   ` Alexei Starovoitov
2019-01-30 21:05                     ` Paul E. McKenney
2019-01-30 22:57                       ` Alexei Starovoitov
2019-01-31 14:01                         ` Paul E. McKenney
2019-01-31 18:47                           ` Alexei Starovoitov
2019-02-01 14:05                             ` Paul E. McKenney
2019-01-30 19:50                 ` Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 2/9] bpf: add support for bpf_spin_lock to cgroup local storage Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 3/9] tools/bpf: sync include/uapi/linux/bpf.h Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 4/9] selftests/bpf: add bpf_spin_lock tests Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 5/9] selftests/bpf: add bpf_spin_lock C test Alexei Starovoitov
2019-01-24  4:14 ` [PATCH v4 bpf-next 6/9] bpf: introduce BPF_F_LOCK flag Alexei Starovoitov
2019-01-24  4:14 ` [PATCH v4 bpf-next 7/9] tools/bpf: sync uapi/bpf.h Alexei Starovoitov
