Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data

From: Matt Mullins <mmullins@fb.com>
To: "daniel@iogearbox.net" <daniel@iogearbox.net>,
	"andrii.nakryiko@gmail.com" <andrii.nakryiko@gmail.com>
Cc: Song Liu <songliubraving@fb.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"ast@kernel.org" <ast@kernel.org>, Andrew Hall <hall@fb.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"Martin Lau" <kafai@fb.com>, Yonghong Song <yhs@fb.com>
Subject: Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data
Date: Fri, 14 Jun 2019 17:25:39 +0000
Message-ID: <82327aade6a42e838bfb0c2399a63eb9baed57b7.camel@fb.com>
In-Reply-To: <e9665520-0523-def3-ddbb-59137694b029@iogearbox.net>

On Fri, 2019-06-14 at 16:50 +0200, Daniel Borkmann wrote:
> On 06/14/2019 02:51 AM, Matt Mullins wrote:
> > On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> > > On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > > > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins <mmullins@fb.com> wrote:
> > > > > 
> > > > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > > > > they do not increment bpf_prog_active while executing.
> > > > > 
> > > > > This enables three levels of nesting, to support
> > > > >   - a kprobe or raw tp or perf event,
> > > > >   - another one of the above that irq context happens to call, and
> > > > >   - another one in nmi context
> > > > > (at most one of which may be a kprobe or perf event).
> > > > > 
> > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
> > > 
> > > Generally, looks good to me. Two things below:
> > > 
> > > Nit, for stable, shouldn't the Fixes tag be c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
> > > instead of the one you currently have?
> > 
> > Ah, yeah, that's probably more reasonable; I haven't managed to come up
> > with a scenario where one could hit this without raw tracepoints.  I'll
> > fix up the nits that've accumulated since v2.
> > 
> > > One more question / clarification: we have __bpf_trace_run() vs trace_call_bpf().
> > > 
> > > Only raw tracepoints can be nested since the rest has the bpf_prog_active per-CPU
> > > counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw ones use
> > > the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT").
> > > 
> > > 1) I tried to recall and find a rationale for the mentioned trace_call_bpf() split in
> > > the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely because of
> > > performance overhead (and desire to not miss events as a result of nesting)? (This
> > > also means we're not protected by bpf_prog_active in all the map ops, of course.)
> > > 2) Wouldn't this also mean that we only need to fix the raw tp programs via
> > > get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this duplication for
> > > the rest which relies upon trace_call_bpf()? I'm probably missing something, but
> > > given they have separate pt_regs there, how could they be affected then?
> > 
> > For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
> > the _raw_tp variants.  However, consider the following nesting:
> > 
> >                                     trace_nest_level raw_tp_nest_level
> >   (kprobe) bpf_perf_event_output            1               0
> >   (raw_tp) bpf_perf_event_output_raw_tp     2               1
> >   (raw_tp) bpf_get_stackid_raw_tp           2               2
> > 
> > I need to increment a nest level (and ideally increment it only once)
> > between the kprobe and the first raw_tp, because they would otherwise
> > share the struct perf_sample_data.  But I also need to increment a nest
> 
> I'm not sure I follow on this one: the former would still keep using the
> bpf_trace_sd as-is today since only ever /one/ can be active on a given CPU
> as we otherwise bail out in trace_call_bpf() due to bpf_prog_active counter.
> Given these two are /not/ shared, you only need the code you have below for
> nesting to deal with the raw_tps via get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs()
> which should also simplify the code quite a bit.

bpf_perf_event_output_raw_tp calls ____bpf_perf_event_output, so it
currently shares bpf_trace_sd with kprobes -- it _can_ be nested.
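
To make the sharing concrete, here is a rough userspace sketch of the
"one slot per nest level" idea. It is not the patch itself: struct
sample_data, sample_slots, get_sample_slot() and put_sample_slot() are
hypothetical names, and a plain static counter stands in for what would
be a per-CPU variable in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Three levels: task context, irq context, nmi context. */
#define MAX_NEST 3

/* Hypothetical stand-in for struct perf_sample_data. */
struct sample_data {
	unsigned long raw;
};

/*
 * State for a single "CPU". In the kernel these would be per-CPU
 * variables; plain statics keep the sketch runnable in userspace.
 */
static struct sample_data sample_slots[MAX_NEST];
static int nest_level;

/* Claim the slot for the current nesting depth, or NULL if too deep. */
static struct sample_data *get_sample_slot(void)
{
	int level = ++nest_level;

	if (level > MAX_NEST) {
		/* Undo the increment so the counter stays balanced. */
		nest_level--;
		return NULL;
	}
	return &sample_slots[level - 1];
}

/* Release the innermost slot; must pair with a successful get. */
static void put_sample_slot(void)
{
	assert(nest_level > 0);
	nest_level--;
}
```

With a single shared slot, the raw_tp in the second row of the table
above would overwrite the kprobe's data; with a slot per level, each
nested caller gets its own.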

> 
> > level between the two raw_tps, since they share the pt_regs -- I can't
> > use trace_nest_level for everything because it's not used by
> > get_stackid, and I can't use raw_tp_nest_level for everything because
> > it's not incremented by kprobes.
> 
> (See above wrt kprobes.)
> 
> > If raw tracepoints were to bump bpf_prog_active, then I could get away
> > with just using that count in these callsites -- I'm reluctant to do
> > that, though, since it would prevent kprobes from ever running inside a
> > raw_tp.  I'd like to retain the ability to (e.g.)
> >   trace.py -K htab_map_update_elem
> > and get some stack traces from at least within raw tracepoints.
> > 
> > That said, as I wrote up this example, bpf_trace_nest_level seems to be
> > wildly misnamed; I should name those after the structure they're
> > protecting...