From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Matt Mullins <mmullins@fb.com>
Cc: "daniel@iogearbox.net" <daniel@iogearbox.net>,
"andrii.nakryiko@gmail.com" <andrii.nakryiko@gmail.com>,
Song Liu <songliubraving@fb.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
"rostedt@goodmis.org" <rostedt@goodmis.org>,
"ast@kernel.org" <ast@kernel.org>, Andrew Hall <hall@fb.com>,
"mingo@redhat.com" <mingo@redhat.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
Martin Lau <kafai@fb.com>, Yonghong Song <yhs@fb.com>
Subject: Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data
Date: Thu, 13 Jun 2019 17:55:35 -0700
Message-ID: <CAADnVQLV3n3ozBbz-7dJbYfptDwQtL_zM95Z5rcAF-A72aJ9DA@mail.gmail.com>
In-Reply-To: <9c77657414993332987ca79d4081c4d71cc48d66.camel@fb.com>
On Thu, Jun 13, 2019 at 5:52 PM Matt Mullins <mmullins@fb.com> wrote:
>
> On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> > On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins <mmullins@fb.com> wrote:
> > > >
> > > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > > > they do not increment bpf_prog_active while executing.
> > > >
> > > > This enables three levels of nesting, to support
> > > > - a kprobe or raw tp or perf event,
> > > > - another one of the above that irq context happens to call, and
> > > > - another one in nmi context
> > > > (at most one of which may be a kprobe or perf event).
> > > >
> > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
> >
> > Generally, looks good to me. Two things below:
> >
> > Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
> > instead of the one you currently have?
>
> Ah, yeah, that's probably more reasonable; I haven't managed to come up
> with a scenario where one could hit this without raw tracepoints. I'll
> fix up the nits that've accumulated since v2.
>
> > One more question / clarification: we have __bpf_trace_run() vs trace_call_bpf().
> >
> > Only raw tracepoints can be nested since the rest has the bpf_prog_active per-CPU
> > counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw ones use
> > the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT").
> >
> > 1) I tried to recall and find a rationale for mentioned trace_call_bpf() split in
> > the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely because of
> > performance overhead (and desire to not miss events as a result of nesting)? (This
> > also means we're not protected by bpf_prog_active in all the map ops, of course.)
> > 2) Wouldn't this also mean that we only need to fix the raw tp programs via
> > get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this duplication for
> > the rest which relies upon trace_call_bpf()? I'm probably missing something, but
> > given they have separate pt_regs there, how could they be affected then?
>
> For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
> the _raw_tp variants. However, consider the following nesting:
>
>                                          trace_nest_level  raw_tp_nest_level
>   (kprobe) bpf_perf_event_output                1                  0
>   (raw_tp) bpf_perf_event_output_raw_tp         2                  1
>   (raw_tp) bpf_get_stackid_raw_tp               2                  2
>
> I need to increment a nest level (and ideally increment it only once)
> between the kprobe and the first raw_tp, because they would otherwise
> share the struct perf_sample_data. But I also need to increment a nest
> level between the two raw_tps, since they share the pt_regs -- I can't
> use trace_nest_level for everything because it's not used by
> get_stackid, and I can't use raw_tp_nest_level for everything because
> it's not incremented by kprobes.
>
> If raw tracepoints were to bump bpf_prog_active, then I could get away
> with just using that count in these callsites -- I'm reluctant to do
> that, though, since it would prevent kprobes from ever running inside a
> raw_tp. I'd like to retain the ability to (e.g.)
> trace.py -K htab_map_update_elem
> and get some stack traces from at least within raw tracepoints.
>
> That said, as I wrote up this example, bpf_trace_nest_level seems to be
> wildly misnamed; I should name those after the structure they're
> protecting...

I still don't get what's wrong with the previous approach.
Didn't I manage to convince both of you that perf_sample_data
inside bpf_perf_event_array doesn't have any issue that Daniel brought up?
I think this refcnting approach is inferior.
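
For reference, the nest-level ("refcnting") scheme debated above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the patch itself: `struct scratch`, `struct cpu_state`, `get_scratch()` and `put_scratch()` are hypothetical names, the struct stands in for per-CPU data such as pt_regs or perf_sample_data, and a real kernel version would use per-CPU variables and per-CPU increment/decrement helpers rather than an explicit state pointer. MAX_NEST of 3 follows the three levels named in the patch description.

```c
#include <assert.h>
#include <stddef.h>

/* Three levels, as in the patch description: task, irq, and nmi context. */
#define MAX_NEST 3

/* Hypothetical stand-in for a per-CPU scratch structure
 * (e.g. pt_regs or perf_sample_data). */
struct scratch {
	unsigned long regs[4];
};

/* Hypothetical stand-in for one CPU's state: a nesting counter plus
 * one scratch slot per permitted nesting level. */
struct cpu_state {
	int nest_level;
	struct scratch slots[MAX_NEST];
};

/* Claim the slot for the current nesting level.
 * Returns NULL (and undoes the increment) when nested too deeply. */
static struct scratch *get_scratch(struct cpu_state *cpu)
{
	int level = ++cpu->nest_level;

	if (level > MAX_NEST) {
		cpu->nest_level--;
		return NULL;
	}
	return &cpu->slots[level - 1];
}

/* Release the most recently claimed slot. */
static void put_scratch(struct cpu_state *cpu)
{
	assert(cpu->nest_level > 0);
	cpu->nest_level--;
}
```

Each nested program gets its own slot, so an inner raw tracepoint cannot overwrite the outer one's data; a fourth nested caller gets NULL and has to bail out. The sketch uses a single counter, whereas the v2 patch discussed here keeps two (one per protected structure), which is the duplication being questioned.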
Thread overview: 9+ messages
2019-06-11 21:53 [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data Matt Mullins
2019-06-12 5:00 ` Andrii Nakryiko
2019-06-13 22:47 ` Daniel Borkmann
2019-06-14 0:51 ` Matt Mullins
2019-06-14 0:55 ` Alexei Starovoitov [this message]
2019-06-14 15:16 ` Daniel Borkmann
2019-06-14 14:50 ` Daniel Borkmann
2019-06-14 17:25 ` Matt Mullins
2019-06-15 23:35 ` Alexei Starovoitov