From: Steven Rostedt <rostedt@goodmis.org> To: Alexei Starovoitov <ast@plumgrid.com> Cc: Ingo Molnar <mingo@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Arnaldo Carvalho de Melo <acme@infradead.org>, Jiri Olsa <jolsa@redhat.com>, Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>, Linux API <linux-api@vger.kernel.org>, Network Development <netdev@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Peter Zijlstra <peterz@infradead.org>, "Eric W. Biederman" <ebiederm@xmission.com> Subject: Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Date: Tue, 10 Feb 2015 19:50:29 -0500 [thread overview] Message-ID: <20150210195029.2092bdd6@grimm.local.home> (raw) In-Reply-To: <CAMEtUuzY_Po=WtFEFg1aqzJ8dEF4rHGcWDsaS44KYgACMNPPgA@mail.gmail.com> On Tue, 10 Feb 2015 16:22:50 -0800 Alexei Starovoitov <ast@plumgrid.com> wrote: > > Yep, and if this becomes a standard, then any change that makes > > trace_pipe different will be reverted. > > I think reading of trace_pipe is widespread. I've heard of a few, but nothing that has broken when something changed. Is it scripts or actual C code? Again, it matters if people complain about the change. > >> But some maintainers think of them as ABI, whereas others > >> are using them freely. imo it's time to remove ambiguity. > > > > I would love to, and have brought this up at Kernel Summit more than > > once with no solution out of it. > > let's try it again at plumbers in august? Well, we need a statement from Linus. And it would be nice if we could also get Ingo involved in the discussion, but he seldom comes to anything but Kernel Summit. > > For now I'm going to drop bpf+tracepoints, since it's so controversial > and go with bpf+syscall and bpf+kprobe only. Probably the safest bet. > > Hopefully by august it will be clear what bpf+kprobes can do > and why I'm excited about bpf+tracepoints in the future. BTW, I wonder if I could make a simple compiler in the kernel that would translate the current ftrace filters into a BPF program, where it could use the program and not use the current filter logic. > >> These tracepoint will receive one or two pointers to important > >> structs only. They will not have TP_printk, assign and fields. > >> The placement and arguments to these new tracepoints > >> will be an ABI. > >> All existing tracepoints are not. > > > > TP_printk() is not really an issue. > > I think it is. The way things are printed is the most > visible part of tracepoints and I suspect maintainers don't > want to add new ones, because internal fields are printed > and users do parse trace_pipe. > Recent discussion about tcp instrumentation was > about adding new tracepoints and a module to print them. > As soon as something like this is in, the next question > 'what we're going to do when more arguments need > to be printed'... I should rephrase that. It's not that it's not an issue, it's just that it hasn't been an issue. the trace_pipe code is slow. The raw_trace_pipe is much faster. Any tool would benefit from using it. I really need to get a library out to help users do such a thing. > > imo the solution is DEFINE_EVENT_BPF that doesn't > print anything and a bpf program to process it. You mean to be completely invisible to ftrace? And the debugfs/tracefs directory? > > > >> it is portable and will run on any kernel. > >> In uapi header we can define: > >> struct task_struct_user { > >> int pid; > >> int prio; > > > > Here's a perfect example of something that looks stable to show to > > user space, but is really a pimple that is hiding cancer. > > > > Lets start with pid. We have name spaces. What pid will be put there? > > We have to show the pid of the name space it is under. > > > > Then we have prio. What is prio in the DEADLINE scheduler. It is rather > > meaningless. Also, it is meaningless in SCHED_OTHER. > > > > Also note that even for SCHED_FIFO, the prio is used differently in the > > kernel than it is in userspace. For the kernel, lower is higher. > > well, ->prio and ->pid are already printed by sched tracepoints > and their meaning depends on scheduler. So users taking that > into account. I know, and Peter hates this. > I'm not suggesting to preserve the meaning of 'pid' semantically > in all cases. That's not what users would want anyway. > I want to allow programs to access important fields and print > them in more generic way than current TP_printk does. > Then exposed ABI of such tracepoint_bpf is smaller than > with current tracepoints. Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? > > eBPF is very flexible, which means it is bound to have someone use it > > in a way you never dreamed of, and that will be what bites you in the > > end (pun intended). > > understood :) > let's start slow then with bpf+syscall and bpf+kprobe only. I'm fine with that. > > >> also not all bpf programs are equal. > >> bpf+existing tracepoint is not ABI > > > > Why not? > > well, because we want to see more tracepoints in the kernel. > We're already struggling to add more. Still, the question is, even with a new "tracepoint" that limits what it shows, there still needs to be something that is guaranteed to always be there. I still don't see how this is different than the current tracepoints. > > >> bpf+new tracepoint is ABI if programs are not using bpf_fetch > > > > How is this different? > > the new ones will be explicit by definition. Who monitors this? > > To give you an example, we thought about scrambling the trace event > > field locations from boot to boot to keep tools from hard coding the > > event layout. This may sound crazy, but developers out there are crazy. > > And if you want to keep them from abusing interfaces, you just need to > > be a bit more crazy than they are. > > that is indeed crazy. the point is understood. > > right now I cannot think of a solid way to prevent abuse > of bpf+tracepoint, so just going to drop it for now. Welcome to our world ;-) > Cool things can be done with bpf+kprobe/syscall already. True. -- Steve
WARNING: multiple messages have this Message-ID (diff)
From: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> To: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> Cc: Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, Namhyung Kim <namhyung-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, Arnaldo Carvalho de Melo <acme-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>, Jiri Olsa <jolsa-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, Masami Hiramatsu <masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org>, Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, Network Development <netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>, "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Subject: Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Date: Tue, 10 Feb 2015 19:50:29 -0500 [thread overview] Message-ID: <20150210195029.2092bdd6@grimm.local.home> (raw) In-Reply-To: <CAMEtUuzY_Po=WtFEFg1aqzJ8dEF4rHGcWDsaS44KYgACMNPPgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> On Tue, 10 Feb 2015 16:22:50 -0800 Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: > > Yep, and if this becomes a standard, then any change that makes > > trace_pipe different will be reverted. > > I think reading of trace_pipe is widespread. I've heard of a few, but nothing that has broken when something changed. Is it scripts or actual C code? Again, it matters if people complain about the change. > >> But some maintainers think of them as ABI, whereas others > >> are using them freely. imo it's time to remove ambiguity. > > > > I would love to, and have brought this up at Kernel Summit more than > > once with no solution out of it. > > let's try it again at plumbers in august? Well, we need a statement from Linus. And it would be nice if we could also get Ingo involved in the discussion, but he seldom comes to anything but Kernel Summit. > > For now I'm going to drop bpf+tracepoints, since it's so controversial > and go with bpf+syscall and bpf+kprobe only. Probably the safest bet. > > Hopefully by august it will be clear what bpf+kprobes can do > and why I'm excited about bpf+tracepoints in the future. BTW, I wonder if I could make a simple compiler in the kernel that would translate the current ftrace filters into a BPF program, where it could use the program and not use the current filter logic. > >> These tracepoint will receive one or two pointers to important > >> structs only. They will not have TP_printk, assign and fields. > >> The placement and arguments to these new tracepoints > >> will be an ABI. > >> All existing tracepoints are not. > > > > TP_printk() is not really an issue. > > I think it is. The way things are printed is the most > visible part of tracepoints and I suspect maintainers don't > want to add new ones, because internal fields are printed > and users do parse trace_pipe. > Recent discussion about tcp instrumentation was > about adding new tracepoints and a module to print them. > As soon as something like this is in, the next question > 'what we're going to do when more arguments need > to be printed'... I should rephrase that. It's not that it's not an issue, it's just that it hasn't been an issue. the trace_pipe code is slow. The raw_trace_pipe is much faster. Any tool would benefit from using it. I really need to get a library out to help users do such a thing. > > imo the solution is DEFINE_EVENT_BPF that doesn't > print anything and a bpf program to process it. You mean to be completely invisible to ftrace? And the debugfs/tracefs directory? > > > >> it is portable and will run on any kernel. > >> In uapi header we can define: > >> struct task_struct_user { > >> int pid; > >> int prio; > > > > Here's a perfect example of something that looks stable to show to > > user space, but is really a pimple that is hiding cancer. > > > > Lets start with pid. We have name spaces. What pid will be put there? > > We have to show the pid of the name space it is under. > > > > Then we have prio. What is prio in the DEADLINE scheduler. It is rather > > meaningless. Also, it is meaningless in SCHED_OTHER. > > > > Also note that even for SCHED_FIFO, the prio is used differently in the > > kernel than it is in userspace. For the kernel, lower is higher. > > well, ->prio and ->pid are already printed by sched tracepoints > and their meaning depends on scheduler. So users taking that > into account. I know, and Peter hates this. > I'm not suggesting to preserve the meaning of 'pid' semantically > in all cases. That's not what users would want anyway. > I want to allow programs to access important fields and print > them in more generic way than current TP_printk does. > Then exposed ABI of such tracepoint_bpf is smaller than > with current tracepoints. Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? > > eBPF is very flexible, which means it is bound to have someone use it > > in a way you never dreamed of, and that will be what bites you in the > > end (pun intended). > > understood :) > let's start slow then with bpf+syscall and bpf+kprobe only. I'm fine with that. > > >> also not all bpf programs are equal. > >> bpf+existing tracepoint is not ABI > > > > Why not? > > well, because we want to see more tracepoints in the kernel. > We're already struggling to add more. Still, the question is, even with a new "tracepoint" that limits what it shows, there still needs to be something that is guaranteed to always be there. I still don't see how this is different than the current tracepoints. > > >> bpf+new tracepoint is ABI if programs are not using bpf_fetch > > > > How is this different? > > the new ones will be explicit by definition. Who monitors this? > > To give you an example, we thought about scrambling the trace event > > field locations from boot to boot to keep tools from hard coding the > > event layout. This may sound crazy, but developers out there are crazy. > > And if you want to keep them from abusing interfaces, you just need to > > be a bit more crazy than they are. > > that is indeed crazy. the point is understood. > > right now I cannot think of a solid way to prevent abuse > of bpf+tracepoint, so just going to drop it for now. Welcome to our world ;-) > Cool things can be done with bpf+kprobe/syscall already. True. -- Steve
next prev parent reply other threads:[~2015-02-11 0:49 UTC|newest] Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top 2015-02-11 0:22 [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov 2015-02-11 0:50 ` Steven Rostedt [this message] 2015-02-11 0:50 ` Steven Rostedt 2015-02-11 9:33 ` Peter Zijlstra 2015-02-11 9:45 ` Peter Zijlstra 2015-02-11 9:45 ` Peter Zijlstra 2015-02-11 10:15 ` Peter Zijlstra 2015-02-11 10:15 ` Peter Zijlstra 2015-02-12 4:58 ` Hekuang 2015-02-12 4:58 ` Hekuang 2015-02-12 4:58 ` Hekuang 2015-02-16 11:26 ` He Kuang 2015-02-16 11:26 ` He Kuang -- strict thread matches above, loose matches on Subject: below -- 2015-02-23 18:55 Alexei Starovoitov 2015-02-23 18:55 ` Alexei Starovoitov 2015-02-14 23:02 Alexei Starovoitov 2015-02-14 23:02 ` Alexei Starovoitov 2015-02-14 22:54 Alexei Starovoitov 2015-02-14 22:54 ` Alexei Starovoitov 2015-02-14 22:48 Alexei Starovoitov 2015-02-14 22:48 ` Alexei Starovoitov 2015-02-11 6:33 Alexei Starovoitov 2015-02-11 6:33 ` Alexei Starovoitov 2015-02-11 12:51 ` Steven Rostedt 2015-02-11 12:51 ` Steven Rostedt 2015-02-11 3:04 Alexei Starovoitov 2015-02-11 4:31 ` Steven Rostedt 2015-02-11 4:31 ` Steven Rostedt 2015-02-10 19:53 Alexei Starovoitov 2015-02-10 21:53 ` Steven Rostedt 2015-02-10 21:53 ` Steven Rostedt 2015-02-11 10:28 ` Peter Zijlstra 2015-02-11 10:28 ` Peter Zijlstra 2015-02-10 6:10 Alexei Starovoitov 2015-02-10 6:10 ` Alexei Starovoitov 2015-02-10 13:05 ` Steven Rostedt 2015-02-10 13:05 ` Steven Rostedt 2015-02-10 5:51 Alexei Starovoitov 2015-02-10 5:51 ` Alexei Starovoitov 2015-02-10 12:27 ` Steven Rostedt 2015-02-10 3:45 [PATCH v3 linux-trace 0/8] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov 2015-02-10 3:45 ` [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov 2015-02-10 3:45 ` Alexei Starovoitov 2015-02-10 4:46 ` Steven Rostedt 2015-02-10 4:46 ` Steven Rostedt 2015-02-10 5:13 ` Steven Rostedt 2015-02-10 5:13 ` Steven Rostedt
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20150210195029.2092bdd6@grimm.local.home \ --to=rostedt@goodmis.org \ --cc=acme@infradead.org \ --cc=ast@plumgrid.com \ --cc=ebiederm@xmission.com \ --cc=jolsa@redhat.com \ --cc=linux-api@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=masami.hiramatsu.pt@hitachi.com \ --cc=mingo@kernel.org \ --cc=namhyung@kernel.org \ --cc=netdev@vger.kernel.org \ --cc=peterz@infradead.org \ --cc=torvalds@linux-foundation.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.