Re: [PATCH v2 00/15] tracing: 'hist' triggers

From: Tom Zanussi <tom.zanussi@linux.intel.com>
To: Alexei Starovoitov <ast@plumgrid.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Andi Kleen <andi@firstfloor.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Arnaldo Carvalho de Melo <acme@infradead.org>
Subject: Re: [PATCH v2 00/15] tracing: 'hist' triggers
Date: Tue, 03 Mar 2015 23:01:38 -0600	[thread overview]
Message-ID: <1425445298.20819.222.camel@picadillo> (raw)
In-Reply-To: <54F66C7C.90301@plumgrid.com>

On Tue, 2015-03-03 at 18:22 -0800, Alexei Starovoitov wrote:
> On 3/3/15 7:47 AM, Tom Zanussi wrote:
> > On Mon, 2015-03-02 at 18:31 -0800, Alexei Starovoitov wrote:
> >> On Mon, Mar 2, 2015 at 5:18 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> 
> >> I'm not proposing to use asm or C for this 'hist->bpf' tool.
> >> Keep proposed 'hist:keys=...:vals=...' syntax and generate
> >> bpf program on the fly based on this string.
> >> So this user tool will take string, generate program, load
> >> and attach it. Everything based on this single string input.
> >> With the examples you mentioned in docs, it's not hard.
> >> It will be more involved than patch 12, but way more generic
> >> and easily extensible when 'hist:keys=...' string would need
> >> to be extended.
> >>
> >
> > Hmm, this seems silly to me - doing all that work just to get back to
> > where we already are.
> 
> your 'hist->bpf' tool could have been first to solve 'bpf hard to use'
> problem and overtime it could have evolved into full dtrace alternative.
> Whereas by adding 'hist:keys=..' parsing to kernel you'll be stuck
> with it and somebody else's dtrace-like tool will supersede it.
> 

I think there's some misunderstanding there - it was never my intent to
create a full dtrace alternative, really the idea was (and has been,
even before there was any such thing as ebpf in the kernel) only to
provide access to some low-hanging fruit via the standard read/write
interfaces people are used to with ftrace.  The dtrace-like tools were
what I thought ebpf was all about, where users could go after exhausting
the possibilities of a simple table-driven first approach to a problem. 

> > I don't think you can start requiring all new tracing functionality to
> > embed a code generator/compiler, or as a result force any particular
> > interface on users.
> 
> force interface on users?!
> I think I've explained enough that bpf would not be a user visible
> interface. It's kernel to user land interface. In my proposal of
> 'hist->bpf' tool users won't even see that bpf is being generated
> and run underneath. Same with dtrace-like scripting. Users will
> see scripting language and won't care how it's compiled and
> executed inside kernel. Multiple different languages and interfaces
> are possible. Including hist-like text that's not a language.
> 

What about the kernel to user land interface that users have already
been asking for, pseudo-files in debugfs/tracefs?  If you can't do that,
then you're forcing interface on users.  

>  > If it's possible to factor out a common
> > infrastructure that can accommodate different user interfaces and
> > preferences, don't you think it makes sense to do that?
> >
> > In any case, outside of the boilerplate tracing infrastructure code, it
> > seems to me that a lot of the code in the main hist trigger patches is
> > code that you'd also need for a tracepoint/bpf interface.  I think we've
> > already agreed that it would be good to work towards being able to share
> > that, so for the next version, I'll see what I can come up with.
> 
> yes. let's see what pieces can be shared. Though I don't want to
> sacrifice long term goals for short term solution.
> In case of bpf_maps, the objective is to share data between
> kernel and user space by single abstraction with different
> implementations underneath. In some use cases kernel programs only
> do lookups into maps, while user space concurrently adding/deleting
> entries (so no allocations from programs)
> In your patch 4 you're adding kernel only access to maps but planning
> to use hash type only and also not exposing these maps to user space
> at all... that defeats bpf_map purpose and you only reusing ~200
> lines of hashtab.c code, but also need to hack it for pre-allocate
> and make config_bpf_syscall a strong dependency for 'hist'. why?
> 'hist' patch set would have been much shorter if you just used
> hashtable.h or rhashtable.h or even rbtree.h (so you don't need
> to do sorting in patch 13). The actual hash insert/walk code
> is tiny and trivial.

Well, I kind of hinted in the cover letter that ultimately I thought the
tracing_map common code would move out of bpf/syscall.c, and thus break
the dependency on bpf_syscall.  I didn't want to spend time actually
doing that before knowing how the general idea would fly, and am glad
now that I didn't.

But the reason I thought of using it in the first place was that I saw a
map implementation specially designed for tracing and thought it would
make sense to reuse it if possible for what I thought was a
complementary tracing tool.  I did find it useful, and using it did
simplify my overall code, but it seems it's more trouble than it's worth
and not particularly welcome, so yeah, I probably will just look at
something else.

Thanks for your input,

Tom

> I've looked through the patches again and they seem to have serious
> bugs when it comes to object life time:
> - patch 5 that adds clear_map is completely broken, calling
> delete_all_elements() by wrapping it in rcu_read_lock() ?!
> - patch 6 adds 'free_elem' callback that is called right from
> htab_map_delete_elem() while we're still under rcu.. ?!
> - patch 12 does
> struct hist_trigger_entry *entry;
> tracing_map_lookup_elem(hist_data->map, next_key, &entry);
> hist_trigger_entry_print(m, hist_data, next_key, entry);
> so when you're storing a pointer inside the map the rcu protection
> of map_lookup protects only the value of 'entry' pointer.
> The actual contents of *entry are not protected by anything,
> so even if you fix 5 and 6, you'll still have severe memory corruptions,
> since nothing guarantees that contents of *entry are valid when
> hist_trigger_entry_print() is trying to access it.
> 
> I think you'd be better off using vanilla hashtable.h
> 
> Thanks
>