linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Checking to see if a bit is _not_ set in a ftrace event filter
@ 2014-12-02  2:19 Theodore Ts'o
  2014-12-02  2:41 ` Steven Rostedt
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2014-12-02  2:19 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML

I was trying to do something like this:

filter="events/writeback/writeback_mark_inode_dirty/filter"
echo "(flags & 2048) && ((state & 2048) == 0)" > $filter

... but that doesn't work.

This works:

echo "flags & 2048" > $filter

But the problem is this:

echo "(state & 2048) == 0" > $filter

The simplest patch to add this would be add a new filter_ops so we
could do this:

echo "(state !& 2048)" > $filter

... but that's pretty ugly.  But adding more general expression
parsing in the ftrace event filter code would be non-trivial, and if
we start trying to make things like "!(state & 2048)" or "(state &
2048) == 0", then at some point some crazy person might request
supporting something like this: "(state ^ flags) == 2048".  :-)

So I guess the main question I want to ask is your opinion about
whether a patch that adds support for the operator "!&" is too ugly to
live?

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  2:19 Checking to see if a bit is _not_ set in a ftrace event filter Theodore Ts'o
@ 2014-12-02  2:41 ` Steven Rostedt
  2014-12-02  3:52   ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2014-12-02  2:41 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: LKML, Alexei Starovoitov

On Mon, 1 Dec 2014 21:19:12 -0500
Theodore Ts'o <tytso@mit.edu> wrote:

> I was trying to do something like this:
> 
> filter="events/writeback/writeback_mark_inode_dirty/filter"
> echo "(flags & 2048) && ((state & 2048) == 0)" > $filter
> 
> ... but that doesn't work.
> 
> This works:
> 
> echo "flags & 2048" > $filter
> 
> But the problem is this:
> 
> echo "(state & 2048) == 0" > $filter
> 
> The simplest patch to add this would be add a new filter_ops so we
> could do this:
> 
> echo "(state !& 2048)" > $filter
> 
> ... but that's pretty ugly.  But adding more general expression
> parsing in the ftrace event filter code would be non-trivial, and if
> we start trying to make things like "!(state & 2048)" or "(state &
> 2048) == 0", then at some point some crazy person might request
> supporting something like this: "(state ^ flags) == 2048".  :-)
> 
> So I guess the main question I want to ask is your opinion about
> whether a patch that adds support for the operator "!&" is too ugly to
> live?
> 

Yeah, I don't want to add some bastardization compare that we'll be
stuck with till the end of time. Either we modify the tree walk to
handle values (it shouldn't be too difficult, but it wont be trivial),
or we wait till eBPF is up and running as the trace filter replacement
and that should be able to handle this much better.

-- Steve

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  2:41 ` Steven Rostedt
@ 2014-12-02  3:52   ` Alexei Starovoitov
  2014-12-02  3:57     ` Steven Rostedt
  2014-12-02  5:04     ` Theodore Ts'o
  0 siblings, 2 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2014-12-02  3:52 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Theodore Ts'o, LKML

On Mon, Dec 1, 2014 at 6:41 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 1 Dec 2014 21:19:12 -0500
> Theodore Ts'o <tytso@mit.edu> wrote:
>
>> I was trying to do something like this:
>>
>> filter="events/writeback/writeback_mark_inode_dirty/filter"
>> echo "(flags & 2048) && ((state & 2048) == 0)" > $filter
>>
>> ... but that doesn't work.
>>
>> This works:
>>
>> echo "flags & 2048" > $filter
>>
>> But the problem is this:
>>
>> echo "(state & 2048) == 0" > $filter
>>
>> The simplest patch to add this would be add a new filter_ops so we
>> could do this:
>>
>> echo "(state !& 2048)" > $filter
>>
>> ... but that's pretty ugly.  But adding more general expression
>> parsing in the ftrace event filter code would be non-trivial, and if
>> we start trying to make things like "!(state & 2048)" or "(state &
>> 2048) == 0", then at some point some crazy person might request
>> supporting something like this: "(state ^ flags) == 2048".  :-)
>>
>> So I guess the main question I want to ask is your opinion about
>> whether a patch that adds support for the operator "!&" is too ugly to
>> live?
>>
>
> Yeah, I don't want to add some bastardization compare that we'll be
> stuck with till the end of time. Either we modify the tree walk to
> handle values (it shouldn't be too difficult, but it wont be trivial),
> or we wait till eBPF is up and running as the trace filter replacement
> and that should be able to handle this much better.

yeah. teaching tree walk to do (state & 2048) == 0 is not trivial,
since it doesn't have a concept of value of expression.
Doing !(state & 2048) is probably a bit easier, since there
are hacks for 'not' already.
Another alternative is, of course, to wait little bit for
eBPF+tracing to land ;) All the core pieces
(verifier, bpf syscall, maps) are in net-next already,
so in the next dev cycle we can get tracing bits
reviewed/tested/merged without cross-tree conflicts.
All parsing and code generation will be done by user space,
so all complex expression will be supported.
It will change the workflow for folks who use 'echo expr > filter'
directly. trace-cmd -e -f can be made to work transparently
with new features.
Ted, I don't see 'writeback_mark_inode_dirty' event
in the tree. Some new stuff?
What kind of post-filtering are you doing with this event?
Just visually checking that trace is sane or the trace output
is fed into other tools? Are you trying to aggregate or
correlate multiple events (may be based on 'ino') ?
One of the goals for eBPF+tracing is to minimize
additions of new tracepoints. Right now we already
have a ton of them. events/ext4.h is ~2500 lines.
Some of them look like hooks for in-production
debugging of a function at a time. Sort of like poor's man
kprobe/kretprobe.
With eBPF we should be able to avoid adding
trace_func_enter(), trace_func_exit() to so many func.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  3:52   ` Alexei Starovoitov
@ 2014-12-02  3:57     ` Steven Rostedt
  2014-12-02  3:58       ` Steven Rostedt
  2014-12-02  4:51       ` Steven Rostedt
  2014-12-02  5:04     ` Theodore Ts'o
  1 sibling, 2 replies; 10+ messages in thread
From: Steven Rostedt @ 2014-12-02  3:57 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Theodore Ts'o, LKML

On Mon, 1 Dec 2014 19:52:11 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

> yeah. teaching tree walk to do (state & 2048) == 0 is not trivial,
> since it doesn't have a concept of value of expression.
> Doing !(state & 2048) is probably a bit easier, since there
> are hacks for 'not' already.

That's exactly what I was looking at doing. If I can get something
working without spending more than an hour or two, I may even push it
for this release cycle.

-- Steve


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  3:57     ` Steven Rostedt
@ 2014-12-02  3:58       ` Steven Rostedt
  2014-12-02  4:51       ` Steven Rostedt
  1 sibling, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2014-12-02  3:58 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Theodore Ts'o, LKML

On Mon, 1 Dec 2014 22:57:25 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Dec 2014 19:52:11 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
> 
> > yeah. teaching tree walk to do (state & 2048) == 0 is not trivial,
> > since it doesn't have a concept of value of expression.
> > Doing !(state & 2048) is probably a bit easier, since there
> > are hacks for 'not' already.
> 
> That's exactly what I was looking at doing. If I can get something
> working without spending more than an hour or two, I may even push it
> for this release cycle.

Clarification; When I said "release cycle", I meant the next merge
window. Not for 3.18.

-- Steve

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  3:57     ` Steven Rostedt
  2014-12-02  3:58       ` Steven Rostedt
@ 2014-12-02  4:51       ` Steven Rostedt
  1 sibling, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2014-12-02  4:51 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Theodore Ts'o, LKML

On Mon, 1 Dec 2014 22:57:25 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Dec 2014 19:52:11 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
> 
> > yeah. teaching tree walk to do (state & 2048) == 0 is not trivial,
> > since it doesn't have a concept of value of expression.
> > Doing !(state & 2048) is probably a bit easier, since there
> > are hacks for 'not' already.
> 
> That's exactly what I was looking at doing. If I can get something
> working without spending more than an hour or two, I may even push it
> for this release cycle.

Actually, this was rather trivial. It took me longer because I was
doing this half asleep, but it seems to work.

Note, this is a major beta patch. It probably doesn't work with ANDS
and ORS, that is !(x == 1 && y == 2) probably doesn't work, but this
should: !(x == 1), and so should !(state & 2048).

I'll clean this up and get it ready for 3.19.

Cheers!

-- Steve

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 7a8c1528e141..aee7b743c9df 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -37,6 +37,7 @@ enum filter_op_ids
 {
 	OP_OR,
 	OP_AND,
+	OP_NOT,
 	OP_GLOB,
 	OP_NE,
 	OP_EQ,
@@ -59,6 +60,7 @@ struct filter_op {
 static struct filter_op filter_ops[] = {
 	{ OP_OR,	"||",		1 },
 	{ OP_AND,	"&&",		2 },
+	{ OP_NOT,	"!",		3 },
 	{ OP_GLOB,	"~",		4 },
 	{ OP_NE,	"!=",		4 },
 	{ OP_EQ,	"==",		4 },
@@ -166,7 +168,7 @@ static int filter_pred_##type(struct filter_pred *pred, void *event)	\
 		break;							\
 	}								\
 									\
-	return match;							\
+	return match == !pred->not;					\
 }
 
 #define DEFINE_EQUALITY_PRED(size)					\
@@ -565,7 +567,7 @@ int filter_match_preds(struct event_filter *filter, void *rec)
 	data.preds = preds = rcu_dereference_sched(filter->preds);
 	ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data);
 	WARN_ON(ret);
-	return data.match;
+	return !!data.match;
 }
 EXPORT_SYMBOL_GPL(filter_match_preds);
 
@@ -1028,7 +1030,7 @@ static int init_pred(struct filter_parse_state *ps,
 	}
 
 	if (pred->op == OP_NE)
-		pred->not = 1;
+		pred->not ^= 1;
 
 	pred->fn = fn;
 	return 0;
@@ -1590,6 +1592,16 @@ static int replace_preds(struct ftrace_event_call *call,
 			continue;
 		}
 
+		if (elt->op == OP_NOT) {
+			if (!n_preds) {
+				err = -EINVAL;
+				goto fail;
+			}
+			if (!dry_run)
+				filter->preds[n_preds - 1].not ^= 1;
+			continue;
+		}
+
 		if (WARN_ON(n_preds++ == MAX_FILTER_PRED)) {
 			parse_error(ps, FILT_ERR_TOO_MANY_PREDS, 0);
 			err = -ENOSPC;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  3:52   ` Alexei Starovoitov
  2014-12-02  3:57     ` Steven Rostedt
@ 2014-12-02  5:04     ` Theodore Ts'o
  2014-12-02  5:58       ` Alexei Starovoitov
  1 sibling, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2014-12-02  5:04 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Steven Rostedt, LKML

On Mon, Dec 01, 2014 at 07:52:11PM -0800, Alexei Starovoitov wrote:
> Ted, I don't see 'writeback_mark_inode_dirty' event
> in the tree. Some new stuff?

Yep, see:

http://thread.gmane.org/gmane.comp.file-systems.ext4/47092

Except instead of the mini-script which I gave in the above URL, I
wanted to do additional filtering.  The current hack which I am using
instead of:

echo "flags == 2048" > events/writeback/writeback_mark_inode_dirty/filter

is:

echo "(flags == 2048) && (state < 2048)" > events/writeback/writeback_mark_inode_dirty/filter

... but this relies on the fact that all of the i_state bits that I
care about are at positions 1 << 10 and below.  i.e., it's a terrible
hack.

> What kind of post-filtering are you doing with this event?
> Just visually checking that trace is sane or the trace output
> is fed into other tools? Are you trying to aggregate or
> correlate multiple events (may be based on 'ino') ?

I plan to write some tools that agregate based on 'ino', but I haven't
yet.

> It will change the workflow for folks who use 'echo expr > filter'
> directly. trace-cmd -e -f can be made to work transparently
> with new features.

This will break a bunch of **really** useful scripts found at:

	https://github.com/brendangregg/perf-tools.git

OTOH, Brendan will probably will be able to rewrite them to take
advantage of the new interfaces, and I'm sure he'll appreciate the
power of being able to use eBPF.  :-)

> One of the goals for eBPF+tracing is to minimize
> additions of new tracepoints. Right now we already
> have a ton of them. events/ext4.h is ~2500 lines.
> Some of them look like hooks for in-production
> debugging of a function at a time. Sort of like poor's man
> kprobe/kretprobe.

Well, except that kprobe and kretprobe can't give me the arguments
passed into the function (unless you compile with full -g debugging
info enabled and bloat the object files and compilation time by a
factor of 10 --- which I can't stand and why I use ftrace instead of
systemtap :-)

> With eBPF we should be able to avoid adding
> trace_func_enter(), trace_func_exit() to so many func.

If eBPF can solve the ability to be able to get at the critical
function variables without making the compiled kernel take 10x the
disk space and time to compile (mostly due to the time to write out
the !@#!@?! bloated object files), that would be great.  My
understanding though is that this fundamentally requires improved
DWARF compression and structure information deduping, which the
systemtap folks promised would be coming in improved compiler
toolchains many years ago, but as far as I know has never
materialized.  :-(

But that's why I have the trace_func_enter() and trace_func_exit()
calls; I need to be able to get do various run-time debugging without
needing to recompile the kernel and without forcing all of my
development builds to have full debug info.

					- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  5:04     ` Theodore Ts'o
@ 2014-12-02  5:58       ` Alexei Starovoitov
  2014-12-02 12:14         ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2014-12-02  5:58 UTC (permalink / raw)
  To: Theodore Ts'o, Alexei Starovoitov, Steven Rostedt, LKML

On Mon, Dec 1, 2014 at 9:04 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 01, 2014 at 07:52:11PM -0800, Alexei Starovoitov wrote:
>
>> It will change the workflow for folks who use 'echo expr > filter'
>> directly. trace-cmd -e -f can be made to work transparently
>> with new features.
>
> This will break a bunch of **really** useful scripts found at:

I didn't mean that new features will break old. All existing filters
will stay as-is.

>> One of the goals for eBPF+tracing is to minimize
>> additions of new tracepoints. Right now we already
>> have a ton of them. events/ext4.h is ~2500 lines.
>> Some of them look like hooks for in-production
>> debugging of a function at a time. Sort of like poor's man
>> kprobe/kretprobe.
>
> Well, except that kprobe and kretprobe can't give me the arguments
> passed into the function (unless you compile with full -g debugging
> info enabled and bloat the object files and compilation time by a
> factor of 10 --- which I can't stand and why I use ftrace instead of
> systemtap :-)

well, dwarf is only needed if you have > 6 function
arguments in x64, since x64 ABI promotes scalars
to 64-bit and passes first six args in registers, so it's
trivial to know where arguments are even without debug info.
Similar situation is on most 64-bit risc archs.
dwarf is needed when one wants to see a value of
local C variable somewhere in the middle of the function,
but it's not common and rarely works in practice, since
var-tracking is not easy for compilers.

Another reason people say that dwarf is 'must have'
is to access struct fields. The 'inode->i_state' dereference
would require stap to use dwarf to know the offset of 'i_state'.
For eBPF we don't need debug info in such case,
since C compiler does it for us.
We can just do 'offsetof(typeof(*inode), i_state)'
as part of eBPF C program and llvm will figure out
correct field offset during compile time of eBPF program.
(for this to work, one need to have matching kernel
headers).

> If eBPF can solve the ability to be able to get at the critical
> function variables ...

If 'critical' function variables are arguments or variables
accessible via pointer walking (like inode->i_sb->s_fs_info)
then eBPF with kernel headers will do the job.
(For arguments only no headers needed)

> But that's why I have the trace_func_enter() and trace_func_exit()
> calls; I need to be able to get do various run-time debugging without
> needing to recompile the kernel and without forcing all of my
> development builds to have full debug info.

Understood. Thanks for sharing!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02  5:58       ` Alexei Starovoitov
@ 2014-12-02 12:14         ` Theodore Ts'o
  2014-12-03  0:02           ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2014-12-02 12:14 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Steven Rostedt, LKML

On Mon, Dec 01, 2014 at 09:58:32PM -0800, Alexei Starovoitov wrote:
> well, dwarf is only needed if you have > 6 function
> arguments in x64, since x64 ABI promotes scalars
> to 64-bit and passes first six args in registers, so it's
> trivial to know where arguments are even without debug info.

Well, if eBPF can get more information without DWARF, that's great.
I'll use whatever I can get, so long as I can do it without DWARF.  :-)

So if eBPF can do a better job without the presence of debug info, and
even a better job with debug info (i.e., distro kernels where it's
worth the time to spend over an hour compiling a distro release rpm or
deb package), that would clearly be the best of both worlds.  I
suspect that developers like me will want to continue to add
tracepoints so we can do what we need to get done w/o debug info,
though.

(To me that was always the reason why I ignored systemtap; it focused
too much on the need of the release customer --- which makes sense,
since nearly all of the stap developers worked for Red Hat --- but it
ignored the needs of kernel developers.  So they shouldn't have been
surprised that most kernel developers returned the favor.  :-)

> Similar situation is on most 64-bit risc archs.
> dwarf is needed when one wants to see a value of
> local C variable somewhere in the middle of the function,
> but it's not common and rarely works in practice, since
> var-tracking is not easy for compilers.

Yes; fortunately, we can make that available via adding new
tracepoints --- which in general is why I'm in favor of not relying on
the ability to grab information via pure kprobe and kretprobe.  If we
can get the first six arguments on x86_64, then if I didn't think to
add a tracepoint, I can do some emergency debugging w/o needing to
recompile, which is a win.  (But since I also do a lot of debug runs
using 32-bit kernels, it would be nice to have things that worked
there as well, which is why in general I'll add tracepoints once I
find that it might be useful to trace a particular function.)

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Checking to see if a bit is _not_ set in a ftrace event filter
  2014-12-02 12:14         ` Theodore Ts'o
@ 2014-12-03  0:02           ` Alexei Starovoitov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2014-12-03  0:02 UTC (permalink / raw)
  To: Theodore Ts'o, Alexei Starovoitov, Steven Rostedt, LKML

On Tue, Dec 2, 2014 at 4:14 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> Well, if eBPF can get more information without DWARF, that's great.
> I'll use whatever I can get, so long as I can do it without DWARF.  :-)

eBPF is just an engine. Anyone can develop a userspace bits that
will use debug info and/or their favorite language (stap, ktap, dtrace
or any other) when right hooks are in place on the kernel side.
I'm focusing on restricted C as a way to program it and
environments without debug info.

> recompile, which is a win.  (But since I also do a lot of debug runs
> using 32-bit kernels, it would be nice to have things that worked
> there as well, which is why in general I'll add tracepoints once I
> find that it might be useful to trace a particular function.)

Understood. i386 is difficult, since arg passing is using
combination of registers(eax,edx,ecx) and stack in
non-obvious way, so some trick is needed to get it right
without debug info. I'm still trying to solve this one.
That's for programs attached to kprobes.
Programs that attach to tracepoints will see arguments
in a normal way on all archs.
Anyway, one step at a time. Easy things first.

As one of the crazy things for the future I was thinking
about mini-tracepoints. Instead of generating full
ftrace_raw_* bodies, generate minimal stubs
to collect tracepoint arguments only and invoke programs.
In other words move responsibility of TP_STRUCT__entry,
TP_fast_assign and TP_printk into the bpf program.
On x64 all we need is single stub for all mini-tracepoints.
For 32-bit archs all mini-tracepoint stubs will be different
to accomodate 32/64-bit argument passing.
That will save quite a bit from .text and hopefully more
people will enable such tracepoints in production.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-12-03  0:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-02  2:19 Checking to see if a bit is _not_ set in a ftrace event filter Theodore Ts'o
2014-12-02  2:41 ` Steven Rostedt
2014-12-02  3:52   ` Alexei Starovoitov
2014-12-02  3:57     ` Steven Rostedt
2014-12-02  3:58       ` Steven Rostedt
2014-12-02  4:51       ` Steven Rostedt
2014-12-02  5:04     ` Theodore Ts'o
2014-12-02  5:58       ` Alexei Starovoitov
2014-12-02 12:14         ` Theodore Ts'o
2014-12-03  0:02           ` Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).