All of lore.kernel.org
 help / color / mirror / Atom feed
* disabling group leader perf_event
@ 2010-09-06  9:12 Avi Kivity
  2010-09-06 11:24 ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-06  9:12 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel

  If I read the code correctly, disabling a group leader perf_event will 
disable the entire group.

Is this correct?

If so, how can I disable just the event itself?  Can I allocate a dummy 
invent for the group leader so I can enable and disable each perf_event 
in the group individually?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06  9:12 disabling group leader perf_event Avi Kivity
@ 2010-09-06 11:24 ` Peter Zijlstra
  2010-09-06 11:34   ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2010-09-06 11:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-perf-users, linux-kernel, Ingo Molnar

On Mon, 2010-09-06 at 12:12 +0300, Avi Kivity wrote:
> If I read the code correctly, disabling a group leader perf_event will 
> disable the entire group.
> 
> Is this correct?

Yeah, pretty much.

> If so, how can I disable just the event itself?  Can I allocate a dummy 
> invent for the group leader so I can enable and disable each perf_event 
> in the group individually?

Which makes me wonder why you use groups in the first place.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 11:24 ` Peter Zijlstra
@ 2010-09-06 11:34   ` Avi Kivity
  2010-09-06 11:54     ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 11:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-perf-users, linux-kernel, Ingo Molnar

  On 09/06/2010 02:24 PM, Peter Zijlstra wrote:
> On Mon, 2010-09-06 at 12:12 +0300, Avi Kivity wrote:
>> If I read the code correctly, disabling a group leader perf_event will
>> disable the entire group.
>>
>> Is this correct?
> Yeah, pretty much.

Well, I never liked group_leader style APIs.  I like different types for 
the container and the contained.  But such is not unix.

>> If so, how can I disable just the event itself?  Can I allocate a dummy
>> invent for the group leader so I can enable and disable each perf_event
>> in the group individually?
> Which makes me wonder why you use groups in the first place.

Basically, to read() all events in one go.  I have many of them.

My current problem is that I have an event (kvm_exit) which I want to 
drill down by looking at a field (exit_reason).  So I create lots of 
separate perf_events with a filter for each reason: 
kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters 
are fairly slow (can have ~60 such events on AMD), so I want to make 
this drill-down optional.

Current plan is to have a group for the basic events and another group 
for the drilldown events (each per-cpu), and activate the drilldown 
group on user request.  perf will be able to schedule both groups 
concurrently since they only contain tracepoints, yes?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 11:34   ` Avi Kivity
@ 2010-09-06 11:54     ` Peter Zijlstra
  2010-09-06 11:58       ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2010-09-06 11:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-perf-users, linux-kernel, Ingo Molnar

On Mon, 2010-09-06 at 14:34 +0300, Avi Kivity wrote:
> On 09/06/2010 02:24 PM, Peter Zijlstra wrote:
> > On Mon, 2010-09-06 at 12:12 +0300, Avi Kivity wrote:
> >> If I read the code correctly, disabling a group leader perf_event will
> >> disable the entire group.
> >>
> >> Is this correct?
> > Yeah, pretty much.
> 
> Well, I never liked group_leader style APIs.  I like different types for 
> the container and the contained.  But such is not unix.
> 
> >> If so, how can I disable just the event itself?  Can I allocate a dummy
> >> invent for the group leader so I can enable and disable each perf_event
> >> in the group individually?
> > Which makes me wonder why you use groups in the first place.
> 
> Basically, to read() all events in one go.  I have many of them.
> 
> My current problem is that I have an event (kvm_exit) which I want to 
> drill down by looking at a field (exit_reason).  So I create lots of 
> separate perf_events with a filter for each reason: 
> kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters 
> are fairly slow (can have ~60 such events on AMD), so I want to make 
> this drill-down optional.

Yeah, filters suck.

So what you're basically trying to do is create some histogram of
exit_reason?

Being able to make histograms in-kernel has been on the todo list for a
long while, its just that I never could come up with a sane
interface.. :/

> Current plan is to have a group for the basic events and another group 
> for the drilldown events (each per-cpu), and activate the drilldown 
> group on user request.  perf will be able to schedule both groups 
> concurrently since they only contain tracepoints, yes?

More or less, yeah (the scheduling of software and hardware events isn't
properly separated atm -- am working on that). Software events have no
scheduling constraints and should always get scheduled.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 11:54     ` Peter Zijlstra
@ 2010-09-06 11:58       ` Avi Kivity
  2010-09-06 12:29         ` Peter Zijlstra
  2010-09-06 12:43         ` Ingo Molnar
  0 siblings, 2 replies; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 11:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-perf-users, linux-kernel, Ingo Molnar

  On 09/06/2010 02:54 PM, Peter Zijlstra wrote:
>
>> Basically, to read() all events in one go.  I have many of them.
>>
>> My current problem is that I have an event (kvm_exit) which I want to
>> drill down by looking at a field (exit_reason).  So I create lots of
>> separate perf_events with a filter for each reason:
>> kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters
>> are fairly slow (can have ~60 such events on AMD), so I want to make
>> this drill-down optional.
> Yeah, filters suck.

Any idea why?  I saw nothing obvious in the code, except that there is 
lots of it.

> So what you're basically trying to do is create some histogram of
> exit_reason?

Yes, exactly.

> Being able to make histograms in-kernel has been on the todo list for a
> long while, its just that I never could come up with a sane
> interface.. :/

Interesting, I thought it was just me.

One option is to keep the existing filter interface, but recognize those 
cases and optimize the implementation.  Sort of like a compiler can 
optimize a large dense switch statement to a jump table.

>> Current plan is to have a group for the basic events and another group
>> for the drilldown events (each per-cpu), and activate the drilldown
>> group on user request.  perf will be able to schedule both groups
>> concurrently since they only contain tracepoints, yes?
> More or less, yeah (the scheduling of software and hardware events isn't
> properly separated atm -- am working on that). Software events have no
> scheduling constraints and should always get scheduled.

Great, thanks.

(one other issue - right now I'm using cpu events.  If I switch to task 
events, I lose events generated by workqueues, yes?)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 11:58       ` Avi Kivity
@ 2010-09-06 12:29         ` Peter Zijlstra
  2010-09-06 12:40           ` Ingo Molnar
  2010-09-06 12:49           ` Avi Kivity
  2010-09-06 12:43         ` Ingo Molnar
  1 sibling, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2010-09-06 12:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-perf-users, linux-kernel, Ingo Molnar

On Mon, 2010-09-06 at 14:58 +0300, Avi Kivity wrote:
> On 09/06/2010 02:54 PM, Peter Zijlstra wrote:
> >
> >> Basically, to read() all events in one go.  I have many of them.
> >>
> >> My current problem is that I have an event (kvm_exit) which I want to
> >> drill down by looking at a field (exit_reason).  So I create lots of
> >> separate perf_events with a filter for each reason:
> >> kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters
> >> are fairly slow (can have ~60 such events on AMD), so I want to make
> >> this drill-down optional.
> > Yeah, filters suck.
> 
> Any idea why?  I saw nothing obvious in the code, except that there is 
> lots of it.

It filters after it does all the hard work of creating the trace event,
instead of before.

> > So what you're basically trying to do is create some histogram of
> > exit_reason?
> 
> Yes, exactly.

One thing I thought of, you can use the unfiltered kvm_exit event as
leader, that will give you the total number of events, which, esp. when
you create partial histograms, is useful to figure out how much you
missed.

You can individually disable/enable !leader siblings.

> (one other issue - right now I'm using cpu events.  If I switch to task 
> events, I lose events generated by workqueues, yes?)

Right, those have their own tasks.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:29         ` Peter Zijlstra
@ 2010-09-06 12:40           ` Ingo Molnar
  2010-09-06 13:16             ` Steven Rostedt
  2010-09-06 12:49           ` Avi Kivity
  1 sibling, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2010-09-06 12:40 UTC (permalink / raw)
  To: Peter Zijlstra, Tom Zanussi, Frédéric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo
  Cc: Avi Kivity, linux-perf-users, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, 2010-09-06 at 14:58 +0300, Avi Kivity wrote:
> > On 09/06/2010 02:54 PM, Peter Zijlstra wrote:
> > >
> > >> Basically, to read() all events in one go.  I have many of them.
> > >>
> > >> My current problem is that I have an event (kvm_exit) which I want to
> > >> drill down by looking at a field (exit_reason).  So I create lots of
> > >> separate perf_events with a filter for each reason:
> > >> kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters
> > >> are fairly slow (can have ~60 such events on AMD), so I want to make
> > >> this drill-down optional.
> > > Yeah, filters suck.
> > 
> > Any idea why?  I saw nothing obvious in the code, except that there 
> > is lots of it.
> 
> It filters after it does all the hard work of creating the trace 
> event, instead of before.

Btw., that's an implementational property of filters that would be nice 
to fix.

If it's not doable within CPP then outside of it.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 11:58       ` Avi Kivity
  2010-09-06 12:29         ` Peter Zijlstra
@ 2010-09-06 12:43         ` Ingo Molnar
  2010-09-06 12:45           ` Avi Kivity
  1 sibling, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2010-09-06 12:43 UTC (permalink / raw)
  To: Avi Kivity, Tom Zanussi, Frédéric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, linux-perf-users, linux-kernel


* Avi Kivity <avi@redhat.com> wrote:

> > Being able to make histograms in-kernel has been on the todo list 
> > for a long while, its just that I never could come up with a sane 
> > interface.. :/
> 
> Interesting, I thought it was just me.
> 
> One option is to keep the existing filter interface, but recognize 
> those cases and optimize the implementation.  Sort of like a compiler 
> can optimize a large dense switch statement to a jump table.

Yes. The filter engine is a safe, in-kernel interpreted language in the 
making. The C syntax was chosen because it's close to the heart of every 
kernel developer.

It might make sense to bring this concept a few steps further. Looks 
rather complex but also rather cool ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:43         ` Ingo Molnar
@ 2010-09-06 12:45           ` Avi Kivity
  2010-09-06 12:59             ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 12:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tom Zanussi, Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

  On 09/06/2010 03:43 PM, Ingo Molnar wrote:
> Yes. The filter engine is a safe, in-kernel interpreted language in the
> making. The C syntax was chosen because it's close to the heart of every
> kernel developer.
>
> It might make sense to bring this concept a few steps further. Looks
> rather complex but also rather cool ...

Is this a roundabout way of saying "jit"?

If so, I'm all for it.  I could use one myself.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:29         ` Peter Zijlstra
  2010-09-06 12:40           ` Ingo Molnar
@ 2010-09-06 12:49           ` Avi Kivity
  1 sibling, 0 replies; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 12:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-perf-users, linux-kernel, Ingo Molnar

  On 09/06/2010 03:29 PM, Peter Zijlstra wrote:
>
>>> So what you're basically trying to do is create some histogram of
>>> exit_reason?
>> Yes, exactly.
> One thing I thought of, you can use the unfiltered kvm_exit event as
> leader, that will give you the total number of events, which, esp. when
> you create partial histograms, is useful to figure out how much you
> missed.

Yeah - and the aggregate kvm_exit count is the single most important 
metric for kvm.  We do count it regardless of whether we want the 
individual exit_reasons or not.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:45           ` Avi Kivity
@ 2010-09-06 12:59             ` Ingo Molnar
  2010-09-06 13:41               ` Pekka Enberg
  2010-09-06 14:57               ` Avi Kivity
  0 siblings, 2 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-06 12:59 UTC (permalink / raw)
  To: Avi Kivity, Pekka Enberg
  Cc: Tom Zanussi, Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel


* Avi Kivity <avi@redhat.com> wrote:

>  On 09/06/2010 03:43 PM, Ingo Molnar wrote:
> >
> > Yes. The filter engine is a safe, in-kernel interpreted language in 
> > the making. The C syntax was chosen because it's close to the heart 
> > of every kernel developer.
> >
> > It might make sense to bring this concept a few steps further. Looks 
> > rather complex but also rather cool ...
> 
> Is this a roundabout way of saying "jit"?

Partly. I'm not sure we want to actually upload programs in bytecode 
form. ASCII is just fine - just like a .gz Javascript is fine for web 
apps. (and in most cases compresses down better than the bytecode 
equivalent)

So a clear language (the simpler initially the better) plus an in-kernel 
compiler.

This could be used for far more than just instrumentation: IMO security 
policies could be expressed in such a way. (Simplified, they are quite 
similar to filters installed on syscall entry/exit, with the ability of 
the filter to influence whether the syscall is performed.)

> If so, I'm all for it.  I could use one myself.

Good ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:40           ` Ingo Molnar
@ 2010-09-06 13:16             ` Steven Rostedt
  2010-09-06 16:42               ` Tom Zanussi
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2010-09-06 13:16 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Tom Zanussi,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo
  Cc: Avi Kivity, linux-perf-users, linux-kernel

Sorry for top post, sending via my android phone. Today's a US holiday.


We can filter before the work, if we expose the parameters. Currently we filter what goes into the buffer and there are several cases where we don't know the result before the work.

If we also expose the parameters of a TRACE_EVENT, then we can filter on them before the work.

-- Steve

"Ingo Molnar" <mingo@elte.hu> wrote:

>
>* Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Mon, 2010-09-06 at 14:58 +0300, Avi Kivity wrote:
>> > On 09/06/2010 02:54 PM, Peter Zijlstra wrote:
>> > >
>> > >> Basically, to read() all events in one go.  I have many of them.
>> > >>
>> > >> My current problem is that I have an event (kvm_exit) which I want to
>> > >> drill down by looking at a field (exit_reason).  So I create lots of
>> > >> separate perf_events with a filter for each reason:
>> > >> kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters
>> > >> are fairly slow (can have ~60 such events on AMD), so I want to make
>> > >> this drill-down optional.
>> > > Yeah, filters suck.
>> > 
>> > Any idea why?  I saw nothing obvious in the code, except that there 
>> > is lots of it.
>> 
>> It filters after it does all the hard work of creating the trace 
>> event, instead of before.
>
>Btw., that's an implementational property of filters that would be nice 
>to fix.
>
>If it's not doable within CPP then outside of it.
>
>	Ingo

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:59             ` Ingo Molnar
@ 2010-09-06 13:41               ` Pekka Enberg
  2010-09-06 13:54                 ` Ingo Molnar
  2010-09-06 14:57               ` Avi Kivity
  1 sibling, 1 reply; 53+ messages in thread
From: Pekka Enberg @ 2010-09-06 13:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

On 09/06/2010 03:43 PM, Ingo Molnar wrote:
>> > Yes. The filter engine is a safe, in-kernel interpreted language in
>> > the making. The C syntax was chosen because it's close to the heart
>> > of every kernel developer.
>> >
* Avi Kivity <avi@redhat.com> wrote:
>> > It might make sense to bring this concept a few steps further. Looks
>> > rather complex but also rather cool ...
>>
>> Is this a roundabout way of saying "jit"?

On Mon, Sep 6, 2010 at 3:59 PM, Ingo Molnar <mingo@elte.hu> wrote:
> Partly. I'm not sure we want to actually upload programs in bytecode
> form. ASCII is just fine - just like a .gz Javascript is fine for web
> apps. (and in most cases compresses down better than the bytecode
> equivalent)
>
> So a clear language (the simpler initially the better) plus an in-kernel
> compiler.
>
> This could be used for far more than just instrumentation: IMO security
> policies could be expressed in such a way. (Simplified, they are quite
> similar to filters installed on syscall entry/exit, with the ability of
> the filter to influence whether the syscall is performed.)

Filter engine? I've never heard of it before. Where does it live?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 13:41               ` Pekka Enberg
@ 2010-09-06 13:54                 ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-06 13:54 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel


* Pekka Enberg <penberg@kernel.org> wrote:

> On 09/06/2010 03:43 PM, Ingo Molnar wrote:
> >> > Yes. The filter engine is a safe, in-kernel interpreted language in
> >> > the making. The C syntax was chosen because it's close to the heart
> >> > of every kernel developer.
> >> >
> * Avi Kivity <avi@redhat.com> wrote:
> >> > It might make sense to bring this concept a few steps further. Looks
> >> > rather complex but also rather cool ...
> >>
> >> Is this a roundabout way of saying "jit"?
> 
> On Mon, Sep 6, 2010 at 3:59 PM, Ingo Molnar <mingo@elte.hu> wrote:
> > Partly. I'm not sure we want to actually upload programs in bytecode
> > form. ASCII is just fine - just like a .gz Javascript is fine for web
> > apps. (and in most cases compresses down better than the bytecode
> > equivalent)
> >
> > So a clear language (the simpler initially the better) plus an in-kernel
> > compiler.
> >
> > This could be used for far more than just instrumentation: IMO security
> > policies could be expressed in such a way. (Simplified, they are quite
> > similar to filters installed on syscall entry/exit, with the ability of
> > the filter to influence whether the syscall is performed.)
> 
> Filter engine? I've never heard of it before. Where does it live?

It's in kernel/trace/trace_events_filter.c, and currently bound to trace 
events - but that is just an implementational detail really.

It allows us to do things like:

   perf record -a -g -e sched:sched_switch \
              --filter "prev_pid == 0 && prev_prio == 120" sleep 10

This profiles context switches that go into the idle task from 
SCHED_NORMAL tasks (and excludes all other types of context switches). 
I.e. this counts 'go idle' events and excludes all other types of 
context switches.

You can also see/use it via /debug/tracing/events/*/*/filter. Allowed 
fields are those that are named in /debug/tracing/events/*/*/format.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 12:59             ` Ingo Molnar
  2010-09-06 13:41               ` Pekka Enberg
@ 2010-09-06 14:57               ` Avi Kivity
  2010-09-06 15:30                 ` Alan Cox
  2010-09-06 15:47                 ` Ingo Molnar
  1 sibling, 2 replies; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 14:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Tom Zanussi, Frédéric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel

  On 09/06/2010 03:59 PM, Ingo Molnar wrote:
>
>> Is this a roundabout way of saying "jit"?
> Partly. I'm not sure we want to actually upload programs in bytecode
> form. ASCII is just fine - just like a .gz Javascript is fine for web
> apps. (and in most cases compresses down better than the bytecode
> equivalent)
>
> So a clear language (the simpler initially the better) plus an in-kernel
> compiler.
>
> This could be used for far more than just instrumentation: IMO security
> policies could be expressed in such a way. (Simplified, they are quite
> similar to filters installed on syscall entry/exit, with the ability of
> the filter to influence whether the syscall is performed.)

For me the requirements are:
- turing complete (more than just filters)
- easy interface to kernel APIs (like hrtimers)
- safe to use by untrusted users

The actual language doesn't really matter.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:30                 ` Alan Cox
@ 2010-09-06 15:20                   ` Avi Kivity
  2010-09-06 15:48                     ` Alan Cox
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 15:20 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

  On 09/06/2010 06:30 PM, Alan Cox wrote:
>
>> For me the requirements are:
>> - turing complete (more than just filters)
> Needs infinite storage and may not terminate

Ow come on.  We can always terminate it by inserting checks and 
unwinding the stack; and obviously we'll limit storage.

>> - easy interface to kernel APIs (like hrtimers)
>> - safe to use by untrusted users
>>
>> The actual language doesn't really matter.
> It does for performance and audit. You don't want a JIT as it murders
> cache performance,

Strangely, everyone uses a jit these days unless they're memory 
constrained.  Yes it costs cache, but an interpreter is still slower.

> which means you want
>
> - no self modification

Right.

> - bounded run time

No, I want the ability to terminate the code at any time and clean up 
any resources used.  We have exactly the same requirements for ordinary 
userspace.

> - bounded memory use
> - trustable behaviour for access

Right.

> and usually minimal side effects since you want to optimise very
> heavily and side effects stop that (which is also why Fortran still kicks
> C's backside for crunching)
>
> Not sure you need/want to do the conversion in kernel.

I prefer bytecode as well.

> I'd have thought a
> sane way to handle it would have been to throw stuff at the kernel in
> some kind of semi-sane byte code that can be interpreted by a noddy
> interpreter but firstly when you get it have the kernel try and run a
> helper to compile it.

So you do want to jit?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 14:57               ` Avi Kivity
@ 2010-09-06 15:30                 ` Alan Cox
  2010-09-06 15:20                   ` Avi Kivity
  2010-09-06 15:47                 ` Ingo Molnar
  1 sibling, 1 reply; 53+ messages in thread
From: Alan Cox @ 2010-09-06 15:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

> > This could be used for far more than just instrumentation: IMO security
> > policies could be expressed in such a way. (Simplified, they are quite
> > similar to filters installed on syscall entry/exit, with the ability of
> > the filter to influence whether the syscall is performed.)

Hardly - security policy is almost entirely based on context and state
change, the syscalls causing the state change are usually of minor
interest

(eg we don't care how the uid or chmod bits got set, we care what value
they hold)

> For me the requirements are:
> - turing complete (more than just filters)

Needs infinite storage and may not terminate

> - easy interface to kernel APIs (like hrtimers)
> - safe to use by untrusted users
> 
> The actual language doesn't really matter.

It does for performance and audit. You don't want a JIT as it murders
cache performance, which means you want

- no self modification
- bounded run time
- bounded memory use
- trustable behaviour for access

and usually minimal side effects since you want to optimise very
heavily and side effects stop that (which is also why Fortran still kicks
C's backside for crunching)

Not sure you need/want to do the conversion in kernel. I'd have thought a
sane way to handle it would have been to throw stuff at the kernel in
some kind of semi-sane byte code that can be interpreted by a noddy
interpreter but firstly when you get it have the kernel try and run a
helper to compile it.

Alan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 14:57               ` Avi Kivity
  2010-09-06 15:30                 ` Alan Cox
@ 2010-09-06 15:47                 ` Ingo Molnar
  2010-09-06 17:55                   ` Avi Kivity
                                     ` (3 more replies)
  1 sibling, 4 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-06 15:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pekka Enberg, Tom Zanussi, Frédéric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


* Avi Kivity <avi@redhat.com> wrote:

>  On 09/06/2010 03:59 PM, Ingo Molnar wrote:
> >
> >>Is this a roundabout way of saying "jit"?
> >Partly. I'm not sure we want to actually upload programs in bytecode
> >form. ASCII is just fine - just like a .gz Javascript is fine for web
> >apps. (and in most cases compresses down better than the bytecode
> >equivalent)
> >
> >So a clear language (the simpler initially the better) plus an in-kernel
> >compiler.
> >
> >This could be used for far more than just instrumentation: IMO security
> >policies could be expressed in such a way. (Simplified, they are quite
> >similar to filters installed on syscall entry/exit, with the ability of
> >the filter to influence whether the syscall is performed.)
> 
> For me the requirements are:
> - turing complete (more than just filters)

Yep. Filters are obviously just basically expressions.

Conditions and variables can be added. Maybe loops too in simpler forms 
- as long as we can prove halting - or maybe with a runtime abort 
mechanism.

> - easy interface to kernel APIs (like hrtimers)
> - safe to use by untrusted users

Yep.

> The actual language doesn't really matter.

There are 3 basic categories:

 1- Most (least abstract) specific code: a block of bytecode in the form 
    of a simplified, executable, kernel-checked x86 machine code block - 
    this is also the fastest form. [yes, this is actually possible.]

 2- Least specific (most abstract) code: A subset/sideset of C - as it's 
    the most kernel-developer-trustable/debuggable form.

 3- Everything else little more than a dot on the spectrum between the
    first two points.

I lean towards #2 - but #1 looks interesting too. #3 is distinctly 
uninteresting as it cannot be as fast as #1 and cannot be as convenient 
as #2.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:20                   ` Avi Kivity
@ 2010-09-06 15:48                     ` Alan Cox
  2010-09-06 17:50                       ` Avi Kivity
  0 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2010-09-06 15:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

> No, I want the ability to terminate the code at any time and clean up 
> any resources used.  We have exactly the same requirements for ordinary 
> userspace.

So you intend to prevent its use at any point where we have a kernel only
resource held in any way manner or form ? Remember we don't hold locks or
many kinds of kernel resource across syscalls. Likewise I'm interested
how you will keep it compatible with real time ?

> > - bounded memory use
> > - trustable behaviour for access
> 
> Right.
> 
> > and usually minimal side effects since you want to optimise very
> > heavily and side effects stop that (which is also why Fortran still kicks
> > C's backside for crunching)
> >
> > Not sure you need/want to do the conversion in kernel.
> 
> I prefer bytecode as well.
> 
> > I'd have thought a
> > sane way to handle it would have been to throw stuff at the kernel in
> > some kind of semi-sane byte code that can be interpreted by a noddy
> > interpreter but firstly when you get it have the kernel try and run a
> > helper to compile it.
> 
> So you do want to jit?

Depends what you mean by jit. JIT normally implies compiling/recompiling
as you go. Do we want to compile it once at load time - probably,
assuming the tool is present.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 13:16             ` Steven Rostedt
@ 2010-09-06 16:42               ` Tom Zanussi
  2010-09-07 12:53                 ` Steven Rostedt
  0 siblings, 1 reply; 53+ messages in thread
From: Tom Zanussi @ 2010-09-06 16:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Avi Kivity, linux-perf-users,
	linux-kernel

Hi,

On Mon, 2010-09-06 at 09:16 -0400, Steven Rostedt wrote:
> Sorry for top post, sending via my android phone. Today's a US holiday.
>  
> 
> We can filter before the work, if we expose the parameters. Currently we filter what goes into the buffer and there are several cases where we don't know the result before the work.
> 
> If we also expose the parameters of a TRACE_EVENT, then we can filter on them before the work.

I'm not sure exactly what you mean by exposing the parameters, but yeah,
in general it should be possible to filter on any field you can get the
address of, before you ever allocate space for the event or assign the
field to it.

In the cases where you don't know the result until you do the work, such
as for example this from kvm_age_page tracepoint:

        TP_fast_assign(
                __entry->hva            = hva;
                __entry->gfn            =
                  slot->base_gfn + ((hva - slot->userspace_addr) >>
PAGE_SHIFT);
                __entry->referenced     = ref;
        ),

I guess you'd want the macro to assign the result to a temporary in
order to be able to participate in the filtering, or did you have
something else in mind?

Tom

> -- Steve
> 
> "Ingo Molnar" <mingo@elte.hu> wrote:
> 
> >
> >* Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Mon, 2010-09-06 at 14:58 +0300, Avi Kivity wrote:
> >> > On 09/06/2010 02:54 PM, Peter Zijlstra wrote:
> >> > >
> >> > >> Basically, to read() all events in one go.  I have many of them.
> >> > >>
> >> > >> My current problem is that I have an event (kvm_exit) which I want to
> >> > >> drill down by looking at a field (exit_reason).  So I create lots of
> >> > >> separate perf_events with a filter for each reason:
> >> > >> kvm_exit(exit_reason==0), kvm_exit(exit_reason==1), etc.  But filters
> >> > >> are fairly slow (can have ~60 such events on AMD), so I want to make
> >> > >> this drill-down optional.
> >> > > Yeah, filters suck.
> >> > 
> >> > Any idea why?  I saw nothing obvious in the code, except that there 
> >> > is lots of it.
> >> 
> >> It filters after it does all the hard work of creating the trace 
> >> event, instead of before.
> >
> >Btw., that's an implementational property of filters that would be nice 
> >to fix.
> >
> >If it's not doable within CPP then outside of it.
> >
> >	Ingo
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:48                     ` Alan Cox
@ 2010-09-06 17:50                       ` Avi Kivity
  0 siblings, 0 replies; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 17:50 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

  On 09/06/2010 06:48 PM, Alan Cox wrote:
>> No, I want the ability to terminate the code at any time and clean up
>> any resources used.  We have exactly the same requirements for ordinary
>> userspace.
> So you intend to prevent its use at any point where we have a kernel only
> resource held in any way manner or form ? Remember we don't hold locks or
> many kinds of kernel resource across syscalls.

I don't think we can allow the downloaded code to hold any kernel 
locks.  If the locks are necessary for correct operation, then the 
untrusted code can "forget" to take them.  If they aren't necessary, 
don't expose them in the first place.  All APIs exposed to the code have 
to be thread safe.  We can allow locks that only protect user data 
structures (and so we can just wipe them out if we need to kill the 
code, just like with futexes[1]).

Other resources are easy, we do this everywhere, including for userspace.

> Likewise I'm interested
> how you will keep it compatible with real time ?

It's just like other executing any user code.  Whatever thread executes 
that code is impacted by its real time characteristics.  If a 
non-realtime thread executes it, who cares.  If a real time thread 
executes it, then it's up to the user to guarantee real time behaviour.  
The kernel need only ensure that user code given by a low-priority task 
is not executed in a high priority task.

>> So you do want to jit?
> Depends what you mean by jit. JIT normally implies compiling/recompiling
> as you go. Do we want to compile it once at load time - probably,
> assuming the tool is present.

That's jit enough for me.  These would usually be small code snippets, 
not megabytes of junk, though that would no doubt follow if we do a good 
job.

We could probably s/netfilter/call to user code/ and remove tons of 
potentially vulnerable kernel code, and gain lots of flexibility into 
the bargain.  I'd like to use this for kvm as well.


[1] Ignoring robust futexes, IIUC

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:47                 ` Ingo Molnar
@ 2010-09-06 17:55                   ` Avi Kivity
  2010-09-07  3:44                     ` Ingo Molnar
  2010-09-06 20:31                   ` Pekka Enberg
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-06 17:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Tom Zanussi, Frédéric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel

  On 09/06/2010 06:47 PM, Ingo Molnar wrote:
>
>> The actual language doesn't really matter.
> There are 3 basic categories:
>
>   1- Most (least abstract) specific code: a block of bytecode in the form
>      of a simplified, executable, kernel-checked x86 machine code block -
>      this is also the fastest form. [yes, this is actually possible.]

Do you then recompile it?  x86 is quite unpleasant.

>   2- Least specific (most abstract) code: A subset/sideset of C - as it's
>      the most kernel-developer-trustable/debuggable form.
>
>   3- Everything else little more than a dot on the spectrum between the
>      first two points.
>
> I lean towards #2 - but #1 looks interesting too. #3 is distinctly
> uninteresting as it cannot be as fast as #1 and cannot be as convenient
> as #2.

Curious - how do you guarantee safety of #1 or even #2?  Can you point 
me to any research?

Everything I'm aware of is bytecode with explicit measures to prevent 
forged pointers, but I admit I've spent no time on it.  It's interesting 
stuff, though.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:47                 ` Ingo Molnar
  2010-09-06 17:55                   ` Avi Kivity
@ 2010-09-06 20:31                   ` Pekka Enberg
  2010-09-06 20:37                     ` Pekka Enberg
                                       ` (2 more replies)
  2010-09-07 13:35                   ` Steven Rostedt
  2010-09-12  6:46                   ` Pavel Machek
  3 siblings, 3 replies; 53+ messages in thread
From: Pekka Enberg @ 2010-09-06 20:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

Hi Ingo,

On Mon, Sep 6, 2010 at 6:47 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> The actual language doesn't really matter.
>
> There are 3 basic categories:
>
>  1- Most (least abstract) specific code: a block of bytecode in the form
>    of a simplified, executable, kernel-checked x86 machine code block -
>    this is also the fastest form. [yes, this is actually possible.]
>
>  2- Least specific (most abstract) code: A subset/sideset of C - as it's
>    the most kernel-developer-trustable/debuggable form.
>
>  3- Everything else little more than a dot on the spectrum between the
>    first two points.
>
> I lean towards #2 - but #1 looks interesting too. #3 is distinctly
> uninteresting as it cannot be as fast as #1 and cannot be as convenient
> as #2.

It's a question where you want to push the complexity of parsing the
language and verifying the executed code. I'd image it's easier to
evolve an ABI if we use an intermediate form ("bytecode") on the
kernel side. Supporting multiple versions of a C-like language is
probably going to be painful. You also probably don't want to put
heavy-weight compiler optimization passes in the kernel so with an
intermediate form, you can do much of that in user-space.

I'm guessing this thing is expected to work on all architectures? If
that's true, I'd forget about JIT'ing for the time being and write an
interpreter first because it's much easier to port. There are
techniques in making an interpreter pretty fast too. Google for
"inlining interpreter" if you're interested.

As for the intermediate form, you might want to take a look at Dalvik:

http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html

and probably ParrotVM bytecode too. The thing to avoid is stack-based
instructions like in Java bytecode because although it's easy to write
interpreters for them, it makes JIT'ing harder (which needs to convert
stack-based representation to register-based) and probably doesn't
lend itself well to stack-constrained kernel code.

                        Pekka

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 20:31                   ` Pekka Enberg
@ 2010-09-06 20:37                     ` Pekka Enberg
  2010-09-07  4:03                     ` Ingo Molnar
  2010-09-07 10:57                     ` KOSAKI Motohiro
  2 siblings, 0 replies; 53+ messages in thread
From: Pekka Enberg @ 2010-09-06 20:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

On Mon, Sep 6, 2010 at 6:47 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>> The actual language doesn't really matter.
>>
>> There are 3 basic categories:
>>
>>  1- Most (least abstract) specific code: a block of bytecode in the form
>>    of a simplified, executable, kernel-checked x86 machine code block -
>>    this is also the fastest form. [yes, this is actually possible.]
>>
>>  2- Least specific (most abstract) code: A subset/sideset of C - as it's
>>    the most kernel-developer-trustable/debuggable form.
>>
>>  3- Everything else little more than a dot on the spectrum between the
>>    first two points.
>>
>> I lean towards #2 - but #1 looks interesting too. #3 is distinctly
>> uninteresting as it cannot be as fast as #1 and cannot be as convenient
>> as #2.

2010/9/6 Pekka Enberg <penberg@kernel.org>:
> It's a question where you want to push the complexity of parsing the
> language and verifying the executed code. I'd image it's easier to
> evolve an ABI if we use an intermediate form ("bytecode") on the
> kernel side. Supporting multiple versions of a C-like language is
> probably going to be painful. You also probably don't want to put
> heavy-weight compiler optimization passes in the kernel so with an
> intermediate form, you can do much of that in user-space.
>
> I'm guessing this thing is expected to work on all architectures? If
> that's true, I'd forget about JIT'ing for the time being and write an
> interpreter first because it's much easier to port. There are
> techniques in making an interpreter pretty fast too. Google for
> "inlining interpreter" if you're interested.
>
> As for the intermediate form, you might want to take a look at Dalvik:
>
> http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html
>
> and probably ParrotVM bytecode too. The thing to avoid is stack-based
> instructions like in Java bytecode because although it's easy to write
> interpreters for them, it makes JIT'ing harder (which needs to convert
> stack-based representation to register-based) and probably doesn't
> lend itself well to stack-constrained kernel code.

Btw, the alternative route is to imitate how compilers like tcc and
lcc (and IIRC go and one of the plan9 languages) go about compiling C
language code into native code directly. They are pretty light-weight,
fast, and generate decent code. The down-side is that verification is
going to be more tricky and ABI issues might turn out to be nasty.

                        Pekka

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 17:55                   ` Avi Kivity
@ 2010-09-07  3:44                     ` Ingo Molnar
  2010-09-07  8:33                       ` Stefan Hajnoczi
                                         ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-07  3:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pekka Enberg, Tom Zanussi, Frédéric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


* Avi Kivity <avi@redhat.com> wrote:

>  On 09/06/2010 06:47 PM, Ingo Molnar wrote:
> >
> >>The actual language doesn't really matter.
> >There are 3 basic categories:
> >
> >  1- Most (least abstract) specific code: a block of bytecode in the form
> >     of a simplified, executable, kernel-checked x86 machine code block -
> >     this is also the fastest form. [yes, this is actually possible.]
> 
> Do you then recompile it? [...]

No, it's machine code. It's 'safe x86 bytecode executed natively by the 
kernel as a function'.

It needs a verification pass (because the code can come from untrusted 
apps) so that we can copy, verify and trust it (so obviously it's not 
_arbitrary_ x86 machine code - a safe subset of x86) - maybe with a sha1 
based cache for already-verified snippets (or a fast verifier).

> x86 is quite unpleasant.

Any machine code that is fast and compact is unpleasant almost by 
definition: it's a rather non-obvious Huffman encoding embedded in an 
instruction architecture.

But that's the life of kernel hackers, we deal with difficult things. 
(We could have made a carreer choice of selling icecream instead, but 
it's too late i suspect.)

> >  2- Least specific (most abstract) code: A subset/sideset of C - as it's
> >     the most kernel-developer-trustable/debuggable form.
> >
> >  3- Everything else little more than a dot on the spectrum between the
> >     first two points.
> >
> > I lean towards #2 - but #1 looks interesting too. #3 is distinctly 
> > uninteresting as it cannot be as fast as #1 and cannot be as 
> > convenient as #2.
> 
> Curious - how do you guarantee safety of #1 or even #2? [...]

Safety of #1 (x86 bytecode passed in by untrusted user-space, verified 
and saved by the kernel and executed natively as an x86 function if it 
passes the security checks) is trivial but obviously needs quite a bit 
of work.

We start with trivial (and useless) special case of something like:

#define MAX_BYTECODE_SIZE 256

int x86_bytecode_verify(char *opcodes, unsigned int len)
{

	if (len-1 > MAX_BYTECODE_SIZE-1)
		return -EINVAL;

	if (opcodes[0] != 0xc3) /* RET instruction */
		return -EINVAL;

	return 0;
}

... and then we add checks for accepted/safe x86 patterns of 
instructions step by step - always keeping it 100% correct.

Initially it would only allow general register operations with some 
input and output parameters in registers, and a wrapper would 
save/restore those general registers - later on stack operands and 
globals could be added too.

That's not yet Turing complete but already quite functional: an amazing 
amount of logic can be expressed via generic register ops only - i think 
the filter engine could be implemented via that for example.

We'd eventually make it Turing complete in the operations space we care 
about: a fixed-size stack sandbox and a virtual memory window sandbox 
area, allow conditional jumps (only to instruction boundaries).

The code itself is copied into kernel-space and immutable after it has 
been verified.

The point is to decode only safe instructions we know, and to always 
have a 'safe' core of checking code we can extend safely and 
iteratively.

Safety of #2 (C code) is like the filter engine: it's safe right now, as 
it parses the ASCII expression in-kernel, compiles it into predicaments 
and executes those predicament (which are baby instructions really) 
safely.

Every extension needs to be done safely, of course - and more complex 
language constructs will complicate matters for sure.

Note that we have (small) bits of #1 done already in the kernel: the x86 
disassembler. Any instruction pattern we dont know or dont trust we punt 
on.

( Also note that beyond native execution this 'x86 bytecode' approach 
  would still allow JIT techniques, if we are so inclined: x86 bytecode, 
  because we fully verify it and fully know its structure (and exclude 
  nasties like self-modifying code) can be re-JIT-ed just fine.

  Common sequences might even be pre-JIT-ed and cached in a hash. That 
  way we could make sequences faster post facto, via a kernel change 
  only, without impacting any user-space which only passes in the 'old' 
  sequence. Lots of flexibility. )

> Can you point me to any research?

Nope, havent seen this 'safe native x86 bytecode' idea 
mentioned/researched anywhere yet.

> Everything I'm aware of is bytecode with explicit measures to prevent 
> forged pointers, but I admit I've spent no time on it.  It's 
> interesting stuff, though.

I think some Java-like bytecode is roughly the same amount of conceptual 
work as an x86 bytecode verifier, with the big disadvantage that even 
with a JIT it's much slower [and a JIT is far from simple] - not to 
mention the non-technical complications of Java.

> I have a truly marvellous patch that fixes the bug which this 
> signature is too narrow to contain.

Make sure you write down a short but buggy version of the patch on the 
margin of a book. Pass on the book to your heirs and enjoy the centuries 
long confusion from the heavens.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 20:31                   ` Pekka Enberg
  2010-09-06 20:37                     ` Pekka Enberg
@ 2010-09-07  4:03                     ` Ingo Molnar
  2010-09-07  9:30                       ` Pekka Enberg
  2010-09-07 10:57                     ` KOSAKI Motohiro
  2 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2010-09-07  4:03 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel


* Pekka Enberg <penberg@kernel.org> wrote:

> Hi Ingo,
> 
> On Mon, Sep 6, 2010 at 6:47 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> The actual language doesn't really matter.
> >
> > There are 3 basic categories:
> >
> >  1- Most (least abstract) specific code: a block of bytecode in the form
> >    of a simplified, executable, kernel-checked x86 machine code block -
> >    this is also the fastest form. [yes, this is actually possible.]
> >
> >  2- Least specific (most abstract) code: A subset/sideset of C - as it's
> >    the most kernel-developer-trustable/debuggable form.
> >
> >  3- Everything else little more than a dot on the spectrum between the
> >    first two points.
> >
> > I lean towards #2 - but #1 looks interesting too. #3 is distinctly
> > uninteresting as it cannot be as fast as #1 and cannot be as convenient
> > as #2.
> 
> It's a question where you want to push the complexity of parsing the 
> language and verifying the executed code. I'd image it's easier to 
> evolve an ABI if we use an intermediate form ("bytecode") on the 
> kernel side. Supporting multiple versions of a C-like language is 
> probably going to be painful. [...]

Not really, as it's only extended. So there's really just one version to 
support for every kernel - it's just that user-space will initially only 
use 'older' elements of the language.

> [...] You also probably don't want to put heavy-weight compiler 
> optimization passes in the kernel so with an intermediate form, you 
> can do much of that in user-space.

The question of what can and cannot be done in the kernel is overrated. 
We sure can put a C compiler into the kernel - 10 years down the line we 
wont understand what the fuss was all about.

I still remember all the silly 'graphics code should never be in the 
kernel, it's way too complex and fragile' arguments from 1996.

What matters is that it's a hugely flexible and hugely useful feature. 
All our ad-hoc script engines in the kernel (trace-filter, selinux, 
netfilter), etc. could be implemented via it.

And it would allow fantastic feature beyond existing code.

For example a new category of filesystem could be created: with a 
'self-defining layout' - by storing the C code of the filesystem data 
structures _on-disk_.

A filesystem could have a new, more optimal layout by simply having new 
format routines defined in C, stored on disk (in the superblock, or in a 
block referred to by inodes). Old filesystem layouts would be compatible 
forever: the C code is on-disk and never lost as long as the data is 
there - etc.

New filesystem features could be created in a very flexible way, without 
risking old data.

Mixed mode filesystems would be possible: new files get the new logic, 
old files the old logic. This would allow the gradual migration to a new 
filesystem layout for example, without a reinstall.

etc.

Key is to have a kernel that can execute code as data and to embedd that 
code in data structures.

> I'm guessing this thing is expected to work on all architectures? If 
> that's true, I'd forget about JIT'ing for the time being and write an 
> interpreter first because it's much easier to port. There are 
> techniques in making an interpreter pretty fast too. Google for 
> "inlining interpreter" if you're interested.

Yeah, i dont think speed is a primary concern - if overhead matters it 
will be clearly measurable and people can iterate the optimizations ...

> As for the intermediate form, you might want to take a look at Dalvik:
> 
> http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html
> 
> and probably ParrotVM bytecode too. The thing to avoid is stack-based 
> instructions like in Java bytecode because although it's easy to write 
> interpreters for them, it makes JIT'ing harder (which needs to convert 
> stack-based representation to register-based) and probably doesn't 
> lend itself well to stack-constrained kernel code.

_If_ we pass in any sort of machine code to the kernel (which bytecode 
really is), then we should do the right thing and pass in raw x86 
bytecode, and verify it in the kernel.

That way the compiler can be kept out of the kernel, and performance of 
the thing will be phenomenal from day 1 on.

For non-x86 in most cases we can use a simple translator that runs 
during the verification run - or of course they could have their own 
native 'assembly bytecode' verifier and their user-space could compile 
to those.

But i'd prefer C code really, as it's really 'abstract data' in the most 
generic sense. That's why the trace filter engine started with a subset 
of C.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  3:44                     ` Ingo Molnar
@ 2010-09-07  8:33                       ` Stefan Hajnoczi
  2010-09-07  9:13                         ` Avi Kivity
  2010-09-07 22:43                         ` Ingo Molnar
  2010-09-07 15:55                       ` Alan Cox
  2010-09-08  1:44                       ` Paul Mackerras
  2 siblings, 2 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2010-09-07  8:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

On Tue, Sep 7, 2010 at 4:44 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Avi Kivity <avi@redhat.com> wrote:
>
>>  On 09/06/2010 06:47 PM, Ingo Molnar wrote:
>> >
>> >>The actual language doesn't really matter.
>> >There are 3 basic categories:
>> >
>> >  1- Most (least abstract) specific code: a block of bytecode in the form
>> >     of a simplified, executable, kernel-checked x86 machine code block -
>> >     this is also the fastest form. [yes, this is actually possible.]
>>
>> Do you then recompile it? [...]
>
> No, it's machine code. It's 'safe x86 bytecode executed natively by the
> kernel as a function'.
>
> It needs a verification pass (because the code can come from untrusted
> apps) so that we can copy, verify and trust it (so obviously it's not
> _arbitrary_ x86 machine code - a safe subset of x86) - maybe with a sha1
> based cache for already-verified snippets (or a fast verifier).
>
>> x86 is quite unpleasant.
>
> Any machine code that is fast and compact is unpleasant almost by
> definition: it's a rather non-obvious Huffman encoding embedded in an
> instruction architecture.
>
> But that's the life of kernel hackers, we deal with difficult things.
> (We could have made a carreer choice of selling icecream instead, but
> it's too late i suspect.)
>
>> >  2- Least specific (most abstract) code: A subset/sideset of C - as it's
>> >     the most kernel-developer-trustable/debuggable form.
>> >
>> >  3- Everything else little more than a dot on the spectrum between the
>> >     first two points.
>> >
>> > I lean towards #2 - but #1 looks interesting too. #3 is distinctly
>> > uninteresting as it cannot be as fast as #1 and cannot be as
>> > convenient as #2.
>>
>> Curious - how do you guarantee safety of #1 or even #2? [...]
>
> Safety of #1 (x86 bytecode passed in by untrusted user-space, verified
> and saved by the kernel and executed natively as an x86 function if it
> passes the security checks) is trivial but obviously needs quite a bit
> of work.
>
> We start with trivial (and useless) special case of something like:
>
> #define MAX_BYTECODE_SIZE 256
>
> int x86_bytecode_verify(char *opcodes, unsigned int len)
> {
>
>        if (len-1 > MAX_BYTECODE_SIZE-1)
>                return -EINVAL;
>
>        if (opcodes[0] != 0xc3) /* RET instruction */
>                return -EINVAL;
>
>        return 0;
> }
>
> ... and then we add checks for accepted/safe x86 patterns of
> instructions step by step - always keeping it 100% correct.
>
> Initially it would only allow general register operations with some
> input and output parameters in registers, and a wrapper would
> save/restore those general registers - later on stack operands and
> globals could be added too.
>
> That's not yet Turing complete but already quite functional: an amazing
> amount of logic can be expressed via generic register ops only - i think
> the filter engine could be implemented via that for example.
>
> We'd eventually make it Turing complete in the operations space we care
> about: a fixed-size stack sandbox and a virtual memory window sandbox
> area, allow conditional jumps (only to instruction boundaries).
>
> The code itself is copied into kernel-space and immutable after it has
> been verified.
>
> The point is to decode only safe instructions we know, and to always
> have a 'safe' core of checking code we can extend safely and
> iteratively.
>
> Safety of #2 (C code) is like the filter engine: it's safe right now, as
> it parses the ASCII expression in-kernel, compiles it into predicaments
> and executes those predicament (which are baby instructions really)
> safely.
>
> Every extension needs to be done safely, of course - and more complex
> language constructs will complicate matters for sure.
>
> Note that we have (small) bits of #1 done already in the kernel: the x86
> disassembler. Any instruction pattern we dont know or dont trust we punt
> on.
>
> ( Also note that beyond native execution this 'x86 bytecode' approach
>  would still allow JIT techniques, if we are so inclined: x86 bytecode,
>  because we fully verify it and fully know its structure (and exclude
>  nasties like self-modifying code) can be re-JIT-ed just fine.
>
>  Common sequences might even be pre-JIT-ed and cached in a hash. That
>  way we could make sequences faster post facto, via a kernel change
>  only, without impacting any user-space which only passes in the 'old'
>  sequence. Lots of flexibility. )
>
>> Can you point me to any research?
>
> Nope, havent seen this 'safe native x86 bytecode' idea
> mentioned/researched anywhere yet.

Native Client: A Sandbox for Portable, Untrusted x86 Native Code, IEEE
Symposium on Security and Privacy, May 2009
http://nativeclient.googlecode.com/svn/data/docs_tarball/nacl/googleclient/native_client/documentation/nacl_paper.pdf

The "Inner Sandbox" they talk about verifies a subset of x86 code.
For indirect control flow (computed jumps), they introduce a new
instruction that can do run-time checking of the destination address.

IIRC they have a patched gcc toolchain that can compile to this subset of x86.

Stefan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  8:33                       ` Stefan Hajnoczi
@ 2010-09-07  9:13                         ` Avi Kivity
  2010-09-07 22:43                         ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Avi Kivity @ 2010-09-07  9:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

  On 09/07/2010 11:33 AM, Stefan Hajnoczi wrote:
>
> Native Client: A Sandbox for Portable, Untrusted x86 Native Code, IEEE
> Symposium on Security and Privacy, May 2009
> http://nativeclient.googlecode.com/svn/data/docs_tarball/nacl/googleclient/native_client/documentation/nacl_paper.pdf
>
> The "Inner Sandbox" they talk about verifies a subset of x86 code.
> For indirect control flow (computed jumps), they introduce a new
> instruction that can do run-time checking of the destination address.

Interesting, but appears to rely on x86 segmentation, which isn't 
available on x86_64.

Removing that requirement means replacing indirect memory access by a 
new instruction that does run-time checking, like indirect control flow, 
which is likely to kill performance.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  4:03                     ` Ingo Molnar
@ 2010-09-07  9:30                       ` Pekka Enberg
  2010-09-07 22:27                         ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Pekka Enberg @ 2010-09-07  9:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Avi Kivity, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

Hi Ingo,

On 9/7/10 7:03 AM, Ingo Molnar wrote:
> But i'd prefer C code really, as it's really 'abstract data' in the most
> generic sense. That's why the trace filter engine started with a subset
> of C.

I think it sounds better in principle than what it will be in practice. 
The OpenGL shadling language the same kind of model where you use an API 
call to upload C-like code that gets parsed. That of course has the 
unfortunate side-effect that compilation error reporting isn't all that 
user-friendly because you have to query for errors separately.

I think we've seen with ftrace vs. perf that it's easier to write rich, 
user-friendly interfaces in userspace than in kernel-space.

>> [...] You also probably don't want to put heavy-weight compiler
>> optimization passes in the kernel so with an intermediate form, you
>> can do much of that in user-space.
>
> The question of what can and cannot be done in the kernel is overrated.
> We sure can put a C compiler into the kernel - 10 years down the line we
> wont understand what the fuss was all about.

Yeah, I'm not saying we can't do that but it's a big chunk of code that 
can be potentially exploited.

>> As for the intermediate form, you might want to take a look at Dalvik:
>>
>> http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html
>>
>> and probably ParrotVM bytecode too. The thing to avoid is stack-based
>> instructions like in Java bytecode because although it's easy to write
>> interpreters for them, it makes JIT'ing harder (which needs to convert
>> stack-based representation to register-based) and probably doesn't
>> lend itself well to stack-constrained kernel code.
>
> _If_ we pass in any sort of machine code to the kernel (which bytecode
> really is), then we should do the right thing and pass in raw x86
> bytecode, and verify it in the kernel.
>
> That way the compiler can be kept out of the kernel, and performance of
> the thing will be phenomenal from day 1 on.
>
> For non-x86 in most cases we can use a simple translator that runs
> during the verification run - or of course they could have their own
> native 'assembly bytecode' verifier and their user-space could compile
> to those.

If you'd go for x86 as 'assembly bytecode' which ISA would you pick? 
32-bit or 64-bit? I can see problems with both of them:

   - The register set that can be encoded with 32-bit ISA is very
     limited which will force us to spill in memory.

   - The 64-bit ISA with REX prefixes is unnecessarily fat.

   - Instructions work directly on memory addresses which makes
     verification harder

   - The 32-bit ABI uses stack for argument passing which forces us
     to verify that operations on stack make sense.

OTOH, if the ABI is that you upload _native code_ on every architecture, 
then the trade-off makes more sense to me. The downside is that we'd 
need a separate verifier for each architecture, though.

			Pekka

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 20:31                   ` Pekka Enberg
  2010-09-06 20:37                     ` Pekka Enberg
  2010-09-07  4:03                     ` Ingo Molnar
@ 2010-09-07 10:57                     ` KOSAKI Motohiro
  2010-09-07 12:14                       ` Pekka Enberg
  2 siblings, 1 reply; 53+ messages in thread
From: KOSAKI Motohiro @ 2010-09-07 10:57 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Ingo Molnar, Avi Kivity, Pekka Enberg,
	Tom Zanussi, Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

> As for the intermediate form, you might want to take a look at Dalvik:
> 
> http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html
> 
> and probably ParrotVM bytecode too. The thing to avoid is stack-based
> instructions like in Java bytecode because although it's easy to write
> interpreters for them, it makes JIT'ing harder (which needs to convert
> stack-based representation to register-based) and probably doesn't
> lend itself well to stack-constrained kernel code.

(offtopic)

Afaik, NetBSD plan to include lua interpreter in kernel. it is optimized embedded environment.


(more offtopic)

in kernel interpreter is needed some concern. 1) restricted stack size (typical userland VM
often use >100K stack size)  2) restrected memory allocation, especially high order allocation 
often fail. 3) GC often makes unacceptable large lag especially on UP kernel. etc etc
So, We can't apply rich interpreter (e.g. Dalvik, Parrrot) so easily. I think. personally I prefer
minimum component.

Thanks.




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07 10:57                     ` KOSAKI Motohiro
@ 2010-09-07 12:14                       ` Pekka Enberg
  0 siblings, 0 replies; 53+ messages in thread
From: Pekka Enberg @ 2010-09-07 12:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

On Tue, Sep 7, 2010 at 1:57 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> As for the intermediate form, you might want to take a look at Dalvik:
>>
>> http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html
>>
>> and probably ParrotVM bytecode too. The thing to avoid is stack-based
>> instructions like in Java bytecode because although it's easy to write
>> interpreters for them, it makes JIT'ing harder (which needs to convert
>> stack-based representation to register-based) and probably doesn't
>> lend itself well to stack-constrained kernel code.
>
> (offtopic)
>
> Afaik, NetBSD plan to include lua interpreter in kernel. it is optimized embedded environment.
>
> (more offtopic)
>
> in kernel interpreter is needed some concern. 1) restricted stack size (typical userland VM
> often use >100K stack size)  2) restrected memory allocation, especially high order allocation
> often fail. 3) GC often makes unacceptable large lag especially on UP kernel. etc etc
> So, We can't apply rich interpreter (e.g. Dalvik, Parrrot) so easily. I think. personally I prefer
> minimum component.

Yes, we definitely don't want to support memory allocation in the
first stages (if ever). I didn't mean that we should integrate Dalvik
or Parriot but that we should look at their _intermediate code_
("bytecode") as an example what we could design for the kernel.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 16:42               ` Tom Zanussi
@ 2010-09-07 12:53                 ` Steven Rostedt
  2010-09-07 14:16                   ` Tom Zanussi
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2010-09-07 12:53 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Ingo Molnar, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Avi Kivity, linux-perf-users,
	linux-kernel

On Mon, 2010-09-06 at 11:42 -0500, Tom Zanussi wrote:
> Hi,
> 
> On Mon, 2010-09-06 at 09:16 -0400, Steven Rostedt wrote:
> > Sorry for top post, sending via my android phone. Today's a US holiday.
> >  
> > 
> > We can filter before the work, if we expose the parameters. Currently we filter what goes into the buffer and there are several cases where we don't know the result before the work.
> > 
> > If we also expose the parameters of a TRACE_EVENT, then we can filter on them before the work.
> 
> I'm not sure exactly what you mean by exposing the parameters, but yeah,
> in general it should be possible to filter on any field you can get the
> address of, before you ever allocate space for the event or assign the
> field to it.
> 
> In the cases where you don't know the result until you do the work, such
> as for example this from kvm_age_page tracepoint:
> 
>         TP_fast_assign(
>                 __entry->hva            = hva;
>                 __entry->gfn            =
>                   slot->base_gfn + ((hva - slot->userspace_addr) >>
> PAGE_SHIFT);
>                 __entry->referenced     = ref;
>         ),
> 
> I guess you'd want the macro to assign the result to a temporary in
> order to be able to participate in the filtering, or did you have
> something else in mind?

Here we already did the work. We assigned everything, then might as well
keep what we have. What I had in mind was:

TRACE_EVENT(kvm_age_page,
        TP_PROTO(ulong hva, struct kvm_memory_slot *slot, int ref),
        TP_ARGS(hva, slot, ref),

and be able to make a filter that says: slot->userspace_addr > X

Where we filter the parameters and not what is in the entry.
Unfortunately, this would make TRACE_EVENT() even bigger since we would
need to store what the parameter names are.

-- Steve





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:47                 ` Ingo Molnar
  2010-09-06 17:55                   ` Avi Kivity
  2010-09-06 20:31                   ` Pekka Enberg
@ 2010-09-07 13:35                   ` Steven Rostedt
  2010-09-07 13:47                     ` Avi Kivity
  2010-09-12  6:46                   ` Pavel Machek
  3 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2010-09-07 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Peter Zijlstra, linux-perf-users, linux-kernel

On Mon, 2010-09-06 at 17:47 +0200, Ingo Molnar wrote:

> > The actual language doesn't really matter.
> 
> There are 3 basic categories:
> 
>  1- Most (least abstract) specific code: a block of bytecode in the form 
>     of a simplified, executable, kernel-checked x86 machine code block - 
>     this is also the fastest form. [yes, this is actually possible.]
> 
>  2- Least specific (most abstract) code: A subset/sideset of C - as it's 
>     the most kernel-developer-trustable/debuggable form.
> 
>  3- Everything else little more than a dot on the spectrum between the
>     first two points.
> 
> I lean towards #2 - but #1 looks interesting too. #3 is distinctly 
> uninteresting as it cannot be as fast as #1 and cannot be as convenient 
> as #2.

I would lean to passing a limited assembly language to the kernel, in
ASCII. This would do the following:

1) probably the easiest to verify.

2) we could write a simple interpreter that all archs can use

3) each arch can have a simple compiler to convert the assembly to
native byte code to optimize it.


The input, output and memory heap can be expressed and the kernel can
grant or deny any of what is touched.

Now here's some of my concerns for any of this. Using the kvm tracepoint
as an example:

slot->base_gfn + ((hva - slot->userspace_addr) >> PAGE_SHIFT)


If we were given "slot" and now we need to dereference it to get
base_gfn or userspace_addr, how would the kernel know this is a valid
address that can be read? Seems to me that this may allow userspace to
trivially see parts of the kernel that was never meant to be seen.

One reason that ftrace only allows root access, is that the kernel is
best a black box for most userspace.  Letting userspace see how SELinux
is treating it, and finding addresses that SELinux is using, can give a
large arsenal to black hats that are writing tools to circumvent Linux
security.

Unless we only let this interpreter access the inputs and its own
allocated memory, it will be very difficult to verify what the
interpreter is doing. I guess one thing we could do is to have a table
of places in the kernel that we let userspace see. This table will need
strict scrutinizing to verify that it can't be used to exploit other
parts of the kernel.

-- Steve
  


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07 13:35                   ` Steven Rostedt
@ 2010-09-07 13:47                     ` Avi Kivity
  2010-09-07 16:02                       ` Steven Rostedt
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-07 13:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Peter Zijlstra, linux-perf-users, linux-kernel

  On 09/07/2010 04:35 PM, Steven Rostedt wrote:
> On Mon, 2010-09-06 at 17:47 +0200, Ingo Molnar wrote:
>
>>> The actual language doesn't really matter.
>> There are 3 basic categories:
>>
>>   1- Most (least abstract) specific code: a block of bytecode in the form
>>      of a simplified, executable, kernel-checked x86 machine code block -
>>      this is also the fastest form. [yes, this is actually possible.]
>>
>>   2- Least specific (most abstract) code: A subset/sideset of C - as it's
>>      the most kernel-developer-trustable/debuggable form.
>>
>>   3- Everything else little more than a dot on the spectrum between the
>>      first two points.
>>
>> I lean towards #2 - but #1 looks interesting too. #3 is distinctly
>> uninteresting as it cannot be as fast as #1 and cannot be as convenient
>> as #2.
> I would lean to passing a limited assembly language to the kernel, in
> ASCII. This would do the following:
>
> 1) probably the easiest to verify.
>
> 2) we could write a simple interpreter that all archs can use
>
> 3) each arch can have a simple compiler to convert the assembly to
> native byte code to optimize it.
>
>
> The input, output and memory heap can be expressed and the kernel can
> grant or deny any of what is touched.
>
> Now here's some of my concerns for any of this. Using the kvm tracepoint
> as an example:
>
> slot->base_gfn + ((hva - slot->userspace_addr)>>  PAGE_SHIFT)

We can't allow untrusted access to random kernel memory.

Let's take netfilter as an example.  Userspace downloads bytecode to 
determine whether to allow a packet or not, or to mangle it.  The kernel 
exposes APIs to read and write the packet, access the conntrack hash, 
and whatever else is needed.  The bytecode reads the packet, allows, 
denies or mangles to taste, and exits.

> If we were given "slot" and now we need to dereference it to get
> base_gfn or userspace_addr, how would the kernel know this is a valid
> address that can be read? Seems to me that this may allow userspace to
> trivially see parts of the kernel that was never meant to be seen.

I don't understand this example.  Why would you need such bytecode?

For untrusted filters, you only allow access to tracepoint arguments.  
For trusted filters, perhaps, you can allow arbitrary memory access at 
the user's own risk.

> One reason that ftrace only allows root access, is that the kernel is
> best a black box for most userspace.  Letting userspace see how SELinux
> is treating it, and finding addresses that SELinux is using, can give a
> large arsenal to black hats that are writing tools to circumvent Linux
> security.
>
> Unless we only let this interpreter access the inputs and its own
> allocated memory, it will be very difficult to verify what the
> interpreter is doing. I guess one thing we could do is to have a table
> of places in the kernel that we let userspace see. This table will need
> strict scrutinizing to verify that it can't be used to exploit other
> parts of the kernel.

The way I see it, we expose a function pointer vector to the untrusted 
code, similar to the syscall vector.  Trusted code may also see 
functions to access kernel memory (or we just loosen up the validation 
rules).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07 12:53                 ` Steven Rostedt
@ 2010-09-07 14:16                   ` Tom Zanussi
  0 siblings, 0 replies; 53+ messages in thread
From: Tom Zanussi @ 2010-09-07 14:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Avi Kivity, linux-perf-users,
	linux-kernel

On Tue, 2010-09-07 at 08:53 -0400, Steven Rostedt wrote:
> On Mon, 2010-09-06 at 11:42 -0500, Tom Zanussi wrote:
> > Hi,
> > 
> > On Mon, 2010-09-06 at 09:16 -0400, Steven Rostedt wrote:
> > > Sorry for top post, sending via my android phone. Today's a US holiday.
> > >  
> > > 
> > > We can filter before the work, if we expose the parameters. Currently we filter what goes into the buffer and there are several cases where we don't know the result before the work.
> > > 
> > > If we also expose the parameters of a TRACE_EVENT, then we can filter on them before the work.
> > 
> > I'm not sure exactly what you mean by exposing the parameters, but yeah,
> > in general it should be possible to filter on any field you can get the
> > address of, before you ever allocate space for the event or assign the
> > field to it.
> > 
> > In the cases where you don't know the result until you do the work, such
> > as for example this from kvm_age_page tracepoint:
> > 
> >         TP_fast_assign(
> >                 __entry->hva            = hva;
> >                 __entry->gfn            =
> >                   slot->base_gfn + ((hva - slot->userspace_addr) >>
> > PAGE_SHIFT);
> >                 __entry->referenced     = ref;
> >         ),
> > 
> > I guess you'd want the macro to assign the result to a temporary in
> > order to be able to participate in the filtering, or did you have
> > something else in mind?
> 
> Here we already did the work. We assigned everything, then might as well
> keep what we have. What I had in mind was:
> 
> TRACE_EVENT(kvm_age_page,
>         TP_PROTO(ulong hva, struct kvm_memory_slot *slot, int ref),
>         TP_ARGS(hva, slot, ref),
> 
> and be able to make a filter that says: slot->userspace_addr > X
> 
> Where we filter the parameters and not what is in the entry.
> Unfortunately, this would make TRACE_EVENT() even bigger since we would
> need to store what the parameter names are.
> 

Yeah, that would be nice to be able to do someday, with the full-fledged
interpreter being discussed; I was thinking for now of only the minimal
set of changes to the current filtering code needed to do the filtering
before the events hit the buffer, in which case I think for complex
expressions like this, you have to do the work of evaluating the full
expression before using the end result in a filter. 

Tom

> -- Steve
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  3:44                     ` Ingo Molnar
  2010-09-07  8:33                       ` Stefan Hajnoczi
@ 2010-09-07 15:55                       ` Alan Cox
  2010-09-08  1:44                       ` Paul Mackerras
  2 siblings, 0 replies; 53+ messages in thread
From: Alan Cox @ 2010-09-07 15:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

> Safety of #1 (x86 bytecode passed in by untrusted user-space, verified 
> and saved by the kernel and executed natively as an x86 function if it 
> passes the security checks) is trivial but obviously needs quite a bit 
> of work.

Hardly trivial - and it will always be buggy.

As well as the fact your interpreter is going to have bugs its also no
longer portable. If you have a sane input code and verify that then
compile it you get portability and verifiability.

> > Can you point me to any research?
> 
> Nope, havent seen this 'safe native x86 bytecode' idea 
> mentioned/researched anywhere yet.

Its been done as a linux arch experiment using a trusted assembler.

> I think some Java-like bytecode is roughly the same amount of conceptual 
> work as an x86 bytecode verifier, with the big disadvantage that even 
> with a JIT it's much slower [and a JIT is far from simple] - not to 
> mention the non-technical complications of Java.

The Java JIT is horrible. A better intermediate with compiler available
looks more promising. How about the qemu or valgrind intermediates ?

> 
> > I have a truly marvellous patch that fixes the bug which this 
> > signature is too narrow to contain.
> 
> Make sure you write down a short but buggy version of the patch on the 
> margin of a book. Pass on the book to your heirs and enjoy the centuries 
> long confusion from the heavens.

I'm sure the perl version will fit ;)

Alan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07 13:47                     ` Avi Kivity
@ 2010-09-07 16:02                       ` Steven Rostedt
  0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2010-09-07 16:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Peter Zijlstra, linux-perf-users, linux-kernel

On Tue, 2010-09-07 at 16:47 +0300, Avi Kivity wrote:

> > Now here's some of my concerns for any of this. Using the kvm tracepoint
> > as an example:
> >
> > slot->base_gfn + ((hva - slot->userspace_addr)>>  PAGE_SHIFT)
> 
> We can't allow untrusted access to random kernel memory.
> 
> Let's take netfilter as an example.  Userspace downloads bytecode to 
> determine whether to allow a packet or not, or to mangle it.  The kernel 
> exposes APIs to read and write the packet, access the conntrack hash, 
> and whatever else is needed.  The bytecode reads the packet, allows, 
> denies or mangles to taste, and exits.
> 
> > If we were given "slot" and now we need to dereference it to get
> > base_gfn or userspace_addr, how would the kernel know this is a valid
> > address that can be read? Seems to me that this may allow userspace to
> > trivially see parts of the kernel that was never meant to be seen.
> 
> I don't understand this example.  Why would you need such bytecode?

I was just using this as if we were to use this bytecode for filtering
on parameters and we wanted to look at the same stuff that goes into the
buffers, but before we touch the buffer code.

> 
> For untrusted filters, you only allow access to tracepoint arguments.  
> For trusted filters, perhaps, you can allow arbitrary memory access at 
> the user's own risk.

I was thinking this too.

> 
> > One reason that ftrace only allows root access, is that the kernel is
> > best a black box for most userspace.  Letting userspace see how SELinux
> > is treating it, and finding addresses that SELinux is using, can give a
> > large arsenal to black hats that are writing tools to circumvent Linux
> > security.
> >
> > Unless we only let this interpreter access the inputs and its own
> > allocated memory, it will be very difficult to verify what the
> > interpreter is doing. I guess one thing we could do is to have a table
> > of places in the kernel that we let userspace see. This table will need
> > strict scrutinizing to verify that it can't be used to exploit other
> > parts of the kernel.
> 
> The way I see it, we expose a function pointer vector to the untrusted 
> code, similar to the syscall vector.  Trusted code may also see 
> functions to access kernel memory (or we just loosen up the validation 
> rules).

Ah, kind of what the tracepoints already do. Taking the same example:

        TP_fast_assign(
                __entry->hva            = hva;
                __entry->gfn            =
                  slot->base_gfn + ((hva - slot->userspace_addr) >> PAGE_SHIFT);
                __entry->referenced     = ref;
        ),

Have a function does the slot->base_gfn... work for us. It would also be
required that the only input is indeed the slot pointer, otherwise you
could dereference any tracepoint argument. So the arguments to the
tracepoint may get a list of helper functions that allow filtering on
dereferences to them.

This is getting a bit over-engineered IMO. If we want to do something
with the tracepoints, it should start out with simply taking the
arguments that are passed in, and then manipulating them and testing
them. Perhaps we allow a single page of heap to let the algorithms work
with.

Later, we could add functionality that could be triggered with a
condition. The first thing that comes to mind is a way to trigger the
start of tracing or stopping a trace.

-- Steve



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  9:30                       ` Pekka Enberg
@ 2010-09-07 22:27                         ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-07 22:27 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Pekka Enberg, Avi Kivity, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel


* Pekka Enberg <penberg@cs.helsinki.fi> wrote:

> Hi Ingo,
> 
> On 9/7/10 7:03 AM, Ingo Molnar wrote:
> >But i'd prefer C code really, as it's really 'abstract data' in the most
> >generic sense. That's why the trace filter engine started with a subset
> >of C.
> 
> I think it sounds better in principle than what it will be in
> practice. The OpenGL shadling language the same kind of model where
> you use an API call to upload C-like code that gets parsed. That of
> course has the unfortunate side-effect that compilation error
> reporting isn't all that user-friendly because you have to query for
> errors separately.

Not really. It's not a binary choice. The very same checking code can be 
used by tools and by the kernel too.

The kernel does the checking not because we want to do development of 
this code by using the kernel as an editor/compiler, but because we want 
to allow unprivileged tasks to pass in stuff, hence we must verify.

Error reporting by the kernel is a rare slowpath and should be pretty 
straightforward and minimalistic: to return the position of the parsing 
error.

> I think we've seen with ftrace vs. perf that it's easier to write
> rich, user-friendly interfaces in userspace than in kernel-space.
> 
> >>[...] You also probably don't want to put heavy-weight compiler
> >>optimization passes in the kernel so with an intermediate form, you
> >>can do much of that in user-space.
> >
> >The question of what can and cannot be done in the kernel is overrated.
> >We sure can put a C compiler into the kernel - 10 years down the line we
> >wont understand what the fuss was all about.
> 
> Yeah, I'm not saying we can't do that but it's a big chunk of code
> that can be potentially exploited.

The kernel is 10+ million lines of code that can potentially be 
exploited ...

> >>As for the intermediate form, you might want to take a look at Dalvik:
> >>
> >>http://www.netmite.com/android/mydroid/dalvik/docs/dalvik-bytecode.html
> >>
> >>and probably ParrotVM bytecode too. The thing to avoid is stack-based
> >>instructions like in Java bytecode because although it's easy to write
> >>interpreters for them, it makes JIT'ing harder (which needs to convert
> >>stack-based representation to register-based) and probably doesn't
> >>lend itself well to stack-constrained kernel code.
> >
> >_If_ we pass in any sort of machine code to the kernel (which bytecode
> >really is), then we should do the right thing and pass in raw x86
> >bytecode, and verify it in the kernel.
> >
> >That way the compiler can be kept out of the kernel, and performance of
> >the thing will be phenomenal from day 1 on.
> >
> >For non-x86 in most cases we can use a simple translator that runs
> >during the verification run - or of course they could have their own
> >native 'assembly bytecode' verifier and their user-space could compile
> >to those.
> 
> If you'd go for x86 as 'assembly bytecode' which ISA would you pick?
> 32-bit or 64-bit? I can see problems with both of them:

I'd use the native mode and would start with 64-bit.

>   - The register set that can be encoded with 32-bit ISA is very
>     limited which will force us to spill in memory.
> 
>   - The 64-bit ISA with REX prefixes is unnecessarily fat.
> 
>   - Instructions work directly on memory addresses which makes
>     verification harder
> 
>   - The 32-bit ABI uses stack for argument passing which forces us
>     to verify that operations on stack make sense.
> 
> OTOH, if the ABI is that you upload _native code_ on every
> architecture, then the trade-off makes more sense to me. The
> downside is that we'd need a separate verifier for each
> architecture, though.

Correct. I still prefer the C style variant tho.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  8:33                       ` Stefan Hajnoczi
  2010-09-07  9:13                         ` Avi Kivity
@ 2010-09-07 22:43                         ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-07 22:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel


* Stefan Hajnoczi <stefanha@gmail.com> wrote:

> >> Can you point me to any research?
> >
> > Nope, havent seen this 'safe native x86 bytecode' idea 
> > mentioned/researched anywhere yet.
> 
> Native Client: A Sandbox for Portable, Untrusted x86 Native Code, IEEE 
> Symposium on Security and Privacy, May 2009 
> http://nativeclient.googlecode.com/svn/data/docs_tarball/nacl/googleclient/native_client/documentation/nacl_paper.pdf
> 
> The "Inner Sandbox" they talk about verifies a subset of x86 code. For 
> indirect control flow (computed jumps), they introduce a new 
> instruction that can do run-time checking of the destination address.
> 
> IIRC they have a patched gcc toolchain that can compile to this subset 
> of x86.

Btw., the first time i mentioned this idea publicly was in early 2006, 3 
years before the above 2009 paper, in a CONFIG_SECCOMP discussion on 
cpushare-discuss.

I've attached a few of those emails below, which outlines the idea.

Thanks,

	Ingo

----- Forwarded message from Ingo Molnar <mingo@elte.hu> -----

Date: Tue, 10 Jan 2006 12:52:04 +0100
From: Ingo Molnar <mingo@elte.hu>
To: Andi Kleen <ak@suse.de>
Cc: Andrea Arcangeli <andrea@cpushare.com>,
	Ed Suominen <general@eepatents.com>,
	Linus Torvalds <torvalds@osdl.org>, cpushare-discuss@cpushare.com,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [patch] make CONFIG_SECCOMP default=n


* Andi Kleen <ak@suse.de> wrote:

> The beauty of using seccomp for the special case of data 
> transformation in a pipe is that it is very simple and likely quite 
> secure and it looks actually practical to me.

well a more generic method could be _more_ practical and still as safe: 
enable bytecode to be uploaded into the kernel, and allow kernel 
components to rely on them. [Add some trivial timeout mechanism to 
detect infinite loops, and abort such instances safely and disable that 
code from that point on]. Seccomp would be just one user of such a 
mechanism.

that 'bytecode' could be "limited x86 code, verified by the kernel at 
upload time, and executed natively afterwards". E.g. pure arithmetical 
code with relative jumps into kernel-validated instruction boundaries 
within that byte code would be an obvious correct first step.

even memory ops could be allowed, as long as the kernel's bytecode 
loading mechanism can automatically prove it's safe: e.g. only stack ops 
are allowed, and the stack segment is limited into a 
per-bytecode-instance small and safe memory range.

yes, the kernel would have to do some (rather simple) disassembly at 
load time to validate things, but that's not a big issue, as it only 
happens once, and is only as complex as complex we allow it to become.

vioala: complex network-filtering decisions done straight in interrupt 
context, defined by the user, compiled into native x86 code and uploaded 
into the kernel.

you could also attach such byte code between pipes, achieving much of 
the seccomp model. At a better performance: no context-switching to the 
'safe seccomp context' is needed.

it could also become _safer_ than seccomp: seccomp does not protect 
against hardware/CPU level attacks, while an in-kernel bytecode loader 
could/would restrict the instruction stream too. E.g. the f00f lockup 
could not be triggered, because the loader does not allow LOCK-ed memory 
ops for example.

so i really think SECCOMP is pretty ad-hoc, poorly thought out and 
apparently not that hot with application writers.

	Ingo

----- Forwarded message from Ingo Molnar <mingo@elte.hu> -----

Date: Tue, 10 Jan 2006 13:35:55 +0100
From: Ingo Molnar <mingo@elte.hu>
To: Andrea Arcangeli <andrea@cpushare.com>
Cc: Andi Kleen <ak@suse.de>, Ed Suominen <general@eepatents.com>,
	Linus Torvalds <torvalds@osdl.org>, cpushare-discuss@cpushare.com,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [patch] make CONFIG_SECCOMP default=n


* Andrea Arcangeli <andrea@cpushare.com> wrote:

> > the seccomp model. At a better performance: no context-switching to the
> > 'safe seccomp context' is needed.
> 
> No need of context switching, I already said you can safely attach shm 
> to do the inter process communication, as far as I can tell, you can 
> mmap hard the framebuffer where to decompress the jpeg in the seccomp 
> task and use mmap to get the data in, zerocopy (modulo decompression 
> costs).

you still need to context-switch to the seccomp task (and away from it)! 
With the 'bytecode in the kernel' approach the bytecode could be run via
a syscall, which is an order of magnitude faster than a context-switch.

> > it could also become _safer_ than seccomp: seccomp does not protect 
> > against hardware/CPU level attacks, while an in-kernel bytecode loader 
> > could/would restrict the instruction stream too. E.g. the f00f lockup 
> > could not be triggered, because the loader does not allow LOCK-ed memory 
> > ops for example.
> 
> The same filtering can be done much more simply in userland before 
> firing up the untrusted bytecode, so that can't be more secure, it can 
> only be more complicated and less secure because of the ring 0.

sure, you could do the same filtering in userspace, but the current 
seccomp model does not do filtering. Also, via in-kernel bytecode we 
could embedd user-defined functionality at almost arbitrary places in 
the kernel. Think 'user-defined plugins' for the kernel. Yes, since it 
runs at ring 0 it _has_ to have filtering, mandatorily, but once done, 
it can do much more than seccomp.

> Furthermore in the decompression case, there's no need of filtering 
> the bytecode, the bytecode is trusted (but perhaps you mean to use 
> this new mechanism like kprobes to load seccomp into the kernel, still 
> it's unclear how can you run sys_read/sys_write that way if you said 
> it has to be pure arithmetical bytecode).

details :-) 'x86 bytecode' could include a placeholder for a callout to 
some kernel function. Also, the results could be defined on the safe 
stack as well.

> I see the point of doing the packet filtering decision in irq context, 
> something that cannot be done in userland easily, but that's a very 
> different problem than the one I was trying to address with seccomp. I 
> wasn't even dreaming of executing untrusted bytecode in kernel mode. I 
> would never do that in ring 0. It has to be ring 3 and in the future 
> guest ring 3.

well, lets go step by step. You would trust a trivial untrusted bytecode 
in the kernel, if it was defined as:

	up to 16 instructions of NOP.

correct? It doesnt do anything, but is a first step, and you'd trust it 
even on ring 0, right?

Then, lets extend this a little bit with trivial linear arithmetic ops 
[no divisions or multiplications for now] done to %eax and %ebx:

	movl %eax, %ebx
	addl %eax, %ebx

at most 100 instructions, no jumps allowed, and the bytecode interpreter 
running this code will saves/restores eax/ebx. We can still trust it, 
even on ring 0, and it's provably correct, right?

using similar steps, we can build a pretty usable virtual machine out of 
trivial x86 ops that are 'obviously correct' and easily provable.

branches and jumps need a little bit of care from the validator: they 
may only be relative, non-indirect and may only point to a validated 
instruction. [i.e. no jumping back to in 'between' two instructions, and 
no "jmp (%eax)", etc.  A timeout mechanism [e.g. driven from the timer 
interrupt] ensures that no bytecode can ever run longer than a 
pre-specified amount of time, and if it does, it's disabled and the 
admin is notified.]

again, our pick of instructions was opt-in all along, and the result is 
obviously safe and provable, even though it runs at ring 0, correct?

so if you walk this thought-experiment a bit, you'll quickly arrive to a 
virtual machine that is actually pretty useful already, and is fully 
provable. You should not dismiss this as "I dont trust it because it's 
at ring 0", unless you can show some fatal flaw in my thinking.

> > so i really think SECCOMP is pretty ad-hoc, poorly thought out and 
> > apparently not that hot with application writers.
> 
> I don't think it has received enough attention, your code injection 
> that can't execute syscalls will have the same issues as seccomp as 
> far as application writers are concerned.  [...]

correct. But there will be one crutial difference: it allows untrusted 
code to be run at ring 0! _That_ makes a performance (and feature) 
difference that some application writers might go the trouble of APIs 
for! The possibilities are quite interesting:

- e.g. a webserver protocol stack in the kernel. (Tux done right)

- webserver dynamic pages generated from the kernel.

- complex DoS avoidance filters natively executed in the kernel.

- filesystem plugins executed in kernel-space

- complex security decisions done at native speed. (ok, selinux has a
  pretty good language for this already, which it interprets runtime.)

_that_ is something application writers (or kernel coders) might get 
excited about.

but more importantly, such an approach could generally ease some of the 
"how much functionality should go into the kernel" pressure.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-07  3:44                     ` Ingo Molnar
  2010-09-07  8:33                       ` Stefan Hajnoczi
  2010-09-07 15:55                       ` Alan Cox
@ 2010-09-08  1:44                       ` Paul Mackerras
  2010-09-08  6:16                         ` Pekka Enberg
  2010-09-08  6:19                         ` Avi Kivity
  2 siblings, 2 replies; 53+ messages in thread
From: Paul Mackerras @ 2010-09-08  1:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

On Tue, Sep 07, 2010 at 05:44:17AM +0200, Ingo Molnar wrote:

> No, it's machine code. It's 'safe x86 bytecode executed natively by the 
> kernel as a function'.

I'm curious - how are you going to prove that addresses for the MOV
instruction are safe, using just a static pre-execution scan of the
code?  If you only allow things that you can prove are safe, then
either you'll have a very large and complex verifier, or you'll end up
disallowing interesting and useful cases.

The other alternative is to do runtime verification of addressses,
which is much more straightforward (and it's also much easier to prove
it's secure).  But it does mean that you can't run the code as-is.

> It needs a verification pass (because the code can come from untrusted 
> apps) so that we can copy, verify and trust it (so obviously it's not 
> _arbitrary_ x86 machine code - a safe subset of x86) - maybe with a sha1 
> based cache for already-verified snippets (or a fast verifier).
> 
> > x86 is quite unpleasant.
> 
> Any machine code that is fast and compact is unpleasant almost by 
> definition: it's a rather non-obvious Huffman encoding embedded in an 
> instruction architecture.

>From the point of view of having to emulate or translate x86 code (as
we would have to do on powerpc), it's not the compactness that's
unpleasant, it's things like not being able to tell whether the
condition codes set by an instruction are ever going to be used.  Many
x86 instructions set the condition codes but for most instructions the
condition codes that are set are never used.  So when emulating or
translating, you either waste time computing condition code values
that are never used, or you have to do complex lifetime analysis to
work out which instructions do need to compute the condition codes.

> We start with trivial (and useless) special case of something like:
> 
> #define MAX_BYTECODE_SIZE 256
> 
> int x86_bytecode_verify(char *opcodes, unsigned int len)
> {
> 
> 	if (len-1 > MAX_BYTECODE_SIZE-1)
> 		return -EINVAL;
> 
> 	if (opcodes[0] != 0xc3) /* RET instruction */
> 		return -EINVAL;
> 
> 	return 0;
> }
> 
> ... and then we add checks for accepted/safe x86 patterns of 
> instructions step by step - always keeping it 100% correct.

So... I would be interested to see you add the case for the MOV
instruction. :)

Paul.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-08  1:44                       ` Paul Mackerras
@ 2010-09-08  6:16                         ` Pekka Enberg
  2010-09-08  6:44                           ` Ingo Molnar
  2010-09-08  6:19                         ` Avi Kivity
  1 sibling, 1 reply; 53+ messages in thread
From: Pekka Enberg @ 2010-09-08  6:16 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

2010/9/8 Paul Mackerras <paulus@samba.org>:
>> We start with trivial (and useless) special case of something like:
>>
>> #define MAX_BYTECODE_SIZE 256
>>
>> int x86_bytecode_verify(char *opcodes, unsigned int len)
>> {
>>
>>       if (len-1 > MAX_BYTECODE_SIZE-1)
>>               return -EINVAL;
>>
>>       if (opcodes[0] != 0xc3) /* RET instruction */
>>               return -EINVAL;
>>
>>       return 0;
>> }
>>
>> ... and then we add checks for accepted/safe x86 patterns of
>> instructions step by step - always keeping it 100% correct.
>
> So... I would be interested to see you add the case for the MOV
> instruction. :)

Heh, which one of them - there are tons of variants under 'mov' on
x86? On a more serious note: the biggest problem is that you need to
do verification during execution because you don't know the exact
address until then for most addressing modes that use registers.

                        Pekka

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-08  1:44                       ` Paul Mackerras
  2010-09-08  6:16                         ` Pekka Enberg
@ 2010-09-08  6:19                         ` Avi Kivity
  1 sibling, 0 replies; 53+ messages in thread
From: Avi Kivity @ 2010-09-08  6:19 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel

  On 09/08/2010 04:44 AM, Paul Mackerras wrote:
>
> So... I would be interested to see you add the case for the MOV
> instruction. :)

x86 32-bit segmentation.

Unfortunately those are limited to i386.  With some care we could use 
them on x86_64 (temporarily switch the address space to include the 
"bytecode" and its data at the lower 4G, execute a far call to the 
bytecode, etc.

It's cumbersome though, and will make context switches to the bytecode 
quite slow.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-08  6:16                         ` Pekka Enberg
@ 2010-09-08  6:44                           ` Ingo Molnar
  2010-09-08  7:30                             ` Peter Zijlstra
  2010-09-08 19:30                             ` Frank Ch. Eigler
  0 siblings, 2 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-08  6:44 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Paul Mackerras, Avi Kivity, Pekka Enberg, Tom Zanussi,
	Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, Peter Zijlstra, linux-perf-users,
	linux-kernel


* Pekka Enberg <penberg@kernel.org> wrote:

> 2010/9/8 Paul Mackerras <paulus@samba.org>:
> >> We start with trivial (and useless) special case of something like:
> >>
> >> #define MAX_BYTECODE_SIZE 256
> >>
> >> int x86_bytecode_verify(char *opcodes, unsigned int len)
> >> {
> >>
> >>       if (len-1 > MAX_BYTECODE_SIZE-1)
> >>               return -EINVAL;
> >>
> >>       if (opcodes[0] != 0xc3) /* RET instruction */
> >>               return -EINVAL;
> >>
> >>       return 0;
> >> }
> >>
> >> ... and then we add checks for accepted/safe x86 patterns of
> >> instructions step by step - always keeping it 100% correct.
> >
> > So... I would be interested to see you add the case for the MOV
> > instruction. :)
> 
> Heh, which one of them - there are tons of variants under 'mov' on 
> x86? On a more serious note: the biggest problem is that you need to 
> do verification during execution because you don't know the exact 
> address until then for most addressing modes that use registers.

Well, the main model and usecase would be to provide some sort of 
function (in the mathematical sense), which is dependent on fixed-size 
input and stores a fixed-size output somewhere.

For that category to restrict both the input and output space initially, 
wouldnt be a big restriction. I.e. dont allow register addresses, only 
stack addresses or fixed addresses to the user-space parameter space.

These functions are supposed to be short and generally they dont change 
state (they dont have access to locking in any case - we want to be able 
to call them from atomic contexts, etc.).

Thus instructions like 'mov (%rax), ...' would be handled by not 
allowing them, only addresses which do not change from execution state.

That still gives plenty of flexibility to implement filters or other 
kinds of input/output calculation/netfilter-rule kind of logic.

Do you know some interesting usecase that would be excluded via such 
address restrictions? Things like flexible arrays or more complex data 
structures such as trees couldnt be used initially. More complex data 
structures would need locking in any case.

I.e. it would initially be restricted roughly to code where halting can 
be proven. Still looks interesting to me: most netfilter policy rules, 
trace filters or selinux rules could be implemented that way - an order 
of magnitude or two faster than the ad-hoc tracing, avc or iptables rule 
engines ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-08  6:44                           ` Ingo Molnar
@ 2010-09-08  7:30                             ` Peter Zijlstra
  2010-09-08 19:30                             ` Frank Ch. Eigler
  1 sibling, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2010-09-08  7:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Paul Mackerras, Avi Kivity, Pekka Enberg,
	Tom Zanussi, Frédéric Weisbecker, Steven Rostedt,
	Arnaldo Carvalho de Melo, linux-perf-users, linux-kernel

On Wed, 2010-09-08 at 08:44 +0200, Ingo Molnar wrote:
> Thus instructions like 'mov (%rax), ...' would be handled by not 
> allowing them, only addresses which do not change from execution
> state.

> Do you know some interesting usecase that would be excluded via such 
> address restrictions? 

my_array[i] = foo;

usually ends up wanting to do something with a register offset.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-08  6:44                           ` Ingo Molnar
  2010-09-08  7:30                             ` Peter Zijlstra
@ 2010-09-08 19:30                             ` Frank Ch. Eigler
  2010-09-09  7:38                               ` Ingo Molnar
  1 sibling, 1 reply; 53+ messages in thread
From: Frank Ch. Eigler @ 2010-09-08 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Paul Mackerras, Avi Kivity, Pekka Enberg,
	Tom Zanussi, =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


mingo wrote:

> [...] Do you know some interesting usecase that would be excluded
> via such address restrictions?

You could study tracing-oriented runtime systems such as those of
dtrace, dprobes, or even systemtap, to help answer such questions.


> Things like flexible arrays or more complex data structures such as
> trees couldnt be used initially. More complex data structures would
> need locking in any case. [...]  I.e. it would initially be
> restricted roughly to code where halting can be proven. Still looks
> interesting to me [...]

There are years of experience out there to learn from.  Note the sorts
of filtering expressions (never mind control logic) used in the prior
art, so that you can know what you're about to sacrifice by a
simple-sounding implementation.


- FChE

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-08 19:30                             ` Frank Ch. Eigler
@ 2010-09-09  7:38                               ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-09  7:38 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Pekka Enberg, Paul Mackerras, Avi Kivity, Pekka Enberg,
	Tom Zanussi, =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


* Frank Ch. Eigler <fche@redhat.com> wrote:

> mingo wrote:

 - slightly disparaging author quotation style: check

> > [...] Do you know some interesting usecase that would be excluded
> > via such address restrictions?
> 
> You could study tracing-oriented runtime systems such as those of
> dtrace, dprobes, or even systemtap, to help answer such questions.

 - slightly superior sounding style of discussion: check

> > Things like flexible arrays or more complex data structures such as
> > trees couldnt be used initially. More complex data structures would
> > need locking in any case. [...]  I.e. it would initially be
> > restricted roughly to code where halting can be proven. Still looks
> > interesting to me [...]
> 
> There are years of experience out there to learn from.  Note the sorts 
> of filtering expressions (never mind control logic) used in the prior 
> art, so that you can know what you're about to sacrifice by a 
> simple-sounding implementation.

 - offhanded, patronizing tone: check

 - widespread lack of specifics: check

 - dissing others while promoting systemtap: check

... so i dont really need to look at the 'From' field to see that this 
reply can only have come from 'fche'!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-06 15:47                 ` Ingo Molnar
                                     ` (2 preceding siblings ...)
  2010-09-07 13:35                   ` Steven Rostedt
@ 2010-09-12  6:46                   ` Pavel Machek
  2010-09-12 17:54                     ` Avi Kivity
  3 siblings, 1 reply; 53+ messages in thread
From: Pavel Machek @ 2010-09-12  6:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel

Hi!

> > >>Is this a roundabout way of saying "jit"?
> > >Partly. I'm not sure we want to actually upload programs in bytecode
> > >form. ASCII is just fine - just like a .gz Javascript is fine for web
> > >apps. (and in most cases compresses down better than the bytecode
> > >equivalent)
> > >
> > >So a clear language (the simpler initially the better) plus an in-kernel
> > >compiler.
> > >
> > >This could be used for far more than just instrumentation: IMO security
> > >policies could be expressed in such a way. (Simplified, they are quite
> > >similar to filters installed on syscall entry/exit, with the ability of
> > >the filter to influence whether the syscall is performed.)
> > 
> > For me the requirements are:
> > - turing complete (more than just filters)
> 
> Yep. Filters are obviously just basically expressions.
> 
> Conditions and variables can be added. Maybe loops too in simpler forms 
> - as long as we can prove halting - or maybe with a runtime abort 
> mechanism.
> 
> > - easy interface to kernel APIs (like hrtimers)
> > - safe to use by untrusted users
> 
> Yep.
> 
> > The actual language doesn't really matter.
> 
> There are 3 basic categories:
> 
>  1- Most (least abstract) specific code: a block of bytecode in the form 
>     of a simplified, executable, kernel-checked x86 machine code block - 
>     this is also the fastest form. [yes, this is actually possible.]

Well... if we want to be a bit x86-entric.... can we just reuse ACPI
interpretter?

Plus, TOMOYO actually has a language inside... AppArmor actually has
something, too, but iirc it is only as powerful as regexps.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-12  6:46                   ` Pavel Machek
@ 2010-09-12 17:54                     ` Avi Kivity
  2010-09-12 18:48                       ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Avi Kivity @ 2010-09-12 17:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel

  On 09/12/2010 08:46 AM, Pavel Machek wrote:
>   1- Most (least abstract) specific code: a block of bytecode in the form
>      of a simplified, executable, kernel-checked x86 machine code block -
>      this is also the fastest form. [yes, this is actually possible.]
> Well... if we want to be a bit x86-entric.... can we just reuse ACPI
> interpretter?

I hope this was a joke, ACPI won the academy awards for ugliness, 
slowness, low performance, bad specification, non-generality, and 
probably five other things I forgot.  Stay away from it as much as you can.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-12 17:54                     ` Avi Kivity
@ 2010-09-12 18:48                       ` Ingo Molnar
  2010-09-12 19:14                         ` Pavel Machek
  0 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2010-09-12 18:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pavel Machek, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


* Avi Kivity <avi@redhat.com> wrote:

>  On 09/12/2010 08:46 AM, Pavel Machek wrote:
> >  1- Most (least abstract) specific code: a block of bytecode in the form
> >     of a simplified, executable, kernel-checked x86 machine code block -
> >     this is also the fastest form. [yes, this is actually possible.]
> >Well... if we want to be a bit x86-entric.... can we just reuse ACPI
> >interpretter?
> 
> I hope this was a joke, ACPI won the academy awards for ugliness, 
> slowness, low performance, bad specification, non-generality, and 
> probably five other things I forgot.  Stay away from it as much as you 
> can.

It also combines the worst of the two worlds: it's the most specific 
type of code (almost like assembly), but has a very slow interpreter.

With 'x86 bytecode' the main (and pretty much only) point is to be able 
to execute the code as-is, once checked.

But, as i explained it before, i only consider it a theoretical 
possibility and i think that abstract code (such as ASCII text C source 
code) is a better solution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-12 18:48                       ` Ingo Molnar
@ 2010-09-12 19:14                         ` Pavel Machek
  2010-09-12 20:32                           ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Pavel Machek @ 2010-09-12 19:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel

On Sun 2010-09-12 20:48:43, Ingo Molnar wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> >  On 09/12/2010 08:46 AM, Pavel Machek wrote:
> > >  1- Most (least abstract) specific code: a block of bytecode in the form
> > >     of a simplified, executable, kernel-checked x86 machine code block -
> > >     this is also the fastest form. [yes, this is actually possible.]
> > >Well... if we want to be a bit x86-entric.... can we just reuse ACPI
> > >interpretter?
> > 
> > I hope this was a joke, ACPI won the academy awards for ugliness, 
> > ..., bad specification, non-generality, and 

As did i386 instruction set :-).

> It also combines the worst of the two worlds: it's the most specific 
> type of code (almost like assembly), but has a very slow interpreter.
> 
> With 'x86 bytecode' the main (and pretty much only) point is to be able 
> to execute the code as-is, once checked.
> 
> But, as i explained it before, i only consider it a theoretical 
> possibility and i think that abstract code (such as ASCII text C source 
> code) is a better solution.

Compiler in kernel?

I'm not sure I like that one. Yes, ACPI interpreter is slow, but it is
simple, already in kernel, and there are already tools to compile into
it...

While it may be ugly, I believe it is better than either i386
verifier, compiler in the kernel, or yet another interpretter...

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-12 19:14                         ` Pavel Machek
@ 2010-09-12 20:32                           ` Ingo Molnar
  2010-09-12 21:06                             ` Pavel Machek
  0 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2010-09-12 20:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


* Pavel Machek <pavel@ucw.cz> wrote:

> On Sun 2010-09-12 20:48:43, Ingo Molnar wrote:
> > 
> > * Avi Kivity <avi@redhat.com> wrote:
> > 
> > >  On 09/12/2010 08:46 AM, Pavel Machek wrote:
> > > >  1- Most (least abstract) specific code: a block of bytecode in the form
> > > >     of a simplified, executable, kernel-checked x86 machine code block -
> > > >     this is also the fastest form. [yes, this is actually possible.]
> > > >Well... if we want to be a bit x86-entric.... can we just reuse ACPI
> > > >interpretter?
> > > 
> > > I hope this was a joke, ACPI won the academy awards for ugliness, 
> > > ..., bad specification, non-generality, and 
> 
> As did i386 instruction set :-).

Are you kidding? The i386 instruction set may be ugly, but it's 
implemented in hardware, which has obvious upsides.

The ACPI AML code is just plain ugly.

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-12 20:32                           ` Ingo Molnar
@ 2010-09-12 21:06                             ` Pavel Machek
  2010-09-12 22:19                               ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Pavel Machek @ 2010-09-12 21:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel

On Sun 2010-09-12 22:32:58, Ingo Molnar wrote:
> * Pavel Machek <pavel@ucw.cz> wrote:
> > On Sun 2010-09-12 20:48:43, Ingo Molnar wrote:

> > > > >  1- Most (least abstract) specific code: a block of bytecode in the form
> > > > >     of a simplified, executable, kernel-checked x86 machine code block -
> > > > >     this is also the fastest form. [yes, this is actually possible.]
> > > > >Well... if we want to be a bit x86-entric.... can we just reuse ACPI
> > > > >interpretter?
> > > > 
> > > > I hope this was a joke, ACPI won the academy awards for ugliness, 
> > > > ..., bad specification, non-generality, and 
> > 
> > As did i386 instruction set :-).
> 
> Are you kidding? The i386 instruction set may be ugly, but it's 
> implemented in hardware, which has obvious upsides.

I was partly joking. But you have to admit that i386 set is ugly and
badly specified.

And yes, it is implemented in hardware on _i386_, which, true, has
some advantages on i386.(And what about x86-64, btw? Would same
bytecode run natively in both 32 and 64 bit mode?)

And... we'd have to maintain "is this bytecode safe" checker for i386,
and normal emulator for all other architectures.

> The ACPI AML code is just plain ugly.

Yep, but we already have interpreter in the kernel... How many
interpreters is too many? Does it really matter that AML is "ugly"?

You propose adding checker of similary ugly bytecode, and then
interpreter of the same bytecode.

So... let's drop the "use i386 instructions as bytecode idea", ok?

And now... is maintaining ugly interpreter and non-ugly interpreter
preferable to maintaining just the ugly interpreter? Maybe, if it has
significant speed advantages. But does it? What bytecode do you
propose?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: disabling group leader perf_event
  2010-09-12 21:06                             ` Pavel Machek
@ 2010-09-12 22:19                               ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2010-09-12 22:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Avi Kivity, Pekka Enberg, Tom Zanussi, Fr?d?ric Weisbecker,
	Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-perf-users, linux-kernel


* Pavel Machek <pavel@ucw.cz> wrote:

> On Sun 2010-09-12 22:32:58, Ingo Molnar wrote:
> > * Pavel Machek <pavel@ucw.cz> wrote:
> > > On Sun 2010-09-12 20:48:43, Ingo Molnar wrote:
> 
> > > > > >  1- Most (least abstract) specific code: a block of bytecode in the form
> > > > > >     of a simplified, executable, kernel-checked x86 machine code block -
> > > > > >     this is also the fastest form. [yes, this is actually possible.]
> > > > > >Well... if we want to be a bit x86-entric.... can we just reuse ACPI
> > > > > >interpretter?
> > > > > 
> > > > > I hope this was a joke, ACPI won the academy awards for ugliness, 
> > > > > ..., bad specification, non-generality, and 
> > > 
> > > As did i386 instruction set :-).
> > 
> > Are you kidding? The i386 instruction set may be ugly, but it's 
> > implemented in hardware, which has obvious upsides.
> 
> I was partly joking. But you have to admit that i386 set is ugly and 
> badly specified.

Fortunately since i already said that it's ugly, in the sentence you 
quote above, i dont have to 'admit' it yet again, right? ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2010-09-12 22:19 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-06  9:12 disabling group leader perf_event Avi Kivity
2010-09-06 11:24 ` Peter Zijlstra
2010-09-06 11:34   ` Avi Kivity
2010-09-06 11:54     ` Peter Zijlstra
2010-09-06 11:58       ` Avi Kivity
2010-09-06 12:29         ` Peter Zijlstra
2010-09-06 12:40           ` Ingo Molnar
2010-09-06 13:16             ` Steven Rostedt
2010-09-06 16:42               ` Tom Zanussi
2010-09-07 12:53                 ` Steven Rostedt
2010-09-07 14:16                   ` Tom Zanussi
2010-09-06 12:49           ` Avi Kivity
2010-09-06 12:43         ` Ingo Molnar
2010-09-06 12:45           ` Avi Kivity
2010-09-06 12:59             ` Ingo Molnar
2010-09-06 13:41               ` Pekka Enberg
2010-09-06 13:54                 ` Ingo Molnar
2010-09-06 14:57               ` Avi Kivity
2010-09-06 15:30                 ` Alan Cox
2010-09-06 15:20                   ` Avi Kivity
2010-09-06 15:48                     ` Alan Cox
2010-09-06 17:50                       ` Avi Kivity
2010-09-06 15:47                 ` Ingo Molnar
2010-09-06 17:55                   ` Avi Kivity
2010-09-07  3:44                     ` Ingo Molnar
2010-09-07  8:33                       ` Stefan Hajnoczi
2010-09-07  9:13                         ` Avi Kivity
2010-09-07 22:43                         ` Ingo Molnar
2010-09-07 15:55                       ` Alan Cox
2010-09-08  1:44                       ` Paul Mackerras
2010-09-08  6:16                         ` Pekka Enberg
2010-09-08  6:44                           ` Ingo Molnar
2010-09-08  7:30                             ` Peter Zijlstra
2010-09-08 19:30                             ` Frank Ch. Eigler
2010-09-09  7:38                               ` Ingo Molnar
2010-09-08  6:19                         ` Avi Kivity
2010-09-06 20:31                   ` Pekka Enberg
2010-09-06 20:37                     ` Pekka Enberg
2010-09-07  4:03                     ` Ingo Molnar
2010-09-07  9:30                       ` Pekka Enberg
2010-09-07 22:27                         ` Ingo Molnar
2010-09-07 10:57                     ` KOSAKI Motohiro
2010-09-07 12:14                       ` Pekka Enberg
2010-09-07 13:35                   ` Steven Rostedt
2010-09-07 13:47                     ` Avi Kivity
2010-09-07 16:02                       ` Steven Rostedt
2010-09-12  6:46                   ` Pavel Machek
2010-09-12 17:54                     ` Avi Kivity
2010-09-12 18:48                       ` Ingo Molnar
2010-09-12 19:14                         ` Pavel Machek
2010-09-12 20:32                           ` Ingo Molnar
2010-09-12 21:06                             ` Pavel Machek
2010-09-12 22:19                               ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.