Date: Wed, 8 Sep 2010 00:43:01 +0200
From: Ingo Molnar
To: Stefan Hajnoczi
Cc: Avi Kivity, Pekka Enberg, Tom Zanussi, Frédéric Weisbecker,
    Steven Rostedt, Arnaldo Carvalho de Melo, Peter Zijlstra,
    linux-perf-users@vger.kernel.org, linux-kernel
Subject: Re: disabling group leader perf_event
Message-ID: <20100907224301.GC11605@elte.hu>

* Stefan Hajnoczi wrote:

> >> Can you point me to any research?
> >
> > Nope, havent seen this 'safe native x86 bytecode' idea
> > mentioned/researched anywhere yet.
>
> Native Client: A Sandbox for Portable, Untrusted x86 Native Code, IEEE
> Symposium on Security and Privacy, May 2009
> http://nativeclient.googlecode.com/svn/data/docs_tarball/nacl/googleclient/native_client/documentation/nacl_paper.pdf
>
> The "Inner Sandbox" they talk about verifies a subset of x86 code. For
> indirect control flow (computed jumps), they introduce a new
> instruction that can do run-time checking of the destination address.
>
> IIRC they have a patched gcc toolchain that can compile to this subset
> of x86.

Btw., the first time i mentioned this idea publicly was in early 2006,
3 years before the above 2009 paper, in a CONFIG_SECCOMP discussion on
cpushare-discuss. I've attached a few of those emails below, which
outline the idea.

Thanks,

	Ingo

----- Forwarded message from Ingo Molnar -----

Date: Tue, 10 Jan 2006 12:52:04 +0100
From: Ingo Molnar
To: Andi Kleen
Cc: Andrea Arcangeli, Ed Suominen, Linus Torvalds,
    cpushare-discuss@cpushare.com, Christoph Hellwig
Subject: Re: [patch] make CONFIG_SECCOMP default=n

* Andi Kleen wrote:

> The beauty of using seccomp for the special case of data
> transformation in a pipe is that it is very simple and likely quite
> secure and it looks actually practical to me.

well a more generic method could be _more_ practical and still as safe:
enable bytecode to be uploaded into the kernel, and allow kernel
components to rely on it. [Add some trivial timeout mechanism to detect
infinite loops, and abort such instances safely and disable that code
from that point on]. Seccomp would be just one user of such a mechanism.

that 'bytecode' could be "limited x86 code, verified by the kernel at
upload time, and executed natively afterwards". E.g. pure arithmetical
code with relative jumps into kernel-validated instruction boundaries
within that byte code would be an obvious correct first step.

even memory ops could be allowed, as long as the kernel's bytecode
loading mechanism can automatically prove it's safe: e.g.
only stack ops are allowed, and the stack segment is limited to a
per-bytecode-instance small and safe memory range.

yes, the kernel would have to do some (rather simple) disassembly at
load time to validate things, but that's not a big issue, as it only
happens once, and is only as complex as we allow it to become.

voila: complex network-filtering decisions done straight in interrupt
context, defined by the user, compiled into native x86 code and uploaded
into the kernel.

you could also attach such byte code between pipes, achieving much of
the seccomp model. With better performance: no context-switching to the
'safe seccomp context' is needed.

it could also become _safer_ than seccomp: seccomp does not protect
against hardware/CPU level attacks, while an in-kernel bytecode loader
could/would restrict the instruction stream too. E.g. the f00f lockup
could not be triggered, because the loader does not allow LOCK-ed memory
ops for example.

so i really think SECCOMP is pretty ad-hoc, poorly thought out and
apparently not that hot with application writers.

	Ingo

----- Forwarded message from Ingo Molnar -----

Date: Tue, 10 Jan 2006 13:35:55 +0100
From: Ingo Molnar
To: Andrea Arcangeli
Cc: Andi Kleen, Ed Suominen, Linus Torvalds,
    cpushare-discuss@cpushare.com, Christoph Hellwig
Subject: Re: [patch] make CONFIG_SECCOMP default=n

* Andrea Arcangeli wrote:

> > the seccomp model. At a better performance: no context-switching to the
> > 'safe seccomp context' is needed.
>
> No need of context switching, I already said you can safely attach shm
> to do the inter process communication, as far as I can tell, you can
> mmap hard the framebuffer where to decompress the jpeg in the seccomp
> task and use mmap to get the data in, zerocopy (modulo decompression
> costs).

you still need to context-switch to the seccomp task (and away from it)!
With the 'bytecode in the kernel' approach the bytecode could be run via
a syscall, which is an order of magnitude faster than a context-switch.

> > it could also become _safer_ than seccomp: seccomp does not protect
> > against hardware/CPU level attacks, while an in-kernel bytecode loader
> > could/would restrict the instruction stream too. E.g. the f00f lockup
> > could not be triggered, because the loader does not allow LOCK-ed memory
> > ops for example.
>
> The same filtering can be done much more simply in userland before
> firing up the untrusted bytecode, so that can't be more secure, it can
> only be more complicated and less secure because of the ring 0.

sure, you could do the same filtering in userspace, but the current
seccomp model does not do filtering. Also, via in-kernel bytecode we
could embed user-defined functionality at almost arbitrary places in
the kernel. Think 'user-defined plugins' for the kernel. Yes, since it
runs at ring 0 it _has_ to have filtering, mandatorily, but once done,
it can do much more than seccomp.

> Furthermore in the decompression case, there's no need of filtering
> the bytecode, the bytecode is trusted (but perhaps you mean to use
> this new mechanism like kprobes to load seccomp into the kernel, still
> it's unclear how can you run sys_read/sys_write that way if you said
> it has to be pure arithmetical bytecode).

details :-) 'x86 bytecode' could include a placeholder for a callout to
some kernel function. Also, the results could be defined on the safe
stack as well.

> I see the point of doing the packet filtering decision in irq context,
> something that cannot be done in userland easily, but that's a very
> different problem than the one I was trying to address with seccomp. I
> wasn't even dreaming of executing untrusted bytecode in kernel mode. I
> would never do that in ring 0. It has to be ring 3 and in the future
> guest ring 3.

well, let's go step by step.
You would trust a trivial untrusted bytecode in the kernel, if it was
defined as: up to 16 instructions of NOP. correct? It doesn't do
anything, but is a first step, and you'd trust it even on ring 0, right?

Then, let's extend this a little bit with trivial linear arithmetic ops
[no divisions or multiplications for now] done to %eax and %ebx:

	movl %eax, %ebx
	addl %eax, %ebx

at most 100 instructions, no jumps allowed, and the bytecode interpreter
running this code saves/restores eax/ebx. We can still trust it, even on
ring 0, and it's provably correct, right?

using similar steps, we can build a pretty usable virtual machine out of
trivial x86 ops that are 'obviously correct' and easily provable.

branches and jumps need a little bit of care from the validator: they
may only be relative, non-indirect and may only point to a validated
instruction. [i.e. no jumping in 'between' two instructions, and no
"jmp (%eax)", etc.] A timeout mechanism [e.g. driven from the timer
interrupt] ensures that no bytecode can ever run longer than a
pre-specified amount of time, and if it does, it's disabled and the
admin is notified.

again, our pick of instructions was opt-in all along, and the result is
obviously safe and provable, even though it runs at ring 0, correct?

so if you walk this thought-experiment a bit, you'll quickly arrive at
a virtual machine that is actually pretty useful already, and is fully
provable. You should not dismiss this as "I don't trust it because it's
at ring 0", unless you can show some fatal flaw in my thinking.

> > so i really think SECCOMP is pretty ad-hoc, poorly thought out and
> > apparently not that hot with application writers.
>
> I don't think it has received enough attention, your code injection
> that can't execute syscalls will have the same issues as seccomp as
> far as application writers are concerned. [...]

correct. But there will be one crucial difference: it allows untrusted
code to be run at ring 0!
_That_ makes a performance (and feature) difference that some
application writers might go to the trouble of new APIs for! The
possibilities are quite interesting:

 - e.g. a webserver protocol stack in the kernel. (Tux done right)

 - webserver dynamic pages generated from the kernel.

 - complex DoS avoidance filters natively executed in the kernel.

 - filesystem plugins executed in kernel-space

 - complex security decisions done at native speed. (ok, selinux has a
   pretty good language for this already, which it interprets at
   runtime.)

_that_ is something application writers (or kernel coders) might get
excited about.

but more importantly, such an approach could generally ease some of the
"how much functionality should go into the kernel" pressure.

	Ingo
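
The load-time checks described in the forwarded messages (an opt-in
instruction list, a cap on instruction count, and relative jumps that
may only land on validated instruction boundaries, never "jmp (%eax)")
can be sketched in Python. This is a toy model under stated
assumptions, not kernel code and not real x86 decoding: the Insn
record, its length and rel fields, the ALLOWED_OPS, JUMP_OPS and
ALLOWED_REGS sets and the validate() function are all hypothetical
names invented for illustration, and instruction sizes are made-up
values. The run-time timeout is not modelled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All names here are hypothetical. Instructions are modelled as records
# instead of being decoded from real x86 bytes.
ALLOWED_OPS = {"nop", "movl", "addl", "subl", "jmp", "jz"}  # opt-in whitelist
JUMP_OPS = {"jmp", "jz"}
ALLOWED_REGS = {"%eax", "%ebx"}
MAX_INSNS = 100


@dataclass(frozen=True)
class Insn:
    op: str                         # mnemonic
    length: int                     # encoded size in bytes (toy value)
    operands: Tuple[str, ...] = ()  # register operands; must be empty for jumps
    rel: int = 0                    # jump displacement from the end of this insn


def validate(program: List[Insn]) -> List[str]:
    """Return the reasons for rejecting `program`; an empty list means accepted."""
    errors = []
    if len(program) > MAX_INSNS:
        errors.append(f"program has {len(program)} instructions, limit is {MAX_INSNS}")

    # Pass 1: record the byte offset at which every instruction starts.
    starts = []
    offset = 0
    for insn in program:
        starts.append(offset)
        offset += insn.length
    boundaries = set(starts)

    # Pass 2: check each instruction against the whitelist and the jump rules.
    for i, insn in enumerate(program):
        if insn.op not in ALLOWED_OPS:
            errors.append(f"insn {i}: {insn.op!r} is not on the whitelist")
        elif insn.op in JUMP_OPS:
            if insn.operands:
                # A jump through a register or memory operand is indirect.
                errors.append(f"insn {i}: indirect jump via {insn.operands[0]!r}")
            else:
                target = starts[i] + insn.length + insn.rel
                if target not in boundaries:
                    errors.append(f"insn {i}: target {target} is not an instruction boundary")
        else:
            for reg in insn.operands:
                if reg not in ALLOWED_REGS:
                    errors.append(f"insn {i}: operand {reg!r} is not allowed")
    return errors


if __name__ == "__main__":
    program = [Insn("movl", 2, ("%eax", "%ebx")), Insn("addl", 2, ("%eax", "%ebx"))]
    print(validate(program))  # accepted: prints []
```

Note that a backward jump to a valid boundary still passes this check,
so a validator of this kind cannot rule out infinite loops on its own;
that is the gap the timer-driven timeout in the messages above is meant
to cover.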