From: Will Drewry
Subject: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF
Date: Thu, 12 Jan 2012 11:35:55 -0600
References: <1326302710-9427-1-git-send-email-wad@chromium.org> <1326302710-9427-2-git-send-email-wad@chromium.org> <1326383015.7642.77.camel@gandalf.stny.rr.com> <20120112172241.GJ7180@jl-vm1.vm.bytemark.co.uk>
Cc: Steven Rostedt, linux-kernel@vger.kernel.org, keescook@chromium.org, john.johansen@canonical.com, serge.hallyn@canonical.com, coreyb@linux.vnet.ibm.com, pmoore@redhat.com, eparis@redhat.com, djm@mindrot.org, torvalds@linux-foundation.org, segoon@openwall.com, jmorris@namei.org, scarybeasts@gmail.com, avi@redhat.com, penberg@cs.helsinki.fi, viro@zeniv.linux.org.uk, luto@mit.edu, mingo@elte.hu, akpm@linux-foundation.org, khilman@ti.com, borislav.petkov@amd.com, amwang@redhat.com, oleg@redhat.com, ak@linux.intel.com, eric.dumazet@gmail.com, gregkh@suse.de, dhowells@redhat.com, daniel.lezcano@free.fr, linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org, olofj@chromium.org, mhalcrow@google.com, dlaor@redhat.com
To: Jamie Lokier
In-Reply-To: <20120112172241.GJ7180@jl-vm1.vm.bytemark.co.uk>
List-Id: linux-fsdevel.vger.kernel.org

On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt wrote:
>> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >
>> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> user namespace.  Once a task-local filter program is attached from a
>> >> process without privileges, execve will fail.
>> >> This ensures that only
>> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> binary).
>> >
>> > This means that a non privileged user can not run another program with
>> > limited features? How would a process exec another program and filter
>> > it? I would assume that the filter would need to be attached first and
>> > then the execv() would be performed. But after the filter is attached,
>> > the execv is prevented?
>>
>> Yeah - it means tasks can filter themselves, but not each other.
>> However, you can inject a filter for any dynamically linked executable
>> using LD_PRELOAD.
>>
>> > Maybe I don't understand this correctly.
>>
>> You're right on.  This was to ensure that one process didn't cause
>> crazy behavior in another. I think Alan has a better proposal than
>> mine below.  (Goes back to catching up.)
>
> You can already use ptrace() to cause crazy behaviour in another
> process, including modifying registers arbitrarily at syscall entry
> and exit, aborting and emulating syscalls.
>
> ptrace() is quite slow and it would be really nice to speed it up,
> especially for trapping a small subset of syscalls, or limiting some
> kinds of access to some file descriptors, while everything else runs
> at normal speed.
>
> Speeding up ptrace() with BPF filters would be a really nice.  Not
> that I like ptrace(), but sometimes it's the only thing you can rely on.
>
> LD_PRELOAD and code running in the target process address space can't
> always be trusted in some contexts (e.g. the target process may modify
> the tracing code or its data); whereas ptrace() is pretty complete and
> reliable, if ugly.
>
> There's already a security model around who can use ptrace(); speeding
> it up needn't break that.
>
> If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> needed as userspace could have done it, with exactly the restrictions
> it wants.
> Google's NaCl comes to mind as a potential user.

That's not entirely true.  ptrace supervisors are subject to races and
always fail open: if the supervisor exits, the tracee is detached and
carries on unrestricted.  That makes them effective, but not as robust
as a seccomp-based solution can be.  Seccomp fails closed.

What I think would make sense is to add a user-controllable failure
mode to seccomp BPF that calls tracehook_ptrace_syscall_entry(regs).
I've prototyped this and it works quite well, but I didn't want to
conflate the two discussions.

Using ptrace() would also mean that every consumer of this interface
needs a supervisor process.  With seccomp, the filters are installed in
the task itself, so no supervisor has to stick around in case a failure
occurs.

Does that make sense?

thanks!
will
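For illustration only: this message does not show the RFC patch's own interface, so the sketch below uses the seccomp filter API that was later merged into mainline Linux (3.5 and later), assuming glibc and an x86_64 or aarch64 target. Its SECCOMP_RET_TRACE action is comparable to the ptrace-deferring failure mode proposed above: a matching system call is handed to a ptrace tracer that has set PTRACE_O_TRACESECCOMP, and if no such tracer is attached the kernel refuses the call with ENOSYS instead of running it, so it fails closed without any supervisor process. The helper name install_trace_filter() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
#define FILTER_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define FILTER_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only handles x86_64 and aarch64"
#endif

/* Hypothetical helper: defer syscall `nr` to a ptrace tracer via
 * SECCOMP_RET_TRACE and allow every other syscall.  Uses the mainline
 * seccomp API, not the interface of the RFC patch in this thread. */
static int install_trace_filter(long nr)
{
    struct sock_filter insns[] = {
        /* Load the audit architecture; kill on mismatch so syscall
         * numbers from another ABI cannot slip past the check.
         * (A hardened x86_64 filter would also reject x32 numbers,
         * which set bit 0x40000000; omitted here.) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FILTER_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* Load the syscall number and compare it with `nr`. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        /* Match: notify a tracer, or fail with ENOSYS if none opted in. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
        /* No match: run the syscall normally. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Callers without CAP_SYS_ADMIN must set no_new_privs first. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A process would call install_trace_filter(SYS_getppid) once on itself; from then on syscall(SYS_getppid) returns -1 with errno set to ENOSYS unless a tracer has opted in with PTRACE_O_TRACESECCOMP, while all other system calls run at full speed.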