From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754495Ab2BABgw (ORCPT ); Tue, 31 Jan 2012 20:36:52 -0500 Received: from smarthost1.greenhost.nl ([195.190.28.78]:58861 "EHLO smarthost1.greenhost.nl" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752448Ab2BABgr (ORCPT ); Tue, 31 Jan 2012 20:36:47 -0500 Message-ID: In-Reply-To: References: <1327706681-11959-1-git-send-email-wad@chromium.org> <1327706681-11959-2-git-send-email-wad@chromium.org> <13b3f9dcf188908604a9529ef1934ecf.squirrel@webmail.greenhost.nl> <8a37c5805a9941a8616f1c28245a0880.squirrel@webmail.greenhost.nl> Date: Wed, 1 Feb 2012 02:36:17 +0100 Subject: Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF From: "Indan Zupancic" To: "Will Drewry" Cc: linux-kernel@vger.kernel.org, keescook@chromium.org, john.johansen@canonical.com, serge.hallyn@canonical.com, coreyb@linux.vnet.ibm.com, pmoore@redhat.com, eparis@redhat.com, djm@mindrot.org, torvalds@linux-foundation.org, segoon@openwall.com, rostedt@goodmis.org, jmorris@namei.org, scarybeasts@gmail.com, avi@redhat.com, penberg@cs.helsinki.fi, viro@zeniv.linux.org.uk, luto@mit.edu, mingo@elte.hu, akpm@linux-foundation.org, khilman@ti.com, borislav.petkov@amd.com, amwang@redhat.com, oleg@redhat.com, ak@linux.intel.com, eric.dumazet@gmail.com, gregkh@suse.de, dhowells@redhat.com, daniel.lezcano@free.fr, linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org, olofj@chromium.org, mhalcrow@google.com, dlaor@redhat.com, corbet@lwn.net, alan@lxorguk.ukuu.org.uk, mcgrathr@chromium.org User-Agent: SquirrelMail/1.4.22 MIME-Version: 1.0 Content-Type: text/plain;charset=UTF-8 Content-Transfer-Encoding: 8bit X-Priority: 3 (Normal) Importance: Normal X-Spam-Score: 0.0 X-Scan-Signature: 7a6051f01481ae3e55b4de4b051228fc Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, January 31, 2012 12:04, Will Drewry wrote: > On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic wrote: >>> I vote for: >>> >>> 3. Add tracehook support to all archs. > > I don't see these #3 as mutually exclusive :) They are if you really add tracehook support to all archs. ;-) > tracehook requires: > - task_pt_regs() in asm/processor.h or asm/ptrace.h > - arch_has_single_step() if there is hardware single-step support > - arch_has_block_step() if there is hardware block-step support > - asm/syscall.h supplying asm-generic/syscall.h interface > - linux/regset.h user_regset interfaces > - CORE_DUMP_USE_REGSET #define'd in linux/elf.h > -TIF_SYSCALL_TRACE calls tracehook_report_syscall_{entry,exit} > - TIF_NOTIFY_RESUME calls tracehook_notify_resume() > - signal delivery calls tracehook_signal_handler() Okay, that's a bit fuzzier than I expected. I suppose the archs implement some of that in another way currently? >>> Maybe not all archs, but at least some more. That way, every time someone >>> adds something tracehook specific, more archs support it. > > Well the other arch I want this on specifically for my purposes is > arm, but someone recently posted a partial asm/syscall.h for arm, but > I didn't see that one go anywhere just yet. (I know syscall_get_nr > can be tricky on arm because it doesn't (didn't) have a way of > tracking in-syscall state.) > > ref: https://lkml.org/lkml/2011/12/1/131 That totally ignores OABI, it should depend on CONFIG_AEABI and on !CONFIG_OABI_COMPAT. >>> syscall.h has no TRACEHOOK defines or anything though. > > Nope - it is just part of what is expected. > >>> Only syscall_rollback() looks tricky. I have no clue what the difference >>> between syscall_get_error() and syscall_get_return_value() is. But you >>> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which >>> should be possible for all archs. > > It seems even syscall_get_nr can have some wrinkles :) > > ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html That's 2009! I wonder why no progress happened since then. At least for SECCOMP you get the syscall nr directly as a parameter, so you at least don't actually need syscall_get_nr(). Same for ptrace. That doesn't help for /proc/$PID/syscall or coredumps though. One solution I can think of for the OABI compat problem is to let the OABI entry path store the syscall nr in thread_info->syscall and set an OABI task flag somewhere. Then syscall_get_nr() can check that to decide between r7 or the stored value. This would help /proc and doesn't cause any overhead for non-compat syscalls. Another ugly way of doing it would be to store the trap instruction for OABI calls, and when user space does a PTRACE_PEEK_TEXT on that it returns the saved instruction instead of rereading the actual memory. This avoids races and lets current user space work without modifications. Problem is that user space can't easily check if it's running on a fixed kernel. All in all I propose to only support this stuff for kernels not having OABI support, as that keeps things nice and simple, which was probably the point of moving to a new ARM ABI anyway. Tough, but that's what you get when you change ABIs. >>> How many archs don't support tracehook? > > 14 out of 26. However, 5 of those still have asm/syscall.h That's a lot of tricky work. >>> Well, the thing is, this recursion is controlled by user space depending >>> on how many filters they have installed. What is preventing them to force >>> you out of stack? > > Hrm true. The easiest option is to just convert it to iterative by > not using kref_t, but I'll look more closely. Probably. >>> So perhaps add at least some arbitrary filter limit to avoid this? > > Definitely possible -- probably as a sysctl. I'm not quite sure what > number makes sense yet, but I'll look at breaking the recursion first. > Thanks! Please no sysctl. The point of arbitrary limits is that they are arbitrarily high, but low enough to avoid any problems. A limit of 1K filters should be plenty for user space, but still put a limit on the total (memory) overhead of filters. >>>> I'll clarify a bit.  My original ptrace integration worked such that a >>>> tracer may only intercept syscall failures if it attached prior to the >>>> failing filter being installed.  I did it this way to avoid using >>>> ptrace to escape system call filtering.  However, since I don't have >>>> that as part of the patch series, it doesn't make sense to keep it. (I >>>> tracked a tracer pid_struct in the filters.)  If it needs to come back >>>> with later patchsets, then it can be tackled then! >>> >>> The problem of that is that filters can be shared between processes with >>> different ptracers. And you have all the hassle of keeping it up to date. >>> >>> I think seccomp should always come first and just trust ptrace. This >>> because it can deny all ptrace() calls for filtered tasks, so the only >>> untrusted tasks doing ptrace() are outside of seccomp's filtering control. >>> And those could do the same system calls themselves. >>> >>> The case where there is one task being filtered and allowed to do ptrace, >>> but not some other syscall, ptracing another filtered task which isn't >>> allowed to do ptrace, but allowed to do that other syscall, is quite far >>> fetched I think. If you really want to handle this, then you could run >>> the ptracer's filters against the tracee's post-ptrace syscall state. >>> This is best done in the ptracer's context, just before continuing the >>> system call. (You really want Oleg's SIKILL immediate patch then.) >>> >>> What about: >>> >>> 1) Seccomp filters can deny a syscall by killing the task. >>> >>> 2) Seccomp can deny a syscall and let it return an error value. >>> >>>   I know you're not fond of this one. It's just a performance >>>   optimisation as sometimes a lot of denied but harmless syscalls >>>   are called by glibc all the time, like getxattr(). Hardcoding >>>   the kill policy seems wrong when it can be avoided. If this is >>>   too hard then skip it, but if it's easy to add then please do. >>>   I'll take a look at this with your next patch version. > > It's easy on x86 harder on other arches. I would suggest saving > changing the __secure_computing signature until after the core > functionality lands, but that's just me. The only problem of that is that you will have two versions of seccomp filtering in the wild: One that does support this, and one that doesn't. It seems cleaner to consolidate the seccomp entry path with the ptrace path, and do the seccomp check first in there. This saves a few instructions on some archs for the seccomp check too and would simplify the entry code of most archs. Ptrace is expected to change the syscall nr already, so that can be used to set it to -1 or something. A different return value than -ENOSYS can be set in the syscall exit path, if required. All this can be shared with PTRACE_SYSEMU. I don't know. >>> 3) Seccomp can allow a syscall to proceed normally. >>> >>> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall. >>>   A filter sets this by returning a specific value. >>> >>> 5) Ptrace always gets a syscall event when it asked for it. >>> >>> 6) Ptrace can set an option to honour seccomp's hint and to not get all >>>   syscall events. >>> >>> This way all seccomp needs to do is to set some flags which ptrace can check. > > I like the use of flags/options to trigger ptrace handling. If I were > to stack rank these for pursuit after the core functionality lands, > it'd be to add #6 (and its deps) then #2. With #6, #2 can be > simulated (by having a supervisor that changes the syscall number to > -1), but that is much less ideal than just returning SECCOMP_ERROR > instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled > up. But it would require two filter versions and hence make it more difficult to generally support BPF syscall filtering. Greetings, Indan