From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1758774Ab2BJCDw (ORCPT ); Thu, 9 Feb 2012 21:03:52 -0500
Received: from mail2.shareable.org ([80.68.89.115]:43500 "EHLO
    mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S932076Ab2BJCDt (ORCPT );
    Thu, 9 Feb 2012 21:03:49 -0500
Date: Fri, 10 Feb 2012 02:02:55 +0000
From: Jamie Lokier
To: Indan Zupancic
Cc: Denys Vlasenko, Oleg Nesterov, Linus Torvalds, Andi Kleen,
    Andrew Lutomirski, Will Drewry, linux-kernel@vger.kernel.org,
    keescook@chromium.org, john.johansen@canonical.com,
    serge.hallyn@canonical.com, coreyb@linux.vnet.ibm.com,
    pmoore@redhat.com, eparis@redhat.com, djm@mindrot.org,
    segoon@openwall.com, rostedt@goodmis.org, jmorris@namei.org,
    scarybeasts@gmail.com, avi@redhat.com, penberg@cs.helsinki.fi,
    viro@zeniv.linux.org.uk, mingo@elte.hu, akpm@linux-foundation.org,
    khilman@ti.com, borislav.petkov@amd.com, amwang@redhat.com,
    ak@linux.intel.com, eric.dumazet@gmail.com, gregkh@suse.de,
    dhowells@redhat.com, daniel.lezcano@free.fr,
    linux-fsdevel@vger.kernel.org,
    linux-security-module@vger.kernel.org, olofj@chromium.org,
    mhalcrow@google.com, dlaor@redhat.com, Roland McGrath
Subject: Re: Compat 32-bit syscall entry from 64-bit task!?
Message-ID: <20120210020255.GA8333@jl-vm1.vm.bytemark.co.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Indan Zupancic wrote:
> On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
> > Indan Zupancic wrote:
> >> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> >> > Indan Zupancic wrote:
> >> The jailer I wrote works pretty well as a simplistic strace replacement.
> >> It can only print out the arguments we're checking, but that's usually
> >> the more interesting info.
> >
> > In theory such a thing should be easy to write, but as we both found,
> > ptrace() on Linux has a huge number of difficult quirks to deal with
> > to trace reliably. At least it's getting better with later kernels.
>
> It's not that bad, there are a few quirks, but not that many.
> The ptrace specific code is less than 500 lines of code, with
> a couple of hundred lines of header files. Linux ptrace specific
> stuff creeps in elsewhere too though, like that execve mess.

I count 720 lines *just* to read the syscall number and arguments in
strace-git, for the Linux archs it supports. That's only the Linux
code (I excluded non-Linux), and it's only a small part of syscall.c:
I didn't include generic ptracing, fork-following,
threaded-exec-fixups, signal handling etc., nor other arch-specific
functions and ABI fixups. And it doesn't even have all archs
currently in Linux mainline.

> >> It's not a 32 versus 64-bit issue though, so it will be something on
> >> its own anyway. Can as well add an extra ARM specific ptrace command
> >> to get that info, or hack it in some other way. For instance, ip is
> >> (ab)used to tell if it is syscall entry or exit, so doing these tricks
> >> isn't anything new in ARM either.
> >
> > In theory, aren't we supposed to know whether it's entry/exit anyway?
> > Why does strace care? Have there been kernel bugs in the past? Maybe
> > it was just to deal with SIGTRAP-after-exit in the past, which could
> > be delivered at an unpredictable time if blocked and then unblocked by
> > sigreturn().
>
> Maybe. I don't know why ARM does that ip thing.
>
> Although in theory you know the entry/exits if you keep track, but one
> mistake or unexpected behaviour (like execve for my code) and you can get
> it wrong. So for robustness sake it's good if it can be double checked.

I agree, and I think the PTRACE_EVENT_SYSCALL_ENTRY/EXIT events would
be a clean way to represent that.
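
A rough sketch of the "keep track, but double check" idea, in Python
for brevity. It is not code from strace or from either tracer in this
thread. The class and method names are made up for illustration, and
the "hint" stands for whatever the arch can report independently (for
example the ARM ip value mentioned above, which I am assuming reads as
0 on entry and 1 on exit). The tracker toggles per thread and, when a
hint disagrees, trusts the hint and counts the mismatch.

```python
from typing import Dict, Optional

ENTRY, EXIT = "entry", "exit"


class SyscallPhaseTracker:
    """Hypothetical helper: guesses whether a syscall-stop is entry or exit."""

    def __init__(self) -> None:
        # Phase expected at the next syscall-stop, keyed by thread id.
        self._expected: Dict[int, str] = {}
        # Number of times an arch hint contradicted our bookkeeping.
        self.mismatches = 0

    def expect(self, tid: int, phase: str) -> None:
        """Force the next expected phase, e.g. for a new fork child."""
        self._expected[tid] = phase

    def observe(self, tid: int, hint: Optional[str] = None) -> str:
        """Record one syscall-stop for tid and return its phase.

        hint is an optional independent report from the arch. When it
        disagrees with the toggle, resynchronise to the hint.
        """
        expected = self._expected.get(tid, ENTRY)
        phase = expected
        if hint is not None and hint != expected:
            self.mismatches += 1
            phase = hint
        # After an entry we expect an exit, and the other way round.
        self._expected[tid] = EXIT if phase == ENTRY else ENTRY
        return phase
```

A tracer would call expect(child_tid, EXIT) for archs that report
syscall-exit first in a traced fork child, and log or abort when
mismatches is non-zero rather than carry on with wrong state.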

I wonder if all archs report syscall-exit as the first event in traced
fork children. Looking at arch/hexagon I'm guessing it doesn't, but
it's hard to be sure and there's no practical way to test it :-/
That wouldn't matter if the events were robust.

I read somewhere about a bug report where syscall-exit was seen after
attach, but I don't remember where now.

> I don't know anything about OABI, can you link an OABI program against
> an EABI library? If you can then libc can be EABI and the kernel doesn't
> need OABI support.

That's not the point. If you're writing a ptrace jailer (as you are),
a program can deliberately use OABI calls to subvert the tracer, even
if it's using EABI for normal calls.

For linking, you are mostly right. Ideally everything would be open
and recompilable anyway, but that's sadly not always possible. OABI
and EABI have different struct layouts among other changes, and EABI
being newer tends to accompany other libc changes; embedded libcs
aren't always as drop-in backward-compatible as glibc.

> >> And then there's the whole confusion what that flag says, some might think
> >> it says in what mode the tracee is instead of what mode the system call is.
> >> That those two can be different is not obvious at all and seems very x86_64
> >> specific.
> >
> > My rough read of PARISC entry code suggests it has two entry methods,
> > similar to ARM and x86_64, but I'm not really familiar with PARISC and
> > I don't have a machine handy to try it out :-)
>
> It has a unified syscall table, so does it really matter?

I don't know if the 32/64 matters. For security or accurate tracing,
I wouldn't like to assume without checking if there are 64-on-32
argument alignment fixups.

PARISC has a second set of HPUX-compatible system call numbers,
handled in arch/parisc/hpux/*. I don't know if those are available to
all programs and can be used to subvert a ptracer. Looking at
hpux/gate.S I think they bypass ptrace entirely; maybe they can
subvert it.
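
To make the OABI point above concrete, here is a minimal sketch of how
a tracer might tell the two ARM calling conventions apart from the
trapping instruction word. The function name is made up, and the
constants are assumptions from my reading of the ARM entry code: OABI
encodes the number in the SWI immediate with base 0x900000, while EABI
uses immediate 0 and passes the number in r7. It handles ARM mode
only, not Thumb.

```python
from typing import Tuple

# Assumed OABI syscall base encoded in the SWI immediate field.
OABI_SYSCALL_BASE = 0x900000


def decode_arm_swi(insn: int, r7: int) -> Tuple[str, int]:
    """Return (abi, syscall_nr) for a 32-bit ARM-mode SWI/SVC word.

    insn is the instruction the tracee trapped on (read from pc - 4),
    r7 is the tracee's r7 register at syscall entry.
    """
    # SWI/SVC has bits 27..24 all set; the top nibble is the condition.
    if (insn & 0x0F000000) != 0x0F000000:
        raise ValueError("not a SWI/SVC instruction: %#x" % insn)
    imm = insn & 0x00FFFFFF
    if imm == 0:
        # EABI: immediate is zero, the number travels in r7.
        return ("eabi", r7)
    if (imm & 0x00F00000) == OABI_SYSCALL_BASE:
        # OABI: the number is the immediate minus the base.
        return ("oabi", imm & 0x000FFFFF)
    raise ValueError("unrecognised SWI immediate: %#x" % imm)
```

A jailer that only ever looks at r7 would misread the second form,
which is the kind of subversion described above.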

> > I have a script in progress which extracts all the
> > per-arch and per-ABI syscall numbers, syscall argument layouts and
> > kernel function names to keep track of arch-specific fixups, from a
> > Linux source tree. It currently works on all archs except it breaks
> > on x86 which insists on being different ;-)
>
> That's handy, but I thought strace had such a script already?
> See HACKING-scripts in strace source. Or is yours much better?

The strace script only gets the syscall numbers (so doesn't help
cross-check I've applied all arch-specific syscall fixups), doesn't
work for all arch/ABI combinations without editing unistd.h, and
requires a configured and partly built kernel for some archs.

It's only really useful for getting new syscall numbers which you then
hand-edit into the real table. You still have to set the number of
arguments and check carefully you haven't missed any arch-specific
fixups.

All the best,
-- Jamie