From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758774Ab2BJCDw (ORCPT <rfc822;w@1wt.eu>);
	Thu, 9 Feb 2012 21:03:52 -0500
Received: from mail2.shareable.org ([80.68.89.115]:43500 "EHLO
	mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932076Ab2BJCDt (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 9 Feb 2012 21:03:49 -0500
Date: Fri, 10 Feb 2012 02:02:55 +0000
From: Jamie Lokier <jamie@shareable.org>
To: Indan Zupancic <indan@nul.nu>
Cc: Denys Vlasenko <vda.linux@googlemail.com>, Oleg Nesterov <oleg@redhat.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andi Kleen <andi@firstfloor.org>, Andrew Lutomirski <luto@mit.edu>,
        Will Drewry <wad@chromium.org>, linux-kernel@vger.kernel.org,
        keescook@chromium.org, john.johansen@canonical.com,
        serge.hallyn@canonical.com, coreyb@linux.vnet.ibm.com,
        pmoore@redhat.com, eparis@redhat.com, djm@mindrot.org,
        segoon@openwall.com, rostedt@goodmis.org, jmorris@namei.org,
        scarybeasts@gmail.com, avi@redhat.com, penberg@cs.helsinki.fi,
        viro@zeniv.linux.org.uk, mingo@elte.hu, akpm@linux-foundation.org,
        khilman@ti.com, borislav.petkov@amd.com, amwang@redhat.com,
        ak@linux.intel.com, eric.dumazet@gmail.com, gregkh@suse.de,
        dhowells@redhat.com, daniel.lezcano@free.fr,
        linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org,
        olofj@chromium.org, mhalcrow@google.com, dlaor@redhat.com,
        Roland McGrath <mcgrathr@chromium.org>
Subject: Re: Compat 32-bit syscall entry from 64-bit task!?
Message-ID: <20120210020255.GA8333@jl-vm1.vm.bytemark.co.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <da6b0694076aad94aa8afa3740126cca.squirrel@webmail.greenhost.nl>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Indan Zupancic wrote:
> On Thu, January 26, 2012 12:47, Jamie Lokier wrote:
> > Indan Zupancic wrote:
> >> On Thu, January 26, 2012 11:31, Jamie Lokier wrote:
> >> > Indan Zupancic wrote:
> >> The jailer I wrote works pretty well as a simplistic strace replacement.
> >> It can only print out the arguments we're checking, but that's usually
> >> the more interesting info.
> >
> > In theory such a thing should be easy to write, but as we both found,
> > ptrace() on Linux has a huge number of difficult quirks to deal with
> > to trace reliably.  At least it's getting better with later kernels.
> 
> It's not that bad, there are a few quirks, but not that many.
> The ptrace specific code is less than 500 lines of code, with
> a couple of hundred lines of header files. Linux ptrace specific
> stuff creeps in elsewhere too though, like that execve mess.

I count 720 lines *just* to read the syscall number and arguments in
strace-git, for the Linux archs it supports.

That's only the Linux code (I excluded non-Linux), and only a small
part of syscall.c.  I didn't include generic ptracing, fork-following,
threaded-exec-fixups, signal handling etc., nor other arch-specific
functions and ABI fixups.  And it doesn't even cover all the archs
currently in Linux mainline.
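
To give a feel for why that code grows with every arch, here is a
minimal sketch (Python, purely illustrative, not strace's code) of the
per-ABI knowledge a tracer needs just to find the syscall number and
arguments at syscall entry.  The table and function names are
hypothetical; the register assignments are the usual ones for these
three ABIs, and real code also has to cope with quirks such as a
64-bit x86 task entering through the 32-bit path.

```python
# Illustrative only: which registers hold the syscall number and the
# arguments at syscall entry, for three ABIs.  A real tracer needs an
# entry like this (plus special cases) for every arch/ABI it supports.
SYSCALL_REGS = {
    "x86_64":   {"nr": "orig_rax", "args": ["rdi", "rsi", "rdx", "r10", "r8", "r9"]},
    "i386":     {"nr": "orig_eax", "args": ["ebx", "ecx", "edx", "esi", "edi", "ebp"]},
    "arm-eabi": {"nr": "r7",       "args": ["r0", "r1", "r2", "r3", "r4", "r5"]},
}


def decode_syscall(abi, regs):
    """Return (number, [args]) from a dict mapping register name to value."""
    layout = SYSCALL_REGS[abi]
    return regs[layout["nr"]], [regs[name] for name in layout["args"]]
```

Here `regs` stands in for whatever the tracer fetched from the tracee
(for example with PTRACE_GETREGS); how that is done is itself
arch-specific and is left out.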

> >> It's not a 32 versus 64-bit issue though, so it will be something on
> >> its own anyway. Can as well add an extra ARM specific ptrace command
> >> to get that info, or hack it in some other way. For instance, ip is
> >> (ab)used to tell if it is syscall entry or exit, so doing these tricks
> >> isn't anything new in ARM either.
> >
> > In theory, aren't we supposed to know whether it's entry/exit anyway?
> > Why does strace care?  Have there been kernel bugs in the past?  Maybe
> > it was just to deal with SIGTRAP-after-exit in the past, which could
> > be delivered at an unpredictable time if blocked and then unblocked by
> > sigreturn().
> 
> Maybe. I don't know why ARM does that ip thing.
> 
> Although in theory you know the entry/exits if you keep track, but one
> mistake or unexpected behaviour (like execve for my code) and you can get
> it wrong. So for robustness sake it's good if it can be double checked.

I agree, and I think the PTRACE_EVENT_SYSCALL_ENTRY/EXIT events would
be a clean way to represent that.

I wonder if all archs report syscall-exit as the first event in traced
fork children.  Looking at arch/hexagon I'm guessing it doesn't, but
it's hard to be sure and there's no practical way to test it :-/

That wouldn't matter if the events were robust.
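
As a rough sketch of the "keep track, but double check" idea: the
tracer toggles between entry and exit on each syscall stop, and
resyncs whenever the kernel gives an explicit indication.  The `hint`
argument below is an assumption standing in for such an indication
(the proposed entry/exit events, or the ARM ip trick); the class and
method names are hypothetical.

```python
class SyscallStopTracker:
    """Guess entry vs exit by toggling, and cross-check against a hint."""

    def __init__(self):
        self.in_syscall = False   # True between an entry stop and its exit stop
        self.mismatches = 0       # how often the hint disagreed with our guess

    def on_syscall_stop(self, hint=None):
        """hint is 'entry', 'exit', or None when nothing explicit is known."""
        guess = "exit" if self.in_syscall else "entry"
        if hint is not None and hint != guess:
            self.mismatches += 1
            guess = hint          # trust the explicit indication and resync
        self.in_syscall = (guess == "entry")
        return guess
```

Without a hint, a child whose first stop is a syscall-exit would be
misread as an entry and every later stop would be off by one; with a
hint the tracker recovers after a single stop.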

I read somewhere about a bug report where syscall-exit was seen after
attach, but I don't remember where now.

> I don't know anything about OABI, can you link an OABI program against
> an EABI library? If you can then libc can be EABI and the kernel doesn't
> need OABI support.

That's not the point.  If you're writing a ptrace jailer (as you are)
a program can deliberately use OABI calls to subvert the tracer, even
if it's using EABI for normal calls.
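
A minimal sketch of the check a jailer would need on ARM, assuming it
has already read the 32-bit instruction word that trapped (at pc-4 in
ARM mode) and the tracee's r7.  With EABI the SWI immediate is zero
and the number is in r7; with OABI the number is encoded in the
instruction itself, offset by 0x900000.  Thumb mode is not covered,
and the function name is hypothetical.

```python
OABI_SYSCALL_BASE = 0x900000  # offset added to OABI syscall numbers in the SWI immediate


def arm_syscall_abi(insn, r7):
    """Classify an ARM-mode SWI/SVC instruction word.

    insn is the 32-bit instruction that trapped, r7 the tracee's r7.
    Returns (abi, syscall_number).
    """
    if (insn & 0x0F000000) != 0x0F000000:
        raise ValueError("not a SWI/SVC instruction")
    imm = insn & 0x00FFFFFF
    if imm == 0:
        return "EABI", r7                        # number passed in r7
    if (imm & 0x00F00000) == OABI_SYSCALL_BASE:
        return "OABI", imm - OABI_SYSCALL_BASE   # number encoded in the instruction
    raise ValueError("unexpected SWI immediate")
```

A tracer that only ever looks at r7 would see whatever the program
chose to leave there during an OABI call, which is the subversion
risk.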

For linking, you are mostly right.  Ideally everything would be open
and recompilable anyway, but that's sadly not always possible.  OABI
and EABI have different struct layouts among other changes, and EABI,
being newer, tends to accompany other libc changes; embedded libcs
aren't always as drop-in backward-compatible as glibc.

> >> And then there's the whole confusion what that flag says, some might think
> >> it says in what mode the tracee is instead of what mode the system call is.
> >> That those two can be different is not obvious at all and seems very x86_64
> >> specific.
> >
> > My rough read of PARISC entry code suggests it has two entry methods,
> > similar to ARM and x86_64, but I'm not really familiar with PARISC and
> > I don't have a machine handy to try it out :-)
> 
> It has a unified syscall table, so does it really matter?

I don't know if the 32/64 distinction matters.  For security or accurate
tracing, I wouldn't like to assume without checking if there are
64-on-32 argument alignment fixups.
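
To show what such a fixup looks like, here is a small sketch of the
rule used by ABIs such as ARM EABI, where a 64-bit argument on a
32-bit arch must start in an even-numbered register, so a padding
register may be skipped.  Whether PARISC applies a rule like this is
exactly the part that would need checking; the function name and
parameters are hypothetical.

```python
def assign_arg_slots(arg_words, align_64bit):
    """Map each argument to the first 32-bit register slot it occupies.

    arg_words: number of 32-bit words per argument (1 or 2).
    align_64bit: True if 2-word arguments must start on an even slot.
    """
    slots, next_slot = [], 0
    for words in arg_words:
        if words == 2 and align_64bit and next_slot % 2:
            next_slot += 1   # skip a padding register
        slots.append(next_slot)
        next_slot += words
    return slots
```

For a call shaped like pread64(fd, buf, count, offset), the offset
starts in slot 4 with alignment and slot 3 without, so a tracer that
assumes the wrong rule reads the wrong registers.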

PARISC has a second set of HPUX-compatible system call numbers,
handled in arch/parisc/hpux/*.  I don't know if those are available to
all programs and can be used to subvert a ptracer.  Looking at
hpux/gate.S I think they bypass ptrace entirely; maybe they can subvert it.

> > I have a script in progress which extracts all the
> > per-arch and per-ABI syscall numbers, syscall argument layouts and
> > kernel function names to keep track of arch-specific fixups, from a
> > Linux source tree.  It currently works on all archs except it breaks
> > on x86 which insists on being different ;-)
> 
> That's handy, but I thought strace had such a script already?
> See HACKING-scripts in strace source. Or is yours much better?

The strace script only gets the syscall numbers (so doesn't help
cross-check I've applied all arch-specific syscall fixups), doesn't
work for all arch/ABI combinations without editing unistd.h, and
requires a configured and partly built kernel for some archs.  It's
only really useful for getting new syscall numbers which you then
hand-edit into the real table.  You still have to set the number of
arguments and check carefully you haven't missed any arch-specific
fixups.
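
For illustration, the easy part (numbers only) can be sketched in a
few lines.  This is not the strace script nor the one described
above, just an assumed minimal version that reads plain
"#define __NR_name N" lines; the function name is hypothetical.

```python
import re

# Matches lines such as "#define __NR_read 3"; anything fancier is skipped.
_NR_DEFINE = re.compile(r"^\s*#\s*define\s+__NR_(\w+)\s+(\d+)\s*$")


def parse_syscall_numbers(header_text):
    """Return {syscall name: number} for plain numeric __NR_ defines."""
    table = {}
    for line in header_text.splitlines():
        m = _NR_DEFINE.match(line)
        if m:
            table[m.group(1)] = int(m.group(2))
    return table
```

Defines written as expressions, such as a base plus an offset, are
silently skipped by this sketch, and it says nothing about argument
counts or arch-specific fixups, which is where the real work is.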

All the best,
-- Jamie
