On Wed, Jan 18, 2012 at 5:12 AM, Indan Zupancic <indan@nul.nu> wrote:
>
> So there is this gap and there is no good way to handle it at all for
> user space? And even if it's fixed in the kernel, that won't help with
> older kernels, so it will stay a problem for a while.

Correct.

> Can this int 0x80 trick be blocked for ptraced task (preferably always),
> pretty please?

Nope. Not that I can tell. The "unable to read $pc-2" is a hardware
feature, and we cannot stop users from running the "int 0x80" code.
The only way to block it is to simply not enable the 32-bit
compatibility mode at all, at which point the "int 0x80" interface
simply doesn't exist.

And sure, we could do something in the kernel (like saying that you
cannot do "int 0x80" from 64-bit code by explicitly testing in the
ia32_syscall function), but that has the same "even if it's fixed in
the kernel" issue.

You can test this feature out with a test-program something like this:

  #define _GNU_SOURCE	/* must come before any #include */
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <signal.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  void handler(int sig)
  {
	printf("SIGWINCH\n");
  }

  int main(int argc, char **argv)
  {
	long ret;

	signal(SIGWINCH, handler);
	/* the return value comes back in eax, so declare it as an output */
	asm volatile("int $0x80": "=a" (ret) : "0" (29) : "memory");	/* sys_pause - 32-bit */
	syscall(34);	/* sys_pause - 64-bit */
	return 0;
  }

which does two "pause()" system calls from 64-bit mode, the first one
using the legacy system call interface.

At least "strace" gets really confused, and will show the first one as

   shmget(0x1c, 140734112566944, 0)        = ? ERESTARTNOHAND (To be restarted)

because it assumes that in 64-bit mode, system call number 29 means
"shmget". It doesn't even look at $pc-2, which (since this code
doesn't try to obfuscate it) would have worked in this case.
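
To make the mismatch concrete, here is a minimal sketch of what a tracer
would have to do to name the call correctly: classify the two bytes just
before the saved pc, then look the number up in the matching table. The
enum and function names are made up for illustration and are not taken
from strace, and the table covers only the numbers used in the test
program above. Since the bytes at $pc-2 cannot be trusted in general,
this only helps for code that doesn't try to hide the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum abi { ABI_UNKNOWN, ABI_I386, ABI_X86_64 };

/*
 * Classify the two instruction bytes that end at the saved pc.
 * "int $0x80" encodes as cd 80, "syscall" as 0f 05; both are 2 bytes.
 */
enum abi abi_from_opcode(const unsigned char *insn)
{
	static const unsigned char int80[2] = { 0xcd, 0x80 };
	static const unsigned char syscall64[2] = { 0x0f, 0x05 };

	assert(insn != NULL);
	if (memcmp(insn, int80, 2) == 0)
		return ABI_I386;
	if (memcmp(insn, syscall64, 2) == 0)
		return ABI_X86_64;
	return ABI_UNKNOWN;
}

/* Tiny excerpt of the two syscall tables: only the numbers used above. */
const char *syscall_name(enum abi abi, long nr)
{
	if (abi == ABI_I386 && nr == 29)
		return "pause";
	if (abi == ABI_X86_64 && nr == 29)
		return "shmget";
	if (abi == ABI_X86_64 && nr == 34)
		return "pause";
	return "?";
}
```

With the bytes cd 80 and number 29 this gives "pause"; looking up the
same number in the 64-bit table gives the "shmget" that strace printed.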

I actually checked the strace source code. It has

  #  if 0
                /* This version analyzes the opcode of a syscall instruction.
                 * (int 0x80 on i386 vs. syscall on x86-64)
                 * It works, but is too complicated.
                 */
                unsigned long val, rip, i;

                if (upeek(tcp, 8*RIP, &rip) < 0)
                        perror("upeek(RIP)");

                /* sizeof(syscall) == sizeof(int 0x80) == 2 */
                rip -= 2;
                errno = 0;
              ...

so there is code there that could make it work, but it's "#if 0"'ed
out. The code that is actually used just does

                /* Check CS register value. On x86-64 linux it is:
                 *      0x33    for long mode (64 bit)
                 *      0x23    for compatibility mode (32 bit)
                 * It takes only one ptrace and thus doesn't need
                 * to be cached.
                 */
                if (upeek(tcp, 8*CS, &val) < 0)
                        return -1;
                switch (val) {
                        case 0x23: currpers = 1; break;
                        case 0x33: currpers = 0; break;

which is the reasonable and obvious approach.

I'm looking at "struct user_regs_struct" and there really isn't any
non-architected state there outside of "high bits".

There are high bits that we can hide things in outside of orig_ax - we
do have 64 bits for "cs" for example - but it all boils down to the
same issue: we *will* break something that thinks it knows the details
of this. The advantage of "orig_ax" would be that at least it makes
conceptual sense there.

Using the high bits of 'eflags' might work. Hopefully nobody tests
that. IOW, something like the attached might work. It just sets bit#32
in eflags if the system call is a compat call.

With that, a tracer using ptrace would at least be able to tell whether
it's a compat call or not (assuming a new kernel, of course - it would
still need the "look at cs" check as a fallback). It could do something like

   mode = (eflags >> 32) & 3;
   switch (mode) {
   case 0:
           /* old kernel: guess it from CS */
           break;
   case 1:
           /* 64-bit */
           break;
   case 2:
           /* 32-bit */
           break;
   default:
           /* Oddity. */
           break;
   }

or something like that. The idea being that you can also see from
eflags whether the new feature is supported or not.
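
Spelled out as compilable C, the decoding above could look like the
sketch below. The bit layout is just the one from the pseudocode, not an
existing kernel interface, and the enum and function names are
hypothetical. The CS values are the ones from the strace comment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum sc_mode { SC_MODE_64, SC_MODE_32, SC_MODE_ODD };

/* Fallback for kernels that don't set the eflags high bits. */
enum sc_mode mode_from_cs(uint64_t cs)
{
	switch (cs) {
	case 0x33:	/* long mode (64 bit) */
		return SC_MODE_64;
	case 0x23:	/* compatibility mode (32 bit) */
		return SC_MODE_32;
	default:
		return SC_MODE_ODD;
	}
}

/* Decode the proposed two-bit field in bits 32-33 of eflags. */
enum sc_mode syscall_mode(uint64_t eflags, uint64_t cs)
{
	switch ((eflags >> 32) & 3) {
	case 0:		/* feature not supported: guess it from CS */
		return mode_from_cs(cs);
	case 1:
		return SC_MODE_64;
	case 2:
		return SC_MODE_32;
	default:
		return SC_MODE_ODD;
	}
}

/* The interesting case: "int 0x80" issued from 64-bit code. */
void example(void)
{
	/* Old kernel: only CS is available, and it says 64-bit. */
	assert(syscall_mode(0x246, 0x33) == SC_MODE_64);
	/* New kernel: the high bits say compat even though CS is 0x33. */
	assert(syscall_mode(((uint64_t)2 << 32) | 0x246, 0x33) == SC_MODE_32);
}
```

The point of the example() cases is that with CS alone the int 0x80 call
from the test program is indistinguishable from a native 64-bit call,
while the high bits would tell them apart.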

THIS IS TOTALLY UNTESTED!

                      Linus