All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrei Vagin <avagin@gmail.com>
To: Jann Horn <jannh@google.com>
Cc: the arch/x86 maintainers <x86@kernel.org>,
	Andy Lutomirski <luto@kernel.org>,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Anton Ivanov <anton.ivanov@cambridgegreys.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall
Date: Fri, 2 Jul 2021 15:48:53 -0700	[thread overview]
Message-ID: <YN+X1QKWQOKekK4E@gmail.com> (raw)
In-Reply-To: <CAG48ez37ZUNvWy1eOvrW13kFRM-_ZW175x99Nyjq43w4Qz1qJQ@mail.gmail.com>

On Fri, Jul 02, 2021 at 10:56:38PM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > This change introduces the new system call:
> > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> >                 siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >
> > process_vm_exec allows to execute the current process in an address
> > space of another process.
> >
> > process_vm_exec swaps the current address space with an address space of
> > a specified process, sets a state from sigcontex and resumes the process.
> > When a process receives a signal or calls a system call,
> > process_vm_exec saves the process state back to sigcontext, restores the
> > origin address space, restores the origin process state, and returns to
> > userspace.
> >
> > If it was interrupted by a signal and the signal is in the user_mask,
> > the signal is dequeued and information about it is saved in uinfo.
> > If process_vm_exec is interrupted by a system call, a synthetic siginfo
> > for the SIGSYS signal is generated.
> >
> > The behavior of this system call is similar to PTRACE_SYSEMU but
> > everything is happing in the context of one process, so
> > process_vm_exec shows a better performance.
> >
> > PTRACE_SYSEMU is primarily used to implement sandboxes (application
> > kernels) like User-mode Linux or gVisor. These type of sandboxes
> > intercepts applications system calls and acts as the guest kernel.
> > A simple benchmark, where a "tracee" process executes systems calls in a
> > loop and a "tracer" process traps syscalls and handles them just
> > incrementing the tracee instruction pointer to skip the syscall
> > instruction shows that process_vm_exec works more than 5 times faster
> > than PTRACE_SYSEMU.
> [...]
> > +long swap_vm_exec_context(struct sigcontext __user *uctx)
> > +{
> > +       struct sigcontext ctx = {};
> > +       sigset_t set = {};
> > +
> > +
> > +       if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> > +               return -EFAULT;
> > +       /* A floating point state is managed from user-space. */
> > +       if (ctx.fpstate != 0)
> > +               return -EINVAL;
> > +       if (!user_access_begin(uctx, sizeof(*uctx)))
> > +               return -EFAULT;
> > +       unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> > +       user_access_end();
> > +
> > +       if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> > +               goto badframe;
> > +
> > +       return 0;
> > +Efault:
> > +       user_access_end();
> > +badframe:
> > +       signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> > +       return -EFAULT;
> > +}
> 
> Comparing the pieces of context that restore_sigcontext() restores
> with what a normal task switch does (see __switch_to() and callees), I
> noticed: On CPUs with FSGSBASE support, I think sandboxed code could
> overwrite FSBASE/GSBASE using the WRFSBASE/WRGSBASE instructions,
> causing the supervisor to access attacker-controlled addresses when it
> tries to access a thread-local variable like "errno"? Signal handling
> saves the segment registers, but not the FS/GS base addresses.
> 
> 
> jannh@laptop:~/test$ cat signal_gsbase.c
> // compile with -mfsgsbase
> #include <stdio.h>
> #include <signal.h>
> #include <immintrin.h>
> 
> void signal_handler(int sig, siginfo_t *info, void *ucontext_) {
>   puts("signal handler");
>   _writegsbase_u64(0x12345678);
> }
> 
> int main(void) {
>   struct sigaction new_act = {
>     .sa_sigaction = signal_handler,
>     .sa_flags = SA_SIGINFO
>   };
>   sigaction(SIGUSR1, &new_act, NULL);
> 
>   printf("original gsbase is 0x%lx\n", _readgsbase_u64());
>   raise(SIGUSR1);
>   printf("post-signal gsbase is 0x%lx\n", _readgsbase_u64());
> }
> jannh@laptop:~/test$ gcc -o signal_gsbase signal_gsbase.c -mfsgsbase
> jannh@laptop:~/test$ ./signal_gsbase
> original gsbase is 0x0
> signal handler
> post-signal gsbase is 0x12345678
> jannh@laptop:~/test$
> 
> 
> So to make this usable for a sandboxing usecase, you'd also have to
> save and restore FSBASE/GSBASE, just like __switch_to().

You are right. I've found this too when I implemented the gviosr user-space
part.

Here is the tree whether this problem has been fixed:
https://github.com/avagin/linux-task-diag/commits/wip/gvisor-5.10


WARNING: multiple messages have this Message-ID (diff)
From: Andrei Vagin <avagin@gmail.com>
To: Jann Horn <jannh@google.com>
Cc: the arch/x86 maintainers <x86@kernel.org>,
	Andy Lutomirski <luto@kernel.org>,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Anton Ivanov <anton.ivanov@cambridgegreys.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall
Date: Fri, 2 Jul 2021 15:48:53 -0700	[thread overview]
Message-ID: <YN+X1QKWQOKekK4E@gmail.com> (raw)
In-Reply-To: <CAG48ez37ZUNvWy1eOvrW13kFRM-_ZW175x99Nyjq43w4Qz1qJQ@mail.gmail.com>

On Fri, Jul 02, 2021 at 10:56:38PM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > This change introduces the new system call:
> > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> >                 siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >
> > process_vm_exec allows to execute the current process in an address
> > space of another process.
> >
> > process_vm_exec swaps the current address space with an address space of
> > a specified process, sets a state from sigcontex and resumes the process.
> > When a process receives a signal or calls a system call,
> > process_vm_exec saves the process state back to sigcontext, restores the
> > origin address space, restores the origin process state, and returns to
> > userspace.
> >
> > If it was interrupted by a signal and the signal is in the user_mask,
> > the signal is dequeued and information about it is saved in uinfo.
> > If process_vm_exec is interrupted by a system call, a synthetic siginfo
> > for the SIGSYS signal is generated.
> >
> > The behavior of this system call is similar to PTRACE_SYSEMU but
> > everything is happing in the context of one process, so
> > process_vm_exec shows a better performance.
> >
> > PTRACE_SYSEMU is primarily used to implement sandboxes (application
> > kernels) like User-mode Linux or gVisor. These type of sandboxes
> > intercepts applications system calls and acts as the guest kernel.
> > A simple benchmark, where a "tracee" process executes systems calls in a
> > loop and a "tracer" process traps syscalls and handles them just
> > incrementing the tracee instruction pointer to skip the syscall
> > instruction shows that process_vm_exec works more than 5 times faster
> > than PTRACE_SYSEMU.
> [...]
> > +long swap_vm_exec_context(struct sigcontext __user *uctx)
> > +{
> > +       struct sigcontext ctx = {};
> > +       sigset_t set = {};
> > +
> > +
> > +       if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> > +               return -EFAULT;
> > +       /* A floating point state is managed from user-space. */
> > +       if (ctx.fpstate != 0)
> > +               return -EINVAL;
> > +       if (!user_access_begin(uctx, sizeof(*uctx)))
> > +               return -EFAULT;
> > +       unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> > +       user_access_end();
> > +
> > +       if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> > +               goto badframe;
> > +
> > +       return 0;
> > +Efault:
> > +       user_access_end();
> > +badframe:
> > +       signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> > +       return -EFAULT;
> > +}
> 
> Comparing the pieces of context that restore_sigcontext() restores
> with what a normal task switch does (see __switch_to() and callees), I
> noticed: On CPUs with FSGSBASE support, I think sandboxed code could
> overwrite FSBASE/GSBASE using the WRFSBASE/WRGSBASE instructions,
> causing the supervisor to access attacker-controlled addresses when it
> tries to access a thread-local variable like "errno"? Signal handling
> saves the segment registers, but not the FS/GS base addresses.
> 
> 
> jannh@laptop:~/test$ cat signal_gsbase.c
> // compile with -mfsgsbase
> #include <stdio.h>
> #include <signal.h>
> #include <immintrin.h>
> 
> void signal_handler(int sig, siginfo_t *info, void *ucontext_) {
>   puts("signal handler");
>   _writegsbase_u64(0x12345678);
> }
> 
> int main(void) {
>   struct sigaction new_act = {
>     .sa_sigaction = signal_handler,
>     .sa_flags = SA_SIGINFO
>   };
>   sigaction(SIGUSR1, &new_act, NULL);
> 
>   printf("original gsbase is 0x%lx\n", _readgsbase_u64());
>   raise(SIGUSR1);
>   printf("post-signal gsbase is 0x%lx\n", _readgsbase_u64());
> }
> jannh@laptop:~/test$ gcc -o signal_gsbase signal_gsbase.c -mfsgsbase
> jannh@laptop:~/test$ ./signal_gsbase
> original gsbase is 0x0
> signal handler
> post-signal gsbase is 0x12345678
> jannh@laptop:~/test$
> 
> 
> So to make this usable for a sandboxing usecase, you'd also have to
> save and restore FSBASE/GSBASE, just like __switch_to().

You are right. I've found this too when I implemented the gviosr user-space
part.

Here is the tree whether this problem has been fixed:
https://github.com/avagin/linux-task-diag/commits/wip/gvisor-5.10


_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


  reply	other threads:[~2021-07-02 22:52 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14  5:52 [PATCH 0/4 POC] Allow executing code and syscalls in another address space Andrei Vagin
2021-04-14  5:52 ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 1/4] signal: add a helper to restore a process state from sigcontex Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 2/4] arch/x86: implement the process_vm_exec syscall Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14 17:09   ` Oleg Nesterov
2021-04-14 17:09     ` Oleg Nesterov
2021-04-23  6:59     ` Andrei Vagin
2021-04-23  6:59       ` Andrei Vagin
2021-06-28 16:13   ` Jann Horn
2021-06-28 16:13     ` Jann Horn
2021-06-28 16:30     ` Andy Lutomirski
2021-06-28 17:14       ` Jann Horn
2021-06-28 17:14         ` Jann Horn
2021-06-28 18:18         ` Eric W. Biederman
2021-06-28 18:18           ` Eric W. Biederman
2021-06-29  1:01           ` Andrei Vagin
2021-06-29  1:01             ` Andrei Vagin
2021-07-02  6:22     ` Andrei Vagin
2021-07-02  6:22       ` Andrei Vagin
2021-07-02 11:51       ` Jann Horn
2021-07-02 11:51         ` Jann Horn
2021-07-02 11:51         ` Jann Horn
2021-07-02 20:40         ` Andy Lutomirski
2021-07-02 20:40           ` Andy Lutomirski
2021-07-02  8:51   ` Peter Zijlstra
2021-07-02  8:51     ` Peter Zijlstra
2021-07-02 22:21     ` Andrei Vagin
2021-07-02 22:21       ` Andrei Vagin
2021-07-02 20:56   ` Jann Horn
2021-07-02 20:56     ` Jann Horn
2021-07-02 22:48     ` Andrei Vagin [this message]
2021-07-02 22:48       ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 3/4] arch/x86: allow to execute syscalls via process_vm_exec Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 4/4] selftests: add tests for process_vm_exec Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14  6:46 ` [PATCH 0/4 POC] Allow executing code and syscalls in another address space Jann Horn
2021-04-14  6:46   ` Jann Horn
2021-04-14 22:10   ` Andrei Vagin
2021-04-14 22:10     ` Andrei Vagin
2021-07-02  6:57   ` Andrei Vagin
2021-07-02  6:57     ` Andrei Vagin
2021-07-02 15:12     ` Jann Horn
2021-07-02 15:12       ` Jann Horn
2021-07-02 15:12       ` Jann Horn
2021-07-18  0:38       ` Andrei Vagin
2021-07-18  0:38         ` Andrei Vagin
2021-04-14  7:22 ` Anton Ivanov
2021-04-14  7:22   ` Anton Ivanov
2021-04-14  7:34   ` Johannes Berg
2021-04-14  7:34     ` Johannes Berg
2021-04-14  9:24     ` Benjamin Berg
2021-04-14  9:24       ` Benjamin Berg
2021-04-14 10:27 ` Florian Weimer
2021-04-14 10:27   ` Florian Weimer
2021-04-14 11:24   ` Jann Horn
2021-04-14 11:24     ` Jann Horn
2021-04-14 12:20     ` Florian Weimer
2021-04-14 12:20       ` Florian Weimer
2021-04-14 13:58       ` Jann Horn
2021-04-14 13:58         ` Jann Horn
2021-04-16 19:29 ` Kirill Smelkov
2021-04-16 19:29   ` Kirill Smelkov
2021-04-17 16:28 ` sbaugh
2021-04-17 16:28   ` sbaugh
2021-07-02 22:44 ` Andy Lutomirski
2021-07-02 22:44   ` Andy Lutomirski
2021-07-18  1:34   ` Andrei Vagin
2021-07-18  1:34     ` Andrei Vagin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YN+X1QKWQOKekK4E@gmail.com \
    --to=avagin@gmail.com \
    --cc=0x7f454c46@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=anton.ivanov@cambridgegreys.com \
    --cc=avagin@google.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=criu@openvz.org \
    --cc=jannh@google.com \
    --cc=jdike@addtoit.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-um@lists.infradead.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=mtk.manpages@gmail.com \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=richard@nod.at \
    --cc=rppt@linux.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.