All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jann Horn <jannh@google.com>
To: Andrei Vagin <avagin@gmail.com>
Cc: kernel list <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Anton Ivanov <anton.ivanov@cambridgegreys.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
Date: Wed, 14 Apr 2021 08:46:40 +0200	[thread overview]
Message-ID: <CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com> (raw)
In-Reply-To: <20210414055217.543246-1-avagin@gmail.com>

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based:
      a set of syscalls to create a new mm_struct,
      change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based:
      like /dev/kvm, an API to create a new pagetable, mirror parts of
      the mm_struct's pagetables over into it with modified permissions
      (like KVM_SET_USER_MEMORY_REGION),
      and run code under that context.
      page fault handling would first handle the fault against mm->pgd
      as normal, then mirror the PTE over into the secondary pagetables.
      invalidation could be handled with MMU notifiers.

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory? That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting a parasite code and it will allow
> pre-dump memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process is continue
> running. On each interaction we dump memory that has been changed from
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?

WARNING: multiple messages have this Message-ID (diff)
From: Jann Horn <jannh@google.com>
To: Andrei Vagin <avagin@gmail.com>
Cc: kernel list <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Anton Ivanov <anton.ivanov@cambridgegreys.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
Date: Wed, 14 Apr 2021 08:46:40 +0200	[thread overview]
Message-ID: <CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com> (raw)
In-Reply-To: <20210414055217.543246-1-avagin@gmail.com>

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based:
      a set of syscalls to create a new mm_struct,
      change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based:
      like /dev/kvm, an API to create a new pagetable, mirror parts of
      the mm_struct's pagetables over into it with modified permissions
      (like KVM_SET_USER_MEMORY_REGION),
      and run code under that context.
      page fault handling would first handle the fault against mm->pgd
      as normal, then mirror the PTE over into the secondary pagetables.
      invalidation could be handled with MMU notifiers.

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory? That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting a parasite code and it will allow
> pre-dump memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process is continue
> running. On each interaction we dump memory that has been changed from
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um

  parent reply	other threads:[~2021-04-14  6:47 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14  5:52 [PATCH 0/4 POC] Allow executing code and syscalls in another address space Andrei Vagin
2021-04-14  5:52 ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 1/4] signal: add a helper to restore a process state from sigcontex Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 2/4] arch/x86: implement the process_vm_exec syscall Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14 17:09   ` Oleg Nesterov
2021-04-14 17:09     ` Oleg Nesterov
2021-04-23  6:59     ` Andrei Vagin
2021-04-23  6:59       ` Andrei Vagin
2021-06-28 16:13   ` Jann Horn
2021-06-28 16:13     ` Jann Horn
2021-06-28 16:30     ` Andy Lutomirski
2021-06-28 17:14       ` Jann Horn
2021-06-28 17:14         ` Jann Horn
2021-06-28 18:18         ` Eric W. Biederman
2021-06-28 18:18           ` Eric W. Biederman
2021-06-29  1:01           ` Andrei Vagin
2021-06-29  1:01             ` Andrei Vagin
2021-07-02  6:22     ` Andrei Vagin
2021-07-02  6:22       ` Andrei Vagin
2021-07-02 11:51       ` Jann Horn
2021-07-02 11:51         ` Jann Horn
2021-07-02 11:51         ` Jann Horn
2021-07-02 20:40         ` Andy Lutomirski
2021-07-02 20:40           ` Andy Lutomirski
2021-07-02  8:51   ` Peter Zijlstra
2021-07-02  8:51     ` Peter Zijlstra
2021-07-02 22:21     ` Andrei Vagin
2021-07-02 22:21       ` Andrei Vagin
2021-07-02 20:56   ` Jann Horn
2021-07-02 20:56     ` Jann Horn
2021-07-02 22:48     ` Andrei Vagin
2021-07-02 22:48       ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 3/4] arch/x86: allow to execute syscalls via process_vm_exec Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14  5:52 ` [PATCH 4/4] selftests: add tests for process_vm_exec Andrei Vagin
2021-04-14  5:52   ` Andrei Vagin
2021-04-14  6:46 ` Jann Horn [this message]
2021-04-14  6:46   ` [PATCH 0/4 POC] Allow executing code and syscalls in another address space Jann Horn
2021-04-14 22:10   ` Andrei Vagin
2021-04-14 22:10     ` Andrei Vagin
2021-07-02  6:57   ` Andrei Vagin
2021-07-02  6:57     ` Andrei Vagin
2021-07-02 15:12     ` Jann Horn
2021-07-02 15:12       ` Jann Horn
2021-07-02 15:12       ` Jann Horn
2021-07-18  0:38       ` Andrei Vagin
2021-07-18  0:38         ` Andrei Vagin
2021-04-14  7:22 ` Anton Ivanov
2021-04-14  7:22   ` Anton Ivanov
2021-04-14  7:34   ` Johannes Berg
2021-04-14  7:34     ` Johannes Berg
2021-04-14  9:24     ` Benjamin Berg
2021-04-14  9:24       ` Benjamin Berg
2021-04-14 10:27 ` Florian Weimer
2021-04-14 10:27   ` Florian Weimer
2021-04-14 11:24   ` Jann Horn
2021-04-14 11:24     ` Jann Horn
2021-04-14 12:20     ` Florian Weimer
2021-04-14 12:20       ` Florian Weimer
2021-04-14 13:58       ` Jann Horn
2021-04-14 13:58         ` Jann Horn
2021-04-16 19:29 ` Kirill Smelkov
2021-04-16 19:29   ` Kirill Smelkov
2021-04-17 16:28 ` sbaugh
2021-04-17 16:28   ` sbaugh
2021-07-02 22:44 ` Andy Lutomirski
2021-07-02 22:44   ` Andy Lutomirski
2021-07-18  1:34   ` Andrei Vagin
2021-07-18  1:34     ` Andrei Vagin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com' \
    --to=jannh@google.com \
    --cc=0x7f454c46@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=anton.ivanov@cambridgegreys.com \
    --cc=avagin@gmail.com \
    --cc=avagin@google.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=criu@openvz.org \
    --cc=jdike@addtoit.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-um@lists.infradead.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=mtk.manpages@gmail.com \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=richard@nod.at \
    --cc=rppt@linux.ibm.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.