From: Andrei Vagin <avagin@gmail.com>
To: Jann Horn <jannh@google.com>
Cc: kernel list <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Anton Ivanov <anton.ivanov@cambridgegreys.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
Date: Thu, 1 Jul 2021 23:57:53 -0700
Message-ID: <YN648cPBDIGKYlYa@gmail.com>
In-Reply-To: <CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com>

On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
> 
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...
> 
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
> 
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
> 
> 1. mm_struct-based:
>       a set of syscalls to create a new mm_struct,
>       change memory mappings under that mm_struct, and switch to it

I like the idea of having a handle for an mm. Instead of a pid, we would
pass this handle to process_vm_exec. We have pidfd for processes, and we
can introduce mmfd for mm_struct.


> 2. pagetable-mirroring-based:
>       like /dev/kvm, an API to create a new pagetable, mirror parts of
>       the mm_struct's pagetables over into it with modified permissions
>       (like KVM_SET_USER_MEMORY_REGION),
>       and run code under that context.
>       page fault handling would first handle the fault against mm->pgd
>       as normal, then mirror the PTE over into the secondary pagetables.
>       invalidation could be handled with MMU notifiers.
>

I found this idea interesting and decided to look at it more closely.
After reading the kernel code for a few days, I realized that it would
not be easy to implement something like this. More importantly, I don’t
understand what problem it solves. Will it simplify the user-space code?
I don’t think so. Will it improve performance? That is unclear to me
too.

First, in the KVM case, we have a few big linear mappings and need to
support one “shadow” address space. In the case of sandboxes, we can
have a tremendous number of mappings and many address spaces that we
need to manage. Memory mappings will be mapped at different addresses
in the supervisor address space and in “guest” address spaces. If guest
address spaces do not have their own mm_structs, we will need to
reinvent vma-s in some form. If guest address spaces do have mm_structs,
this will look similar to https://lwn.net/Articles/830648/.
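
To make the "reinvent vma-s" point a bit more concrete, here is a
minimal user-space sketch of the bookkeeping a supervisor would need if
guest address spaces had no mm_struct of their own: a table that maps
guest ranges to the supervisor addresses of the same memory. The struct
and function names are hypothetical and exist only for illustration.
They are not part of any kernel API or of this patch set, and a real
implementation would also have to track permissions, backing files,
splits and merges.

```c
#include <stddef.h>
#include <stdint.h>

/* One mirrored range: a much-reduced, hypothetical stand-in for a VMA. */
struct mirror_region {
	uintptr_t guest_start;	/* start address in the guest address space */
	uintptr_t sup_start;	/* address of the same memory in the supervisor */
	size_t len;		/* length of the range in bytes */
};

/*
 * Translate a guest address into the matching supervisor address.
 * The table must be sorted by guest_start and must not overlap.
 * Returns 0 if the guest address is not covered by any region.
 */
static uintptr_t guest_to_sup(const struct mirror_region *r, size_t n,
			      uintptr_t guest_addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (guest_addr < r[mid].guest_start)
			hi = mid;
		else if (guest_addr - r[mid].guest_start >= r[mid].len)
			lo = mid + 1;
		else
			return r[mid].sup_start +
			       (guest_addr - r[mid].guest_start);
	}
	return 0;
}
```

Even this toy version has to be kept in sync on every map and unmap in
every guest address space, which is the work the existing VMA code
already does.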

Second, each pagetable is tied to an mm_struct. You suggest creating
new pagetables that will not have their own mm_struct-s (sorry if I
misunderstood something). I am not sure that it will be easy to
implement. How many corner cases will there be?

As for page faults in a secondary address space, we will need to find
the fault address in the main address space, handle the fault there, and
then mirror the PTE to the secondary pagetable. Effectively, it means
that page faults will be handled in two address spaces. Right now, we
use memfd and shared mappings. It means that each fault is handled in
only one address space, and we map a guest memory region into the
supervisor address space only when we need to access it. A large portion
of guest anonymous memory is never mapped into the supervisor address
space. Will the overhead of mirrored address spaces be smaller than that
of memfd shared mappings? I am not sure.
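
For readers who have not seen the memfd scheme, here is a small sketch
of the idea using only standard Linux calls (memfd_create, ftruncate,
mmap). It is a single-process illustration, not the sandbox code: the
two mmap() calls stand in for the guest view and the supervisor view,
which in a real sandbox live in different processes. The function name
and the "guest-mem" label are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one memfd-backed page twice and check that data written through
 * the "guest" view is visible through the "supervisor" view.
 * Returns 0 on success, -1 on failure.
 */
static int memfd_two_views(void)
{
	const char msg[] = "guest data";
	long page = sysconf(_SC_PAGESIZE);
	char *guest, *sup;
	int ret = -1;
	int fd;

	if (page <= 0)
		return -1;

	fd = memfd_create("guest-mem", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0)
		goto out_close;

	/* The view that would live in the guest (stub) address space. */
	guest = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (guest == MAP_FAILED)
		goto out_close;

	/* Touching the page faults it in through this mapping only. */
	memcpy(guest, msg, sizeof(msg));

	/* A second view, created only when the supervisor needs access. */
	sup = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
	if (sup == MAP_FAILED)
		goto out_unmap_guest;

	assert(sup != guest);	/* same pages, different addresses */
	if (memcmp(sup, msg, sizeof(msg)) == 0)
		ret = 0;

	munmap(sup, page);
out_unmap_guest:
	munmap(guest, page);
out_close:
	close(fd);
	return ret;
}
```

Regions that the supervisor never maps cost it nothing, which is the
property a mirrored-pagetable design would have to match.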

Third, this approach will not remove the need for process_vm_exec. We
will still need to switch to a guest address space with a specified
state and switch back on faults or syscalls. If the main concern is the
ability to run syscalls on a remote mm, we can think about how to fix
this. I see two things we can do here:

* Specify the exact list of system calls that are allowed. The first
three candidates are mmap, munmap, and vmsplice.

* Instead of allowing us to run system calls, we can implement this in
the form of commands. In the case of sandboxes, we need to implement
only two commands to create and destroy memory mappings in a target
address space.
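
A rough user-space sketch of what the two options could look like. The
syscall numbers come from <sys/syscall.h>; everything else (the
function names, enum remote_mm_cmd, struct remote_mm_req and its
fields) is hypothetical and only meant to show the shape of the
interface, not a proposed ABI.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Option 1: only an exact list of syscalls may run on a remote mm. */
static bool remote_syscall_allowed(long nr)
{
	static const long allowed[] = { SYS_mmap, SYS_munmap, SYS_vmsplice };
	size_t i;

	for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
		if (allowed[i] == nr)
			return true;
	return false;
}

/* Option 2: two commands instead of syscalls (hypothetical names). */
enum remote_mm_cmd {
	REMOTE_MM_MAP,		/* create a mapping in the target mm */
	REMOTE_MM_UNMAP,	/* destroy a mapping in the target mm */
};

struct remote_mm_req {
	enum remote_mm_cmd cmd;
	unsigned long addr;	/* target address, page aligned */
	unsigned long len;	/* length in bytes, page aligned */
	int prot;		/* REMOTE_MM_MAP only */
	int fd;			/* REMOTE_MM_MAP only, e.g. a memfd */
	unsigned long offset;	/* REMOTE_MM_MAP only */
};

/*
 * Basic sanity checks on a request: known command, non-empty range,
 * no address wrap-around, page-aligned address and length.
 */
static bool remote_mm_req_valid(const struct remote_mm_req *req,
				unsigned long page_size)
{
	if (req->cmd != REMOTE_MM_MAP && req->cmd != REMOTE_MM_UNMAP)
		return false;
	if (req->len == 0 || req->addr + req->len < req->addr)
		return false;
	return req->addr % page_size == 0 && req->len % page_size == 0;
}
```

The command form keeps the attack surface to exactly the operations a
sandbox needs, while the allowlist form reuses the existing syscall
argument conventions.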

Thanks,
Andrei
