linux-mm.kvack.org archive mirror
* Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall
       [not found]   ` <CAG48ez3UrzPE8rkucTgCu8ggcTEjx_h3Gj2FES1qM-uv2KD8bQ@mail.gmail.com>
@ 2021-07-02  6:22     ` Andrei Vagin
  2021-07-02 11:51       ` Jann Horn
  0 siblings, 1 reply; 6+ messages in thread
From: Andrei Vagin @ 2021-07-02  6:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, linux-api, linux-um, criu, avagin, Andrew Morton,
	Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra, Richard Weinberger,
	Thomas Gleixner, linux-mm

On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > This change introduces the new system call:
> > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> >                 siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >
> > process_vm_exec allows executing the current process in the address
> > space of another process.
> [...]
> 
> I still think that this whole API is fundamentally the wrong approach
> because it tries to shoehorn multiple usecases with different
> requirements into a single API. But that aside:

Here, I can't agree with you, but this is discussed in the parallel
thread.

> 
> > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > +{
> > +       struct task_struct *tsk = current;
> > +       struct mm_struct *active_mm;
> > +
> > +       task_lock(tsk);
> > +       /* Hold off tlb flush IPIs while switching mm's */
> > +       local_irq_disable();
> > +
> > +       sync_mm_rss(prev_mm);
> > +
> > +       vmacache_flush(tsk);
> > +
> > +       active_mm = tsk->active_mm;
> > +       if (active_mm != target_mm) {
> > +               mmgrab(target_mm);
> > +               tsk->active_mm = target_mm;
> > +       }
> > +       tsk->mm = target_mm;
> 
> I'm pretty sure you're not currently allowed to overwrite the ->mm
> pointer of a userspace thread. For example, zap_threads() assumes that
> all threads running under a process have the same ->mm. (And if you're
> fiddling with ->mm stuff, you should probably CC linux-mm@.)
> 
> As far as I understand, only kthreads are allowed to do this (as
> implemented in kthread_use_mm()).

kthread_use_mm() was renamed from use_mm() in the v5.8 kernel. Before
that, it wasn't used for user processes in the kernel, but it was
exported for modules, and we used it without any visible problems. We
understood that there could be issues like the zap_threads() one, and
that was one of the reasons why we decided to introduce this system call.

I understand that there are no places in the kernel where we change the
mm of user threads back and forth, but are there any real concerns about
why we should not do that? I agree that zap_threads() should be fixed,
but that will be an easy one.

> 
> > +       switch_mm_irqs_off(active_mm, target_mm, tsk);
> > +       local_irq_enable();
> > +       task_unlock(tsk);
> > +#ifdef finish_arch_post_lock_switch
> > +       finish_arch_post_lock_switch();
> > +#endif
> > +
> > +       if (active_mm != target_mm)
> > +               mmdrop(active_mm);
> > +}



* Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
       [not found] ` <CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com>
@ 2021-07-02  6:57   ` Andrei Vagin
  2021-07-02 15:12     ` Jann Horn
  0 siblings, 1 reply; 6+ messages in thread
From: Andrei Vagin @ 2021-07-02  6:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Linux API, linux-um, criu, avagin, Andrew Morton,
	Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra, Richard Weinberger,
	Thomas Gleixner, linux-mm

On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > a process's memory faster than we can with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
> 
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...
> 
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sorts of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but process_vm_exec can
> > significantly speed them up.
> 
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
> 
> 1. mm_struct-based:
>       a set of syscalls to create a new mm_struct,
>       change memory mappings under that mm_struct, and switch to it

I like the idea of having a handle for an mm. Instead of a pid, we will
pass this handle to process_vm_exec. We have pidfd for processes, and we
can introduce mmfd for mm_struct.


> 2. pagetable-mirroring-based:
>       like /dev/kvm, an API to create a new pagetable, mirror parts of
>       the mm_struct's pagetables over into it with modified permissions
>       (like KVM_SET_USER_MEMORY_REGION),
>       and run code under that context.
>       page fault handling would first handle the fault against mm->pgd
>       as normal, then mirror the PTE over into the secondary pagetables.
>       invalidation could be handled with MMU notifiers.
>

I found this idea interesting and decided to look at it more closely.
After reading the kernel code for a few days, I realized that it would
not be easy to implement something like this, but more importantly, I
don’t understand what problem it solves. Will it simplify the
user-space code? I don’t think so. Will it improve performance? That is
unclear to me too.

First, in the KVM case, we have a few big linear mappings and need to
support one “shadow” address space. In the case of sandboxes, we can
have a tremendous number of mappings and many address spaces that we
need to manage. Memory mappings will be mapped at different addresses
in the supervisor address space and the “guest” address spaces. If guest
address spaces do not have their own mm_structs, we will need to reinvent
vma-s in some form. If guest address spaces have mm_structs, this will
look similar to https://lwn.net/Articles/830648/.

Second, each pagetable is tied to an mm_struct. You suggest creating
new pagetables that will not have their own mm_struct-s (sorry if I
misunderstood something). I am not sure that it will be easy to
implement. How many corner cases will there be?

As for page faults in a secondary address space, we will need to find
the fault address in the main address space, handle the fault there, and
then mirror the PTE to the secondary pagetable. Effectively, it means
that page faults will be handled in two address spaces. Right now, we
use memfd and shared mappings. It means that each fault is handled in
only one address space, and we map a guest memory region to the
supervisor address space only when we need to access it. A large portion
of guest anonymous memory is never mapped to the supervisor address
space. Will the overhead of mirrored address spaces be smaller than that
of memfd shared mappings? I am not sure.

Third, this approach will not get rid of the need for process_vm_exec.
We will still need to switch to a guest address space with a specified
state and switch back on faults or syscalls. If the main concern is the
ability to run syscalls on a remote mm, we can think about how to fix
this. I see two ways we could go here:

* Specify the exact list of system calls that are allowed. The first
three candidates are mmap, munmap, and vmsplice.

* Instead of allowing arbitrary system calls, we can implement this in
the form of commands. In the case of sandboxes, we need to implement
only two commands, to create and destroy memory mappings in a target
address space.

Thanks,
Andrei



* Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall
  2021-07-02  6:22     ` [PATCH 2/4] arch/x86: implement the process_vm_exec syscall Andrei Vagin
@ 2021-07-02 11:51       ` Jann Horn
  2021-07-02 20:40         ` Andy Lutomirski
  0 siblings, 1 reply; 6+ messages in thread
From: Jann Horn @ 2021-07-02 11:51 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: linux-kernel, linux-api, linux-um, criu, avagin, Andrew Morton,
	Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra, Richard Weinberger,
	Thomas Gleixner, linux-mm

On Fri, Jul 2, 2021 at 8:25 AM Andrei Vagin <avagin@gmail.com> wrote:
> On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > > +{
> > > +       struct task_struct *tsk = current;
> > > +       struct mm_struct *active_mm;
> > > +
> > > +       task_lock(tsk);
> > > +       /* Hold off tlb flush IPIs while switching mm's */
> > > +       local_irq_disable();
> > > +
> > > +       sync_mm_rss(prev_mm);
> > > +
> > > +       vmacache_flush(tsk);
> > > +
> > > +       active_mm = tsk->active_mm;
> > > +       if (active_mm != target_mm) {
> > > +               mmgrab(target_mm);
> > > +               tsk->active_mm = target_mm;
> > > +       }
> > > +       tsk->mm = target_mm;
> >
> > I'm pretty sure you're not currently allowed to overwrite the ->mm
> > pointer of a userspace thread. For example, zap_threads() assumes that
> > all threads running under a process have the same ->mm. (And if you're
> > fiddling with ->mm stuff, you should probably CC linux-mm@.)
> >
> > As far as I understand, only kthreads are allowed to do this (as
> > implemented in kthread_use_mm()).
>
> kthread_use_mm() was renamed from use_mm() in the v5.8 kernel. Before
> that, it wasn't used for user processes in the kernel, but it was
> exported for modules, and we used it without any visible problems. We
> understood that there could be issues like the zap_threads() one, and
> that was one of the reasons why we decided to introduce this system call.
>
> I understand that there are no places in the kernel where we change the
> mm of user threads back and forth, but are there any real concerns about
> why we should not do that? I agree that zap_threads() should be fixed,
> but that will be an easy one.

My point is that if you break a preexisting assumption like this,
you'll have to go through the kernel and search for places that rely
on this assumption, and fix them up, which may potentially require
thinking about what kinds of semantics would actually be appropriate
there. Like the MCE killing logic (collect_procs_anon() and such). And
current_is_single_threaded(), in which the current patch probably
leads to logic security bugs. And __uprobe_perf_filter(). Before my
refactoring of the ELF coredump logic in kernel 5.10 (commit
b2767d97f5ff75 and the ones before it), you'd have also probably
created memory corruption bugs in races between elf_core_dump() and
syscalls like mmap()/munmap(). (Note that this is not necessarily an
exhaustive list.)



* Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
  2021-07-02  6:57   ` [PATCH 0/4 POC] Allow executing code and syscalls in another address space Andrei Vagin
@ 2021-07-02 15:12     ` Jann Horn
  2021-07-18  0:38       ` Andrei Vagin
  0 siblings, 1 reply; 6+ messages in thread
From: Jann Horn @ 2021-07-02 15:12 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: kernel list, Linux API, linux-um, criu, avagin, Andrew Morton,
	Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra, Richard Weinberger,
	Thomas Gleixner, linux-mm

On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <avagin@gmail.com> wrote:
> On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > > We already have process_vm_readv and process_vm_writev to read and write
> > > a process's memory faster than we can with ptrace. And now it
> > > is time for process_vm_exec that allows executing code in an address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > >
> > > = Use-cases =
> >
> > It seems to me like your proposed API doesn't really fit either one of
> > those usecases well...
> >
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes that
> > > are used to manage guest address spaces. Guest code is executed in the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sorts of sandboxes use
> > > PTRACE_SYSEMU to trap system calls, but process_vm_exec can
> > > significantly speed them up.
> >
> > In this case, since you really only want an mm_struct to run code
> > under, it seems weird to create a whole task with its own PID and so
> > on. It seems to me like something similar to the /dev/kvm API would be
> > more appropriate here? Implementation options that I see for that
> > would be:
> >
> > 1. mm_struct-based:
> >       a set of syscalls to create a new mm_struct,
> >       change memory mappings under that mm_struct, and switch to it
>
> I like the idea of having a handle for an mm. Instead of a pid, we will
> pass this handle to process_vm_exec. We have pidfd for processes, and we
> can introduce mmfd for mm_struct.

I personally think that it might be quite unwieldy when it comes to
the restrictions you get from trying to have shared memory with the
owning process - I'm having trouble figuring out how you can implement
copy-on-write semantics without relying on copy-on-write logic in the
host OS and without being able to use userfaultfd.

But if that's not a problem somehow, and you can find some reasonable
way to handle memory usage accounting and fix up everything that
assumes that multithreaded userspace threads don't switch ->mm, I
guess this might work for your usecase.

> > 2. pagetable-mirroring-based:
> >       like /dev/kvm, an API to create a new pagetable, mirror parts of
> >       the mm_struct's pagetables over into it with modified permissions
> >       (like KVM_SET_USER_MEMORY_REGION),
> >       and run code under that context.
> >       page fault handling would first handle the fault against mm->pgd
> >       as normal, then mirror the PTE over into the secondary pagetables.
> >       invalidation could be handled with MMU notifiers.
> >
>
> I found this idea interesting and decided to look at it more closely.
> After reading the kernel code for a few days, I realized that it would
> not be easy to implement something like this,

Yeah, it might need architecture-specific code to flip the page tables
on userspace entry/exit, and maybe also for mirroring them. And for
the TLB flushing logic...

> but more importantly,
> I don’t understand what problem it solves. Will it simplify the
> user-space code? I don’t think so. Will it improve performance? That is
> unclear to me too.

Some reasons I can think of are:

 - direct guest memory access: I imagined you'd probably want to be able to
   directly access userspace memory from the supervisor, and
   with this approach that'd become easy.

 - integration with on-demand paging of the host OS: You'd be able to
   create things like file-backed copy-on-write mappings from the
   host filesystem, or implement your own mappings backed by some kind
   of storage using userfaultfd.

 - sandboxing: For sandboxing usecases (not your usecase), it would be
   possible to e.g. create a read-only clone of the entire address space of a
   process and give write access to specific parts of it, or something
   like that.
   These address space clones could potentially be created and destroyed
   fairly quickly.

 - accounting: memory usage would be automatically accounted to the
   supervisor process, so even without a parasite process, you'd be able
   to see the memory usage correctly in things like "top".

 - small (non-pageable) memory footprint in the host kernel:
   The only things the host kernel would have to persistently store would be
   the normal MM data structures for the supervisor plus the mappings
   from "guest userspace" memory ranges to supervisor memory ranges;
   userspace pagetables would be discardable, and could even be shared
   with those of the supervisor in cases where the alignment fits.
   So with this, large anonymous mappings with 4K granularity only cost you
   ~0.20% overhead across host and guest address space; without this, if you
   used shared mappings instead, you'd pay twice that for every 2MiB range
   from which parts are accessed in both contexts, plus probably another
   ~0.2% or so for the "struct address_space"?

 - all memory-management-related syscalls could be directly performed
   in the "kernel" process

But yeah, some of those aren't really relevant for your usecase, and I
guess things like the accounting aspect could just as well be solved
differently...

> First, in the KVM case, we have a few big linear mappings and need to
> support one “shadow” address space. In the case of sandboxes, we can
> have a tremendous number of mappings and many address spaces that we
> need to manage. Memory mappings will be mapped at different addresses
> in the supervisor address space and the “guest” address spaces. If guest
> address spaces do not have their own mm_structs, we will need to reinvent
> vma-s in some form. If guest address spaces have mm_structs, this will
> look similar to https://lwn.net/Articles/830648/.
>
> Second, each pagetable is tied to an mm_struct. You suggest creating
> new pagetables that will not have their own mm_struct-s (sorry if I
> misunderstood something).

Yeah, that's what I had in mind, page tables without an mm_struct.

> I am not sure that it will be easy to
> implement. How many corner cases will there be?

Yeah, it would require some work around TLB flushing and entry/exit
from userspace. But from a high-level perspective it feels to me like
a change with less systematic impact. Maybe I'm wrong about that.

> As for page faults in a secondary address space, we will need to find
> the fault address in the main address space, handle the fault there, and
> then mirror the PTE to the secondary pagetable.

Right.

> Effectively, it means that
> page faults will be handled in two address spaces. Right now, we use
> memfd and shared mappings. It means that each fault is handled in only
> one address space, and we map a guest memory region to the supervisor
> address space only when we need to access it. A large portion of guest
> anonymous memory is never mapped to the supervisor address space.
> Will the overhead of mirrored address spaces be smaller than that of
> memfd shared mappings? I am not sure.

But as long as the mappings are sufficiently big and aligned properly,
or you explicitly manage the supervisor address space, some of that
cost disappears: E.g. even if a page is mapped in both address spaces,
you wouldn't have a memory cost for the second mapping if the page
tables are shared.

> Third, this approach will not get rid of the need for process_vm_exec.
> We will still need to switch to a guest address space with a specified
> state and switch back on faults or syscalls.

Yeah, you'd still need a syscall for running code under a different
set of page tables. But that's something that KVM _almost_ already
does.

> If the main concern is the
> ability to run syscalls on a remote mm, we can think about how to fix
> this. I see two ways we could go here:
>
> * Specify the exact list of system calls that are allowed. The first
> three candidates are mmap, munmap, and vmsplice.
>
> * Instead of allowing arbitrary system calls, we can implement this in
> the form of commands. In the case of sandboxes, we need to implement
> only two commands, to create and destroy memory mappings in a target
> address space.

FWIW, there is precedent for something similar: The Android folks
already added process_madvise() for remotely messing with the VMAs of
another process to some degree.



* Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall
  2021-07-02 11:51       ` Jann Horn
@ 2021-07-02 20:40         ` Andy Lutomirski
  0 siblings, 0 replies; 6+ messages in thread
From: Andy Lutomirski @ 2021-07-02 20:40 UTC (permalink / raw)
  To: Jann Horn, Andrei Vagin
  Cc: Linux Kernel Mailing List, Linux API, linux-um, criu, avagin,
	Andrew Morton, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra (Intel),
	Richard Weinberger, Thomas Gleixner, linux-mm



On Fri, Jul 2, 2021, at 4:51 AM, Jann Horn wrote:
> On Fri, Jul 2, 2021 at 8:25 AM Andrei Vagin <avagin@gmail.com> wrote:
> > On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > > > +{
> > > > +       struct task_struct *tsk = current;
> > > > +       struct mm_struct *active_mm;
> > > > +
> > > > +       task_lock(tsk);
> > > > +       /* Hold off tlb flush IPIs while switching mm's */
> > > > +       local_irq_disable();
> > > > +
> > > > +       sync_mm_rss(prev_mm);
> > > > +
> > > > +       vmacache_flush(tsk);
> > > > +
> > > > +       active_mm = tsk->active_mm;
> > > > +       if (active_mm != target_mm) {
> > > > +               mmgrab(target_mm);
> > > > +               tsk->active_mm = target_mm;
> > > > +       }
> > > > +       tsk->mm = target_mm;
> > >
> > > I'm pretty sure you're not currently allowed to overwrite the ->mm
> > > pointer of a userspace thread. For example, zap_threads() assumes that
> > > all threads running under a process have the same ->mm. (And if you're
> > > fiddling with ->mm stuff, you should probably CC linux-mm@.)
> > >
> > > As far as I understand, only kthreads are allowed to do this (as
> > > implemented in kthread_use_mm()).
> >
> > kthread_use_mm() was renamed from use_mm() in the v5.8 kernel. Before
> > that, it wasn't used for user processes in the kernel, but it was
> > exported for modules, and we used it without any visible problems. We
> > understood that there could be issues like the zap_threads() one, and
> > that was one of the reasons why we decided to introduce this system call.
> >
> > I understand that there are no places in the kernel where we change the
> > mm of user threads back and forth, but are there any real concerns about
> > why we should not do that? I agree that zap_threads() should be fixed,
> > but that will be an easy one.
> 
> My point is that if you break a preexisting assumption like this,
> you'll have to go through the kernel and search for places that rely
> on this assumption, and fix them up, which may potentially require
> thinking about what kinds of semantics would actually be appropriate
> there. Like the MCE killing logic (collect_procs_anon() and such). And
> current_is_single_threaded(), in which the current patch probably
> leads to logic security bugs. And __uprobe_perf_filter(). Before my
> refactoring of the ELF coredump logic in kernel 5.10 (commit
> b2767d97f5ff75 and the ones before it), you'd have also probably
> created memory corruption bugs in races between elf_core_dump() and
> syscalls like mmap()/munmap(). (Note that this is not necessarily an
> exhaustive list.)
> 

There’s nmi_uaccess_okay(), and its callers assume that, when a task is perf tracing itself, an event on that task with nmi_uaccess_okay() means that uaccess will access that task’s memory.

Core dump code probably expects that dumping memory will access the correct mm.

I cannot fathom why any kind of remote vm access would touch FPU state at all.

What PKRU value is supposed to be used when doing mm swap shenanigans?  How about PASID?

What happens if one task attempts to issue a KVM ioctl while its mm is swapped?



* Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
  2021-07-02 15:12     ` Jann Horn
@ 2021-07-18  0:38       ` Andrei Vagin
  0 siblings, 0 replies; 6+ messages in thread
From: Andrei Vagin @ 2021-07-18  0:38 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Linux API, linux-um, criu, avagin, Andrew Morton,
	Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra, Richard Weinberger,
	Thomas Gleixner, linux-mm

On Fri, Jul 02, 2021 at 05:12:02PM +0200, Jann Horn wrote:
> On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <avagin@gmail.com> wrote:
> > On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > > > We already have process_vm_readv and process_vm_writev to read and write
> > > > a process's memory faster than we can with ptrace. And now it
> > > > is time for process_vm_exec that allows executing code in an address
> > > > space of another process. We can do this with ptrace but it is much
> > > > slower.
> > > >
> > > > = Use-cases =
> > >
> > > It seems to me like your proposed API doesn't really fit either one of
> > > those usecases well...
> > >
> > > > Here are two known use-cases. The first one is “application kernel”
> > > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > > process that runs the sandbox kernel and a set of stub processes that
> > > > are used to manage guest address spaces. Guest code is executed in the
> > > > context of stub processes but all system calls are intercepted and
> > > > handled in the sandbox kernel. Right now, these sorts of sandboxes use
> > > > PTRACE_SYSEMU to trap system calls, but process_vm_exec can
> > > > significantly speed them up.
> > >
> > > In this case, since you really only want an mm_struct to run code
> > > under, it seems weird to create a whole task with its own PID and so
> > > on. It seems to me like something similar to the /dev/kvm API would be
> > > more appropriate here? Implementation options that I see for that
> > > would be:
> > >
> > > 1. mm_struct-based:
> > >       a set of syscalls to create a new mm_struct,
> > >       change memory mappings under that mm_struct, and switch to it
> >
> > I like the idea of having a handle for an mm. Instead of a pid, we will
> > pass this handle to process_vm_exec. We have pidfd for processes, and we
> > can introduce mmfd for mm_struct.
> 
> I personally think that it might be quite unwieldy when it comes to
> the restrictions you get from trying to have shared memory with the
> owning process - I'm having trouble figuring out how you can implement
> copy-on-write semantics without relying on copy-on-write logic in the
> host OS and without being able to use userfaultfd.

It is easy. COW mappings are mapped into guest address spaces without
write permission. If one of the processes wants to write something, it
triggers a fault that is handled in the Sentry (supervisor/kernel).

> 
> But if that's not a problem somehow, and you can find some reasonable
> way to handle memory usage accounting and fix up everything that
> assumes that multithreaded userspace threads don't switch ->mm, I
> guess this might work for your usecase.
> 
> > > 2. pagetable-mirroring-based:
> > >       like /dev/kvm, an API to create a new pagetable, mirror parts of
> > >       the mm_struct's pagetables over into it with modified permissions
> > >       (like KVM_SET_USER_MEMORY_REGION),
> > >       and run code under that context.
> > >       page fault handling would first handle the fault against mm->pgd
> > >       as normal, then mirror the PTE over into the secondary pagetables.
> > >       invalidation could be handled with MMU notifiers.
> > >
> >
> > I found this idea interesting and decided to look at it more closely.
> > After reading the kernel code for a few days, I realized that it would
> > not be easy to implement something like this,
> 
> Yeah, it might need architecture-specific code to flip the page tables
> on userspace entry/exit, and maybe also for mirroring them. And for
> the TLB flushing logic...
> 
> > but more importantly,
> > I don’t understand what problem it solves. Will it simplify the
> > user-space code? I don’t think so. Will it improve performance? That is
> > unclear to me too.
> 
> Some reasons I can think of are:
> 
>  - direct guest memory access: I imagined you'd probably want to be able to
>    directly access userspace memory from the supervisor, and
>    with this approach that'd become easy.

Right now, we use shared memory regions for that, and they work fine. As
I already mentioned, most of the guest memory is never mapped into the
supervisor address space.

> 
>  - integration with on-demand paging of the host OS: You'd be able to
>    create things like file-backed copy-on-write mappings from the
>    host filesystem, or implement your own mappings backed by some kind
>    of storage using userfaultfd.

This isn't a problem either...

> 
>  - sandboxing: For sandboxing usecases (not your usecase), it would be
>    possible to e.g. create a read-only clone of the entire address space of a
>    process and give write access to specific parts of it, or something
>    like that.
>    These address space clones could potentially be created and destroyed
>    fairly quickly.

This is a very valid example, and I would assume this is where your idea
was coming from. I have some doubts about the idea of additional
sub-page-tables in the kernel, but I know a good way to implement
your idea with KVM. You can look at how the KVM platform is implemented
in gVisor; this sort of sandboxing can be implemented in the same way.

In a few words, we create a KVM virtual machine, replicate the process
address space in the guest ring0, and implement basic operating
system-level stubs, so that the process can jump between the host ring3
and the guest ring0.

https://github.com/google/gvisor/blob/master/pkg/ring0/
https://github.com/google/gvisor/tree/master/pkg/sentry/platform/kvm

When we have all these bits, we can create any page tables for a guest
ring3 and run untrusted code there. The sandbox process switches to the
guest ring0 and then switches to a guest ring3 with the specified page
tables and state.

https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/kvm/machine_amd64.go;l=356

With this scheme, the sandbox process will have direct access to page
tables and will be able to change them.

> 
>  - accounting: memory usage would be automatically accounted to the
>    supervisor process, so even without a parasite process, you'd be able
>    to see the memory usage correctly in things like "top".
> 
>  - small (non-pageable) memory footprint in the host kernel:
>    The only things the host kernel would have to persistently store would be
>    the normal MM data structures for the supervisor plus the mappings
>    from "guest userspace" memory ranges to supervisor memory ranges;
>    userspace pagetables would be discardable, and could even be shared
>    with those of the supervisor in cases where the alignment fits.
>    So with this, large anonymous mappings with 4K granularity only cost you
>    ~0.20% overhead across host and guest address space; without this, if you
>    used shared mappings instead, you'd pay twice that for every 2MiB range
>    from which parts are accessed in both contexts, plus probably another
>    ~0.2% or so for the "struct address_space"?

If we use shared mappings, most of the guest memory is never mapped into
the supervisor address space, so there are no page tables for it there. I
would say it is an open question which of the two approaches ends up with
the smaller memory footprint...

> 
>  - all memory-management-related syscalls could be directly performed
>    in the "kernel" process
> 
> But yeah, some of those aren't really relevant for your usecase, and I
> guess things like the accounting aspect could just as well be solved
> differently...
> 
> > First, in the KVM case, we have a few big linear mappings and need to
> > support one “shadow” address space. In the case of sandboxes, we can
> > have a tremendous amount of mappings and many address spaces that we
> > need to manage.  Memory mappings will be mapped with different addresses
> > in a supervisor address space and “guest” address spaces. If guest
> > address spaces will not have their mm_structs, we will need to reinvent
> > vma-s in some form. If guest address spaces have mm_structs, this will
> > look similar to https://lwn.net/Articles/830648/.
> >
> > Second, each pagetable is tied up with mm_stuct. You suggest creating
> > new pagetables that will not have their mm_struct-s (sorry if I
> > misunderstood something).
> 
> Yeah, that's what I had in mind, page tables without an mm_struct.
> 
> > I am not sure that it will be easy to
> > implement. How many corner cases will be there?
> 
> Yeah, it would require some work around TLB flushing and entry/exit
> from userspace. But from a high-level perspective it feels to me like
> a change with less systematic impact. Maybe I'm wrong about that.
> 
> > As for page faults in a secondary address space, we will need to find a
> > fault address in the main address space, handle the fault there and then
> > mirror the PTE to the secondary pagetable.
> 
> Right.
> 
> > Effectively, it means that
> > page faults will be handled in two address spaces. Right now, we use
> > memfd and shared mappings. It means that each fault is handled only in
> > one address space, and we map a guest memory region to the supervisor
> > address space only when we need to access it. A large portion of guest
> > anonymous memory is never mapped to the supervisor address space.
> > Will an overhead of mirrored address spaces be smaller than memfd shared
> > mappings? I am not sure.
> 
> But as long as the mappings are sufficiently big and aligned properly,
> or you explicitly manage the supervisor address space, some of that
> cost disappears: E.g. even if a page is mapped in both address spaces,
> you wouldn't have a memory cost for the second mapping if the page
> tables are shared.

You are right. It is an interesting question how many PTEs would actually
end up shared. For example, if a guest process forks a child, all of its
anonymous memory becomes copy-on-write; that means we have to clear the W
bit in the PTEs, and so we end up allocating separate PTEs for both
processes anyway...

> 
> > Third, this approach will not get rid of having process_vm_exec. We will
> > need to switch to a guest address space with a specified state and
> > switch back on faults or syscalls.
> 
> Yeah, you'd still need a syscall for running code under a different
> set of page tables. But that's something that KVM _almost_ already
> does.

I don't understand this analogy with KVM...

> 
> > If the main concern is the ability to
> > run syscalls on a remote mm, we can think about how to fix this. I see
> > two ways what we can do here:
> >
> > * Specify the exact list of system calls that are allowed. The first
> > three candidates are mmap, munmap, and vmsplice.
> >
> > * Instead of allowing us to run system calls, we can implement this in
> > the form of commands. In the case of sandboxes, we need to implement
> > only two commands to create and destroy memory mappings in a target
> > address space.
> 
> FWIW, there is precedent for something similar: The Android folks
> already added process_madvise() for remotely messing with the VMAs of
> another process to some degree.

I know. We tried to implement process_vm_mmap and process_vm_splice:

https://lkml.org/lkml/2018/1/9/32
https://patchwork.kernel.org/project/linux-mm/cover/155836064844.2441.10911127801797083064.stgit@localhost.localdomain/

Thanks,
Andrei



     [not found] <20210414055217.543246-1-avagin@gmail.com>
     [not found] ` <20210414055217.543246-3-avagin@gmail.com>
     [not found]   ` <CAG48ez3UrzPE8rkucTgCu8ggcTEjx_h3Gj2FES1qM-uv2KD8bQ@mail.gmail.com>
2021-07-02  6:22     ` [PATCH 2/4] arch/x86: implement the process_vm_exec syscall Andrei Vagin
2021-07-02 11:51       ` Jann Horn
2021-07-02 20:40         ` Andy Lutomirski
     [not found] ` <CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com>
2021-07-02  6:57   ` [PATCH 0/4 POC] Allow executing code and syscalls in another address space Andrei Vagin
2021-07-02 15:12     ` Jann Horn
2021-07-18  0:38       ` Andrei Vagin
