From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20210414055217.543246-1-avagin@gmail.com>
In-Reply-To: <20210414055217.543246-1-avagin@gmail.com>
From: Jann Horn
Date: Wed, 14 Apr 2021 08:46:40 +0200
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
To: Andrei Vagin
Cc: kernel list, Linux API, linux-um@lists.infradead.org, criu@openvz.org,
 avagin@google.com, Andrew Morton, Andy Lutomirski, Anton Ivanov,
 Christian Brauner, Dmitry Safonov <0x7f454c46@gmail.com>, Ingo Molnar,
 Jeff Dike, Mike Rapoport, Michael Kerrisk, Oleg Nesterov, Peter Zijlstra,
 Richard Weinberger, Thomas Gleixner
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace.
> And now it is time for process_vm_exec that allows executing code in an
> address space of another process. We can do this with ptrace but it is
> much slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based:
      a set of syscalls to create a new mm_struct,
      change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based:
      like /dev/kvm, an API to create a new pagetable, mirror parts of
      the mm_struct's pagetables over into it with modified permissions
      (like KVM_SET_USER_MEMORY_REGION),
      and run code under that context.
      page fault handling would first handle the fault against mm->pgd
      as normal, then mirror the PTE over into the secondary pagetables.
      invalidation could be handled with MMU notifiers.

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory? That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting a parasite code and it will allow
> pre-dump memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process is continue
> running. On each interaction we dump memory that has been changed from
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?
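
To make the last point concrete, here is a rough sketch of what a dumper
can already do today without injecting anything: read a range of the
target's memory through /proc/$pid/mem and push it into a pipe. The
helper names (read_process_memory, copy_memory_into_pipe) are made up
for illustration and are not part of CRIU or of any proposed API. Note
that this costs two copies (target memory to a userspace buffer, then
buffer to the pipe); avoiding those copies is what splice support on
/proc/$pid/mem, or a splicing variant of process_vm_readv(), would be
for.

```python
import os


def read_process_memory(pid: int, address: int, length: int) -> bytes:
    """Read `length` bytes at virtual `address` from the address space of `pid`.

    Uses /proc/<pid>/mem, so the caller needs ptrace-style access to the
    target (reading the caller's own memory is always allowed).
    """
    fd = os.open(f"/proc/{pid}/mem", os.O_RDONLY)
    try:
        # The file offset of /proc/<pid>/mem is the virtual address.
        return os.pread(fd, length, address)
    finally:
        os.close(fd)


def copy_memory_into_pipe(pid: int, address: int, length: int,
                          pipe_write_fd: int) -> int:
    """Copy a memory range of `pid` into a pipe; return the bytes written.

    Hypothetical helper: copy 1 is the pread() into a userspace buffer,
    copy 2 is the write() into the pipe. Keep `length` below the pipe
    capacity, otherwise write() blocks until a reader drains the pipe.
    """
    data = read_process_memory(pid, address, length)
    return os.write(pipe_write_fd, data)
```

For a quick check this can be pointed at the calling process itself,
e.g. with the address of a ctypes buffer and os.getpid() as the pid; for
another process, the address ranges would come from /proc/$pid/maps.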