From: Jann Horn
Date: Fri, 2 Jul 2021 17:12:02 +0200
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
To: Andrei Vagin
Cc: kernel list, Linux API, linux-um@lists.infradead.org, criu@openvz.org,
    avagin@google.com, Andrew Morton, Andy Lutomirski, Anton Ivanov,
    Christian Brauner, Dmitry Safonov <0x7f454c46@gmail.com>, Ingo Molnar,
    Jeff Dike, Mike Rapoport, Michael Kerrisk, Oleg Nesterov, Peter Zijlstra,
    Richard Weinberger, Thomas Gleixner, linux-mm@kvack.org
References: <20210414055217.543246-1-avagin@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin wrote:
> On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin wrote:
> > > We already have
> > > process_vm_readv and process_vm_writev to read and write
> > > to a process's memory faster than we can do this with ptrace. And now it
> > > is time for process_vm_exec that allows executing code in an address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > >
> > > = Use-cases =
> >
> > It seems to me like your proposed API doesn't really fit either one of
> > those usecases well...
> >
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes that
> > > are used to manage guest address spaces. Guest code is executed in the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sorts of sandboxes use
> > > PTRACE_SYSEMU to trap system calls, but process_vm_exec can
> > > significantly speed them up.
> >
> > In this case, since you really only want an mm_struct to run code
> > under, it seems weird to create a whole task with its own PID and so
> > on. It seems to me like something similar to the /dev/kvm API would be
> > more appropriate here? Implementation options that I see for that
> > would be:
> >
> > 1. mm_struct-based:
> >       a set of syscalls to create a new mm_struct,
> >       change memory mappings under that mm_struct, and switch to it
>
> I like the idea of having a handle for mm. Instead of pid, we will pass
> this handle to process_vm_exec. We have pidfd for processes and we can
> introduce mmfd for mm_struct.

I personally think that it might be quite unwieldy when it comes to
the restrictions you get from trying to have shared memory with the
owning process - I'm having trouble figuring out how you can implement
copy-on-write semantics without relying on copy-on-write logic in the
host OS and without being able to use userfaultfd.
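
The tension between shared memory and copy-on-write can be seen in a
few lines of C. This is only a minimal sketch, assuming guest memory is
backed by a memfd that is mapped into more than one view, as the
sandboxes discussed in this thread do; the helper names
make_guest_memory() and demo_shared_vs_cow() and the memfd name
"guest-ram" are made up for illustration and do not come from the
patch series. A MAP_SHARED view stays coherent with the supervisor's
view, while a MAP_PRIVATE view gets copy-on-write from the host kernel
but then diverges from what the supervisor sees.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous memory file of `len` bytes.
 * memfd_create is invoked through syscall(2) so the sketch also builds
 * against C libraries that lack the memfd_create() wrapper. */
static int make_guest_memory(size_t len)
{
    int fd = (int)syscall(SYS_memfd_create, "guest-ram", 0U);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical demo: returns 0 if shared and private views behave as
 * described, 1 if they do not, and -1 if the setup itself failed. */
static int demo_shared_vs_cow(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = make_guest_memory(len);
    if (fd < 0)
        return -1;

    /* Two shared views of the same pages, plus one private (COW) view. */
    char *supervisor = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *guest_shared = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *guest_cow = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd); /* the mappings keep the memory alive */

    int result = -1;
    if (supervisor != MAP_FAILED && guest_shared != MAP_FAILED &&
        guest_cow != MAP_FAILED) {
        /* A write through one shared view is visible in the other. */
        strcpy(supervisor, "written by supervisor");
        int shared_sees_write =
            strcmp(guest_shared, "written by supervisor") == 0;

        /* A write through the private view triggers host-kernel COW:
         * the guest gets its own page, the supervisor's stays as is. */
        strcpy(guest_cow, "written by guest");
        int supervisor_unchanged =
            strcmp(supervisor, "written by supervisor") == 0;
        int guest_has_copy = strcmp(guest_cow, "written by guest") == 0;

        result = (shared_sees_write && supervisor_unchanged &&
                  guest_has_copy) ? 0 : 1;
    }

    if (supervisor != MAP_FAILED)
        munmap(supervisor, len);
    if (guest_shared != MAP_FAILED)
        munmap(guest_shared, len);
    if (guest_cow != MAP_FAILED)
        munmap(guest_cow, len);
    return result;
}
```

Calling demo_shared_vs_cow() from a small main() is enough to try it.
The private view is the only one that gets copy-on-write for free, and
it is also the one the supervisor can no longer observe through its own
mapping, which is the restriction described above.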

But if that's not a problem somehow, and you can find some reasonable
way to handle memory usage accounting and fix up everything that
assumes that multithreaded userspace threads don't switch ->mm, I
guess this might work for your usecase.

> > 2. pagetable-mirroring-based:
> >       like /dev/kvm, an API to create a new pagetable, mirror parts of
> >       the mm_struct's pagetables over into it with modified permissions
> >       (like KVM_SET_USER_MEMORY_REGION),
> >       and run code under that context.
> >       page fault handling would first handle the fault against mm->pgd
> >       as normal, then mirror the PTE over into the secondary pagetables.
> >       invalidation could be handled with MMU notifiers.
> >
>
> I found this idea interesting and decided to look at it more closely.
> After reading the kernel code for a few days, I realized that it would
> not be easy to implement something like this,

Yeah, it might need architecture-specific code to flip the page tables
on userspace entry/exit, and maybe also for mirroring them. And for
the TLB flushing logic...

> but more important is that
> I don’t understand what problem it solves. Will it simplify the
> user-space code? I don’t think so. Will it improve performance? It is
> unclear to me too.

Some reasons I can think of are:

 - direct guest memory access: I imagined you'd probably want to be able to
   directly access userspace memory from the supervisor, and
   with this approach that'd become easy.

 - integration with on-demand paging of the host OS: You'd be able to
   create things like file-backed copy-on-write mappings from the
   host filesystem, or implement your own mappings backed by some kind
   of storage using userfaultfd.

 - sandboxing: For sandboxing usecases (not your usecase), it would be
   possible to e.g. create a read-only clone of the entire address space of a
   process and give write access to specific parts of it, or something
   like that.
   These address space clones could potentially be created and destroyed
   fairly quickly.

 - accounting: memory usage would be automatically accounted to the
   supervisor process, so even without a parasite process, you'd be able
   to see the memory usage correctly in things like "top".

 - small (non-pageable) memory footprint in the host kernel:
   The only things the host kernel would have to persistently store would be
   the normal MM data structures for the supervisor plus the mappings
   from "guest userspace" memory ranges to supervisor memory ranges;
   userspace pagetables would be discardable, and could even be shared
   with those of the supervisor in cases where the alignment fits.
   So with this, large anonymous mappings with 4K granularity only cost you
   ~0.20% overhead across host and guest address space; without this, if you
   used shared mappings instead, you'd pay twice that for every 2MiB range
   from which parts are accessed in both contexts, plus probably another
   ~0.2% or so for the "struct address_space"?

 - all memory-management-related syscalls could be directly performed
   in the "kernel" process

But yeah, some of those aren't really relevant for your usecase, and I
guess things like the accounting aspect could just as well be solved
differently...

> First, in the KVM case, we have a few big linear mappings and need to
> support one “shadow” address space. In the case of sandboxes, we can
> have a tremendous number of mappings and many address spaces that we
> need to manage. Memory mappings will be mapped at different addresses
> in a supervisor address space and “guest” address spaces. If guest
> address spaces do not have their own mm_structs, we will need to reinvent
> vma-s in some form. If guest address spaces have mm_structs, this will
> look similar to https://lwn.net/Articles/830648/.
>
> Second, each pagetable is tied to an mm_struct.
> You suggest creating
> new pagetables that will not have their mm_struct-s (sorry if I
> misunderstood something).

Yeah, that's what I had in mind, page tables without an mm_struct.

> I am not sure that it will be easy to
> implement. How many corner cases will there be?

Yeah, it would require some work around TLB flushing and entry/exit
from userspace. But from a high-level perspective it feels to me like
a change with less systematic impact. Maybe I'm wrong about that.

> As for page faults in a secondary address space, we will need to find a
> fault address in the main address space, handle the fault there and then
> mirror the PTE to the secondary pagetable.

Right.

> Effectively, it means that
> page faults will be handled in two address spaces. Right now, we use
> memfd and shared mappings. It means that each fault is handled only in
> one address space, and we map a guest memory region to the supervisor
> address space only when we need to access it. A large portion of guest
> anonymous memory is never mapped to the supervisor address space.
> Will the overhead of mirrored address spaces be smaller than that of
> memfd shared mappings? I am not sure.

But as long as the mappings are sufficiently big and aligned properly,
or you explicitly manage the supervisor address space, some of that
cost disappears: E.g. even if a page is mapped in both address spaces,
you wouldn't have a memory cost for the second mapping if the page
tables are shared.

> Third, this approach will not remove the need for process_vm_exec. We will
> need to switch to a guest address space with a specified state and
> switch back on faults or syscalls.

Yeah, you'd still need a syscall for running code under a different
set of page tables. But that's something that KVM _almost_ already
does.

> If the main concern is the ability to
> run syscalls on a remote mm, we can think about how to fix this. I see
> two things we can do here:
>
> * Specify the exact list of system calls that are allowed.
>   The first three candidates are mmap, munmap, and vmsplice.
>
> * Instead of allowing us to run system calls, we can implement this in
>   the form of commands. In the case of sandboxes, we need to implement
>   only two commands to create and destroy memory mappings in a target
>   address space.

FWIW, there is precedent for something similar: The Android folks
already added process_madvise() for remotely messing with the VMAs of
another process to some degree.
bWFwcGluZyBpZiB0aGUgcGFnZQp0YWJsZXMgYXJlIHNoYXJlZC4KCj4gVGhpcmQsIHRoaXMgYXBw cm9hY2ggd2lsbCBub3QgZ2V0IHJpZCBvZiBoYXZpbmcgcHJvY2Vzc192bV9leGVjLiBXZSB3aWxs Cj4gbmVlZCB0byBzd2l0Y2ggdG8gYSBndWVzdCBhZGRyZXNzIHNwYWNlIHdpdGggYSBzcGVjaWZp ZWQgc3RhdGUgYW5kCj4gc3dpdGNoIGJhY2sgb24gZmF1bHRzIG9yIHN5c2NhbGxzLgoKWWVhaCwg eW91J2Qgc3RpbGwgbmVlZCBhIHN5c2NhbGwgZm9yIHJ1bm5pbmcgY29kZSB1bmRlciBhIGRpZmZl cmVudApzZXQgb2YgcGFnZSB0YWJsZXMuIEJ1dCB0aGF0J3Mgc29tZXRoaW5nIHRoYXQgS1ZNIF9h bG1vc3RfIGFscmVhZHkKZG9lcy4KCj4gSWYgdGhlIG1haW4gY29uY2VybiBpcyB0aGUgYWJpbGl0 eSB0bwo+IHJ1biBzeXNjYWxscyBvbiBhIHJlbW90ZSBtbSwgd2UgY2FuIHRoaW5rIGFib3V0IGhv dyB0byBmaXggdGhpcy4gSSBzZWUKPiB0d28gd2F5cyB3aGF0IHdlIGNhbiBkbyBoZXJlOgo+Cj4g KiBTcGVjaWZ5IHRoZSBleGFjdCBsaXN0IG9mIHN5c3RlbSBjYWxscyB0aGF0IGFyZSBhbGxvd2Vk LiBUaGUgZmlyc3QKPiB0aHJlZSBjYW5kaWRhdGVzIGFyZSBtbWFwLCBtdW5tYXAsIGFuZCB2bXNw bGljZS4KPgo+ICogSW5zdGVhZCBvZiBhbGxvd2luZyB1cyB0byBydW4gc3lzdGVtIGNhbGxzLCB3 ZSBjYW4gaW1wbGVtZW50IHRoaXMgaW4KPiB0aGUgZm9ybSBvZiBjb21tYW5kcy4gSW4gdGhlIGNh c2Ugb2Ygc2FuZGJveGVzLCB3ZSBuZWVkIHRvIGltcGxlbWVudAo+IG9ubHkgdHdvIGNvbW1hbmRz IHRvIGNyZWF0ZSBhbmQgZGVzdHJveSBtZW1vcnkgbWFwcGluZ3MgaW4gYSB0YXJnZXQKPiBhZGRy ZXNzIHNwYWNlLgoKRldJVywgdGhlcmUgaXMgcHJlY2VkZW50IGZvciBzb21ldGhpbmcgc2ltaWxh cjogVGhlIEFuZHJvaWQgZm9sa3MKYWxyZWFkeSBhZGRlZCBwcm9jZXNzX21hZHZpc2UoKSBmb3Ig cmVtb3RlbHkgbWVzc2luZyB3aXRoIHRoZSBWTUFzIG9mCmFub3RoZXIgcHJvY2VzcyB0byBzb21l IGRlZ3JlZS4KCl9fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f CmxpbnV4LXVtIG1haWxpbmcgbGlzdApsaW51eC11bUBsaXN0cy5pbmZyYWRlYWQub3JnCmh0dHA6 Ly9saXN0cy5pbmZyYWRlYWQub3JnL21haWxtYW4vbGlzdGluZm8vbGludXgtdW0K