From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20210414055217.543246-1-avagin@gmail.com> <CAG48ez0jfsS=gKN0Vo_VS2EvvMBvEr+QNz0vDKPeSAzsrsRwPQ@mail.gmail.com>
 <YN648cPBDIGKYlYa@gmail.com>
In-Reply-To: <YN648cPBDIGKYlYa@gmail.com>
From:   Jann Horn <jannh@google.com>
Date:   Fri, 2 Jul 2021 17:12:02 +0200
Message-ID: <CAG48ez2vLKGTBOmc-5AJQE=j4Uy=HprSJVmJnOR-4Exb5rbMdA@mail.gmail.com>
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another
 address space
To:     Andrei Vagin <avagin@gmail.com>
Cc:     kernel list <linux-kernel@vger.kernel.org>,
        Linux API <linux-api@vger.kernel.org>,
        linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
        Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>,
        Anton Ivanov <anton.ivanov@cambridgegreys.com>,
        Christian Brauner <christian.brauner@ubuntu.com>,
        Dmitry Safonov <0x7f454c46@gmail.com>,
        Ingo Molnar <mingo@redhat.com>, Jeff Dike <jdike@addtoit.com>,
        Mike Rapoport <rppt@linux.ibm.com>,
        Michael Kerrisk <mtk.manpages@gmail.com>,
        Oleg Nesterov <oleg@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Richard Weinberger <richard@nod.at>,
        Thomas Gleixner <tglx@linutronix.de>, linux-mm@kvack.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <avagin@gmail.com> wrote:
> On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote:
> > > We already have process_vm_readv and process_vm_writev to read and write
> > > to a process memory faster than we can do this with ptrace. And now it
> > > is time for process_vm_exec that allows executing code in an address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > >
> > > = Use-cases =
> >
> > It seems to me like your proposed API doesn't really fit either one of
> > those usecases well...
> >
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes that
> > > are used to manage guest address spaces. Guest code is executed in the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > > significantly speed them up.
> >
> > In this case, since you really only want an mm_struct to run code
> > under, it seems weird to create a whole task with its own PID and so
> > on. It seems to me like something similar to the /dev/kvm API would be
> > more appropriate here? Implementation options that I see for that
> > would be:
> >
> > 1. mm_struct-based:
> >       a set of syscalls to create a new mm_struct,
> >       change memory mappings under that mm_struct, and switch to it
>
> I like the idea to have a handle for mm. Instead of pid, we will pass
> this handle to process_vm_exec. We have pidfd for processes and we can
> introduce mmfd for mm_struct.

I personally think that it might be quite unwieldy when it comes to
the restrictions you get from trying to have shared memory with the
owning process - I'm having trouble figuring out how you can implement
copy-on-write semantics without relying on copy-on-write logic in the
host OS and without being able to use userfaultfd.

But if that's not a problem somehow, and you can find some reasonable
way to handle memory usage accounting and fix up everything that
assumes that multithreaded userspace threads don't switch ->mm, I
guess this might work for your usecase.

> > 2. pagetable-mirroring-based:
> >       like /dev/kvm, an API to create a new pagetable, mirror parts of
> >       the mm_struct's pagetables over into it with modified permissions
> >       (like KVM_SET_USER_MEMORY_REGION),
> >       and run code under that context.
> >       page fault handling would first handle the fault against mm->pgd
> >       as normal, then mirror the PTE over into the secondary pagetables.
> >       invalidation could be handled with MMU notifiers.
> >
>
> I found this idea interesting and decided to look at it more closely.
> After reading the kernel code for a few days, I realized that it would
> not be easy to implement something like this,

Yeah, it might need architecture-specific code to flip the page tables
on userspace entry/exit, and maybe also for mirroring them. And for
the TLB flushing logic...

> but more important is that
> I don’t understand what problem it solves. Will it simplify the
> user-space code? I don’t think so. Will it improve performance? It is
> unclear for me too.

Some reasons I can think of are:

 - direct guest memory access: I imagined you'd probably want to be able to
   directly access userspace memory from the supervisor, and
   with this approach that'd become easy.

 - integration with on-demand paging of the host OS: You'd be able to
   create things like file-backed copy-on-write mappings from the
   host filesystem, or implement your own mappings backed by some kind
   of storage using userfaultfd.

 - sandboxing: For sandboxing usecases (not your usecase), it would be
   possible to e.g. create a read-only clone of the entire address space of a
   process and give write access to specific parts of it, or something
   like that.
   These address space clones could potentially be created and destroyed
   fairly quickly.

 - accounting: memory usage would be automatically accounted to the
   supervisor process, so even without a parasite process, you'd be able
   to see the memory usage correctly in things like "top".

 - small (non-pageable) memory footprint in the host kernel:
   The only things the host kernel would have to persistently store would be
   the normal MM data structures for the supervisor plus the mappings
   from "guest userspace" memory ranges to supervisor memory ranges;
   userspace pagetables would be discardable, and could even be shared
   with those of the supervisor in cases where the alignment fits.
   So with this, large anonymous mappings with 4K granularity only cost you
   ~0.20% overhead across host and guest address space; without this, if you
   used shared mappings instead, you'd pay twice that for every 2MiB range
   from which parts are accessed in both contexts, plus probably another
   ~0.2% or so for the "struct address_space"?

 - all memory-management-related syscalls could be directly performed
   in the "kernel" process

But yeah, some of those aren't really relevant for your usecase, and I
guess things like the accounting aspect could just as well be solved
differently...
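
As a rough check of the page table figures in the footprint point
above, here is a small sketch of the arithmetic. It assumes x86-64 with
4 KiB pages and 8-byte last-level page table entries (an assumption,
not spelled out above), ignores the upper page table levels, and the
function name is made up for illustration.

```c
#include <assert.h>

/* Assumed x86-64 values; they are not stated in the mail itself. */
#define PAGE_SIZE_BYTES 4096.0
#define PTE_SIZE_BYTES  8.0

/*
 * Last-level page table overhead, in percent of the mapped memory,
 * when `copies` independent sets of page tables map the same range
 * with 4 KiB granularity.
 */
double pte_overhead_percent(unsigned int copies)
{
	assert(copies > 0); /* at least one set of page tables must exist */
	return 100.0 * copies * PTE_SIZE_BYTES / PAGE_SIZE_BYTES;
}
```

With these assumptions, pte_overhead_percent(1) is roughly 0.195,
which is the ~0.20% mentioned above, and pte_overhead_percent(2) is
roughly 0.39 for the case where both contexts need their own page
tables for the same range.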

> First, in the KVM case, we have a few big linear mappings and need to
> support one “shadow” address space. In the case of sandboxes, we can
> have a tremendous amount of mappings and many address spaces that we
> need to manage.  Memory mappings will be mapped with different addresses
> in a supervisor address space and “guest” address spaces. If guest
> address spaces will not have their mm_structs, we will need to reinvent
> vma-s in some form. If guest address spaces have mm_structs, this will
> look similar to https://lwn.net/Articles/830648/.
>
> Second, each pagetable is tied up with mm_struct. You suggest creating
> new pagetables that will not have their mm_struct-s (sorry if I
> misunderstood something).

Yeah, that's what I had in mind, page tables without an mm_struct.

> I am not sure that it will be easy to
> implement. How many corner cases will be there?

Yeah, it would require some work around TLB flushing and entry/exit
from userspace. But from a high-level perspective it feels to me like
a change with less systematic impact. Maybe I'm wrong about that.

> As for page faults in a secondary address space, we will need to find a
> fault address in the main address space, handle the fault there and then
> mirror the PTE to the secondary pagetable.

Right.

> Effectively, it means that
> page faults will be handled in two address spaces. Right now, we use
> memfd and shared mappings. It means that each fault is handled only in
> one address space, and we map a guest memory region to the supervisor
> address space only when we need to access it. A large portion of guest
> anonymous memory is never mapped to the supervisor address space.
> Will an overhead of mirrored address spaces be smaller than memfd shared
> mappings? I am not sure.

But as long as the mappings are sufficiently big and aligned properly,
or you explicitly manage the supervisor address space, some of that
cost disappears: E.g. even if a page is mapped in both address spaces,
you wouldn't have a memory cost for the second mapping if the page
tables are shared.
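
For reference, the memfd-plus-shared-mappings setup from the quoted
text can be sketched roughly like this. It is a simplified
illustration, not code from any of the sandboxes discussed: both
mappings live in one process here (in the real setup the "guest" view
would be in a stub process), and the function and variable names are
invented.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one memfd page at two different addresses and check that a write
 * through one mapping is visible through the other.
 * Returns 1 if the data is shared, 0 if not, -1 on setup failure.
 */
int memfd_shared_mapping_demo(void)
{
	int result = -1;
	void *supervisor_view = MAP_FAILED;
	void *guest_view = MAP_FAILED;
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0)
		return -1;

	/* Anonymous file that backs the "guest" memory. */
	int fd = memfd_create("guest-ram", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0)
		goto out;

	supervisor_view = mmap(NULL, page, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
	if (supervisor_view == MAP_FAILED)
		goto out;
	guest_view = mmap(NULL, page, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	if (guest_view == MAP_FAILED)
		goto out;

	/* Two live mappings cannot start at the same address. */
	assert(supervisor_view != guest_view);

	/* Write through one view, read back through the other. */
	strcpy(guest_view, "hello");
	result = strcmp(supervisor_view, "hello") == 0;
out:
	if (guest_view != MAP_FAILED)
		munmap(guest_view, page);
	if (supervisor_view != MAP_FAILED)
		munmap(supervisor_view, page);
	close(fd);
	return result;
}
```

Each of the two views needs its own page table entries here, which is
the per-mapping cost that shared page tables would avoid.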

> Third, this approach will not get rid of having process_vm_exec. We will
> need to switch to a guest address space with a specified state and
> switch back on faults or syscalls.

Yeah, you'd still need a syscall for running code under a different
set of page tables. But that's something that KVM _almost_ already
does.

> If the main concern is the ability to
> run syscalls on a remote mm, we can think about how to fix this. I see
> two ways what we can do here:
>
> * Specify the exact list of system calls that are allowed. The first
> three candidates are mmap, munmap, and vmsplice.
>
> * Instead of allowing us to run system calls, we can implement this in
> the form of commands. In the case of sandboxes, we need to implement
> only two commands to create and destroy memory mappings in a target
> address space.

FWIW, there is precedent for something similar: The Android folks
already added process_madvise() for remotely messing with the VMAs of
another process to some degree.
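
A minimal sketch of what calling it looks like, using raw syscalls so
it does not depend on a libc wrapper. The helper name is made up, the
fallback syscall numbers are the generic ones (an assumption if your
headers lack them), and whether the call is permitted depends on kernel
version and privileges, so treat failures as expected.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values are from the generic
 * syscall table (alpha uses different numbers). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hint that [addr, addr + len) in process `pid` is cold.
 * Returns the number of bytes advised, or -1 with errno set
 * (for example ENOSYS on old kernels, EPERM without enough privileges).
 */
long remote_madvise_cold(pid_t pid, void *addr, size_t len)
{
	struct iovec range = { .iov_base = addr, .iov_len = len };
	long advised;
	int saved_errno;

	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
	if (pidfd < 0)
		return -1;

	advised = syscall(SYS_process_madvise, pidfd, &range,
			  (size_t)1, MADV_COLD, 0U);
	saved_errno = errno;

	/* Sanity check: never more bytes advised than were requested. */
	assert(advised <= (long)len);

	close(pidfd);
	errno = saved_errno; /* do not let close() clobber the error */
	return advised;
}
```

A supervisor could call remote_madvise_cold(stub_pid, addr, len) on a
range inside a stub process; an invalid pid such as -1 simply yields -1.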

ICAgIChsaWtlIEtWTV9TRVRfVVNFUl9NRU1PUllfUkVHSU9OKSwKPiA+ICAgICAgIGFuZCBydW4g
Y29kZSB1bmRlciB0aGF0IGNvbnRleHQuCj4gPiAgICAgICBwYWdlIGZhdWx0IGhhbmRsaW5nIHdv
dWxkIGZpcnN0IGhhbmRsZSB0aGUgZmF1bHQgYWdhaW5zdCBtbS0+cGdkCj4gPiAgICAgICBhcyBu
b3JtYWwsIHRoZW4gbWlycm9yIHRoZSBQVEUgb3ZlciBpbnRvIHRoZSBzZWNvbmRhcnkgcGFnZXRh
Ymxlcy4KPiA+ICAgICAgIGludmFsaWRhdGlvbiBjb3VsZCBiZSBoYW5kbGVkIHdpdGggTU1VIG5v
dGlmaWVycy4KPiA+Cj4KPiBJIGZvdW5kIHRoaXMgaWRlYSBpbnRlcmVzdGluZyBhbmQgZGVjaWRl
ZCB0byBsb29rIGF0IGl0IG1vcmUgY2xvc2VseS4KPiBBZnRlciByZWFkaW5nIHRoZSBrZXJuZWwg
Y29kZSBmb3IgYSBmZXcgZGF5cywgSSByZWFsaXplZCB0aGF0IGl0IHdvdWxkCj4gbm90IGJlIGVh
c3kgdG8gaW1wbGVtZW50IHNvbWV0aGluZyBsaWtlIHRoaXMsCgpZZWFoLCBpdCBtaWdodCBuZWVk
IGFyY2hpdGVjdHVyZS1zcGVjaWZpYyBjb2RlIHRvIGZsaXAgdGhlIHBhZ2UgdGFibGVzCm9uIHVz
ZXJzcGFjZSBlbnRyeS9leGl0LCBhbmQgbWF5YmUgYWxzbyBmb3IgbWlycm9yaW5nIHRoZW0uIEFu
ZCBmb3IKdGhlIFRMQiBmbHVzaGluZyBsb2dpYy4uLgoKPiBidXQgbW9yZSBpbXBvcnRhbnQgaXMg
dGhhdAo+IEkgZG9u4oCZdCB1bmRlcnN0YW5kIHdoYXQgcHJvYmxlbSBpdCBzb2x2ZXMuIFdpbGwg
aXQgc2ltcGxpZnkgdGhlCj4gdXNlci1zcGFjZSBjb2RlPyBJIGRvbuKAmXQgdGhpbmsgc28uIFdp
bGwgaXQgaW1wcm92ZSBwZXJmb3JtYW5jZT8gSXQgaXMKPiB1bmNsZWFyIGZvciBtZSB0b28uCgpT
b21lIHJlYXNvbnMgSSBjYW4gdGhpbmsgb2YgYXJlOgoKIC0gZGlyZWN0IGd1ZXN0IG1lbW9yeSBh
Y2Nlc3M6IEkgaW1hZ2luZWQgeW91J2QgcHJvYmFibHkgd2FudCB0byBiZSBhYmxlIHRvCiAgIGRp
cmVjdGx5IGFjY2VzcyB1c2Vyc3BhY2UgbWVtb3J5IGZyb20gdGhlIHN1cGVydmlzb3IsIGFuZAog
ICB3aXRoIHRoaXMgYXBwcm9hY2ggdGhhdCdkIGJlY29tZSBlYXN5LgoKIC0gaW50ZWdyYXRpb24g
d2l0aCBvbi1kZW1hbmQgcGFnaW5nIG9mIHRoZSBob3N0IE9TOiBZb3UnZCBiZSBhYmxlIHRvCiAg
IGNyZWF0ZSB0aGluZ3MgbGlrZSBmaWxlLWJhY2tlZCBjb3B5LW9uLXdyaXRlIG1hcHBpbmdzIGZy
b20gdGhlCiAgIGhvc3QgZmlsZXN5c3RlbSwgb3IgaW1wbGVtZW50IHlvdXIgb3duIG1hcHBpbmdz
IGJhY2tlZCBieSBzb21lIGtpbmQKICAgb2Ygc3RvcmFnZSB1c2luZyB1c2VyZmF1bHRmZC4KCiAt
IHNhbmRib3hpbmc6IEZvciBzYW5kYm94aW5nIHVzZWNhc2VzIChub3QgeW91ciB1c2VjYXNlKSwg
aXQgd291bGQgYmUKICAgcG9zc2libGUgdG8gZS5nLiBjcmVhdGUgYSByZWFkLW9ubHkgY2xvbmUg
b2YgdGhlIGVudGlyZSBhZGRyZXNzIHNwYWNlIG9mIGEKICAgcHJvY2VzcyBhbmQgZ2l2ZSB3cml0
ZSBhY2Nlc3MgdG8gc3BlY2lmaWMgcGFydHMgb2YgaXQsIG9yIHNvbWV0aGluZwogICBsaWtlIHRo
YXQuCiAgIFRoZXNlIGFkZHJlc3Mgc3BhY2UgY2xvbmVzIGNvdWxkIHBvdGVudGlhbGx5IGJlIGNy
ZWF0ZWQgYW5kIGRlc3Ryb3llZAogICBmYWlybHkgcXVpY2tseS4KCiAtIGFjY291bnRpbmc6IG1l
bW9yeSB1c2FnZSB3b3VsZCBiZSBhdXRvbWF0aWNhbGx5IGFjY291bnRlZCB0byB0aGUKICAgc3Vw
ZXJ2aXNvciBwcm9jZXNzLCBzbyBldmVuIHdpdGhvdXQgYSBwYXJhc2l0ZSBwcm9jZXNzLCB5b3Un
ZCBiZSBhYmxlCiAgIHRvIHNlZSB0aGUgbWVtb3J5IHVzYWdlIGNvcnJlY3RseSBpbiB0aGluZ3Mg
bGlrZSAidG9wIi4KCiAtIHNtYWxsIChub24tcGFnZWFibGUpIG1lbW9yeSBmb290cHJpbnQgaW4g
dGhlIGhvc3Qga2VybmVsOgogICBUaGUgb25seSB0aGluZ3MgdGhlIGhvc3Qga2VybmVsIHdvdWxk
IGhhdmUgdG8gcGVyc2lzdGVudGx5IHN0b3JlIHdvdWxkIGJlCiAgIHRoZSBub3JtYWwgTU0gZGF0
YSBzdHJ1Y3R1cmVzIGZvciB0aGUgc3VwZXJ2aXNvciBwbHVzIHRoZSBtYXBwaW5ncwogICBmcm9t
ICJndWVzdCB1c2Vyc3BhY2UiIG1lbW9yeSByYW5nZXMgdG8gc3VwZXJ2aXNvciBtZW1vcnkgcmFu
Z2VzOwogICB1c2Vyc3BhY2UgcGFnZXRhYmxlcyB3b3VsZCBiZSBkaXNjYXJkYWJsZSwgYW5kIGNv
dWxkIGV2ZW4gYmUgc2hhcmVkCiAgIHdpdGggdGhvc2Ugb2YgdGhlIHN1cGVydmlzb3IgaW4gY2Fz
ZXMgd2hlcmUgdGhlIGFsaWdubWVudCBmaXRzLgogICBTbyB3aXRoIHRoaXMsIGxhcmdlIGFub255
bW91cyBtYXBwaW5ncyB3aXRoIDRLIGdyYW51bGFyaXR5IG9ubHkgY29zdCB5b3UKICAgfjAuMjAl
IG92ZXJoZWFkIGFjcm9zcyBob3N0IGFuZCBndWVzdCBhZGRyZXNzIHNwYWNlOyB3aXRob3V0IHRo
aXMsIGlmIHlvdQogICB1c2VkIHNoYXJlZCBtYXBwaW5ncyBpbnN0ZWFkLCB5b3UnZCBwYXkgdHdp
Y2UgdGhhdCBmb3IgZXZlcnkgMk1pQiByYW5nZQogICBmcm9tIHdoaWNoIHBhcnRzIGFyZSBhY2Nl
c3NlZCBpbiBib3RoIGNvbnRleHRzLCBwbHVzIHByb2JhYmx5IGFub3RoZXIKICAgfjAuMiUgb3Ig
c28gZm9yIHRoZSAic3RydWN0IGFkZHJlc3Nfc3BhY2UiPwoKIC0gYWxsIG1lbW9yeS1tYW5hZ2Vt
ZW50LXJlbGF0ZWQgc3lzY2FsbHMgY291bGQgYmUgZGlyZWN0bHkgcGVyZm9ybWVkCiAgIGluIHRo
ZSAia2VybmVsIiBwcm9jZXNzCgpCdXQgeWVhaCwgc29tZSBvZiB0aG9zZSBhcmVuJ3QgcmVhbGx5
IHJlbGV2YW50IGZvciB5b3VyIHVzZWNhc2UsIGFuZCBJCmd1ZXNzIHRoaW5ncyBsaWtlIHRoZSBh
Y2NvdW50aW5nIGFzcGVjdCBjb3VsZCBqdXN0IGFzIHdlbGwgYmUgc29sdmVkCmRpZmZlcmVudGx5
Li4uCgo+IEZpcnN0LCBpbiB0aGUgS1ZNIGNhc2UsIHdlIGhhdmUgYSBmZXcgYmlnIGxpbmVhciBt
YXBwaW5ncyBhbmQgbmVlZCB0bwo+IHN1cHBvcnQgb25lIOKAnHNoYWRvd+KAnSBhZGRyZXNzIHNw
YWNlLiBJbiB0aGUgY2FzZSBvZiBzYW5kYm94ZXMsIHdlIGNhbgo+IGhhdmUgYSB0cmVtZW5kb3Vz
IGFtb3VudCBvZiBtYXBwaW5ncyBhbmQgbWFueSBhZGRyZXNzIHNwYWNlcyB0aGF0IHdlCj4gbmVl
ZCB0byBtYW5hZ2UuICBNZW1vcnkgbWFwcGluZ3Mgd2lsbCBiZSBtYXBwZWQgd2l0aCBkaWZmZXJl
bnQgYWRkcmVzc2VzCj4gaW4gYSBzdXBlcnZpc29yIGFkZHJlc3Mgc3BhY2UgYW5kIOKAnGd1ZXN0
4oCdIGFkZHJlc3Mgc3BhY2VzLiBJZiBndWVzdAo+IGFkZHJlc3Mgc3BhY2VzIHdpbGwgbm90IGhh
dmUgdGhlaXIgbW1fc3RydWN0cywgd2Ugd2lsbCBuZWVkIHRvIHJlaW52ZW50Cj4gdm1hLXMgaW4g
c29tZSBmb3JtLiBJZiBndWVzdCBhZGRyZXNzIHNwYWNlcyBoYXZlIG1tX3N0cnVjdHMsIHRoaXMg
d2lsbAo+IGxvb2sgc2ltaWxhciB0byBodHRwczovL2x3bi5uZXQvQXJ0aWNsZXMvODMwNjQ4Ly4K
Pgo+IFNlY29uZCwgZWFjaCBwYWdldGFibGUgaXMgdGllZCB1cCB3aXRoIG1tX3N0dWN0LiBZb3Ug
c3VnZ2VzdCBjcmVhdGluZwo+IG5ldyBwYWdldGFibGVzIHRoYXQgd2lsbCBub3QgaGF2ZSB0aGVp
ciBtbV9zdHJ1Y3QtcyAoc29ycnkgaWYgSQo+IG1pc3VuZGVyc3Rvb2Qgc29tZXRoaW5nKS4KClll
YWgsIHRoYXQncyB3aGF0IEkgaGFkIGluIG1pbmQsIHBhZ2UgdGFibGVzIHdpdGhvdXQgYW4gbW1f
c3RydWN0LgoKPiBJIGFtIG5vdCBzdXJlIHRoYXQgaXQgd2lsbCBiZSBlYXN5IHRvCj4gaW1wbGVt
ZW50LiBIb3cgbWFueSBjb3JuZXIgY2FzZXMgd2lsbCBiZSB0aGVyZT8KClllYWgsIGl0IHdvdWxk
IHJlcXVpcmUgc29tZSB3b3JrIGFyb3VuZCBUTEIgZmx1c2hpbmcgYW5kIGVudHJ5L2V4aXQKZnJv
bSB1c2Vyc3BhY2UuIEJ1dCBmcm9tIGEgaGlnaC1sZXZlbCBwZXJzcGVjdGl2ZSBpdCBmZWVscyB0
byBtZSBsaWtlCmEgY2hhbmdlIHdpdGggbGVzcyBzeXN0ZW1hdGljIGltcGFjdC4gTWF5YmUgSSdt
IHdyb25nIGFib3V0IHRoYXQuCgo+IEFzIGZvciBwYWdlIGZhdWx0cyBpbiBhIHNlY29uZGFyeSBh
ZGRyZXNzIHNwYWNlLCB3ZSB3aWxsIG5lZWQgdG8gZmluZCBhCj4gZmF1bHQgYWRkcmVzcyBpbiB0
aGUgbWFpbiBhZGRyZXNzIHNwYWNlLCBoYW5kbGUgdGhlIGZhdWx0IHRoZXJlIGFuZCB0aGVuCj4g
bWlycm9yIHRoZSBQVEUgdG8gdGhlIHNlY29uZGFyeSBwYWdldGFibGUuCgpSaWdodC4KCj4gRWZm
ZWN0aXZlbHksIGl0IG1lYW5zIHRoYXQKPiBwYWdlIGZhdWx0cyB3aWxsIGJlIGhhbmRsZWQgaW4g
dHdvIGFkZHJlc3Mgc3BhY2VzLiBSaWdodCBub3csIHdlIHVzZQo+IG1lbWZkIGFuZCBzaGFyZWQg
bWFwcGluZ3MuIEl0IG1lYW5zIHRoYXQgZWFjaCBmYXVsdCBpcyBoYW5kbGVkIG9ubHkgaW4KPiBv
bmUgYWRkcmVzcyBzcGFjZSwgYW5kIHdlIG1hcCBhIGd1ZXN0IG1lbW9yeSByZWdpb24gdG8gdGhl
IHN1cGVydmlzb3IKPiBhZGRyZXNzIHNwYWNlIG9ubHkgd2hlbiB3ZSBuZWVkIHRvIGFjY2VzcyBp
dC4gQSBsYXJnZSBwb3J0aW9uIG9mIGd1ZXN0Cj4gYW5vbnltb3VzIG1lbW9yeSBpcyBuZXZlciBt
YXBwZWQgdG8gdGhlIHN1cGVydmlzb3IgYWRkcmVzcyBzcGFjZS4KPiBXaWxsIGFuIG92ZXJoZWFk
IG9mIG1pcnJvcmVkIGFkZHJlc3Mgc3BhY2VzIGJlIHNtYWxsZXIgdGhhbiBtZW1mZCBzaGFyZWQK
PiBtYXBwaW5ncz8gSSBhbSBub3Qgc3VyZS4KCkJ1dCBhcyBsb25nIGFzIHRoZSBtYXBwaW5ncyBh
cmUgc3VmZmljaWVudGx5IGJpZyBhbmQgYWxpZ25lZCBwcm9wZXJseSwKb3IgeW91IGV4cGxpY2l0
bHkgbWFuYWdlIHRoZSBzdXBlcnZpc29yIGFkZHJlc3Mgc3BhY2UsIHNvbWUgb2YgdGhhdApjb3N0
IGRpc2FwcGVhcnM6IEUuZy4gZXZlbiBpZiBhIHBhZ2UgaXMgbWFwcGVkIGluIGJvdGggYWRkcmVz
cyBzcGFjZXMsCnlvdSB3b3VsZG4ndCBoYXZlIGEgbWVtb3J5IGNvc3QgZm9yIHRoZSBzZWNvbmQg
bWFwcGluZyBpZiB0aGUgcGFnZQp0YWJsZXMgYXJlIHNoYXJlZC4KCj4gVGhpcmQsIHRoaXMgYXBw
cm9hY2ggd2lsbCBub3QgZ2V0IHJpZCBvZiBoYXZpbmcgcHJvY2Vzc192bV9leGVjLiBXZSB3aWxs
Cj4gbmVlZCB0byBzd2l0Y2ggdG8gYSBndWVzdCBhZGRyZXNzIHNwYWNlIHdpdGggYSBzcGVjaWZp
ZWQgc3RhdGUgYW5kCj4gc3dpdGNoIGJhY2sgb24gZmF1bHRzIG9yIHN5c2NhbGxzLgoKWWVhaCwg
eW91J2Qgc3RpbGwgbmVlZCBhIHN5c2NhbGwgZm9yIHJ1bm5pbmcgY29kZSB1bmRlciBhIGRpZmZl
cmVudApzZXQgb2YgcGFnZSB0YWJsZXMuIEJ1dCB0aGF0J3Mgc29tZXRoaW5nIHRoYXQgS1ZNIF9h
bG1vc3RfIGFscmVhZHkKZG9lcy4KCj4gSWYgdGhlIG1haW4gY29uY2VybiBpcyB0aGUgYWJpbGl0
eSB0bwo+IHJ1biBzeXNjYWxscyBvbiBhIHJlbW90ZSBtbSwgd2UgY2FuIHRoaW5rIGFib3V0IGhv
dyB0byBmaXggdGhpcy4gSSBzZWUKPiB0d28gd2F5cyB3aGF0IHdlIGNhbiBkbyBoZXJlOgo+Cj4g
KiBTcGVjaWZ5IHRoZSBleGFjdCBsaXN0IG9mIHN5c3RlbSBjYWxscyB0aGF0IGFyZSBhbGxvd2Vk
LiBUaGUgZmlyc3QKPiB0aHJlZSBjYW5kaWRhdGVzIGFyZSBtbWFwLCBtdW5tYXAsIGFuZCB2bXNw
bGljZS4KPgo+ICogSW5zdGVhZCBvZiBhbGxvd2luZyB1cyB0byBydW4gc3lzdGVtIGNhbGxzLCB3
ZSBjYW4gaW1wbGVtZW50IHRoaXMgaW4KPiB0aGUgZm9ybSBvZiBjb21tYW5kcy4gSW4gdGhlIGNh
c2Ugb2Ygc2FuZGJveGVzLCB3ZSBuZWVkIHRvIGltcGxlbWVudAo+IG9ubHkgdHdvIGNvbW1hbmRz
IHRvIGNyZWF0ZSBhbmQgZGVzdHJveSBtZW1vcnkgbWFwcGluZ3MgaW4gYSB0YXJnZXQKPiBhZGRy
ZXNzIHNwYWNlLgoKRldJVywgdGhlcmUgaXMgcHJlY2VkZW50IGZvciBzb21ldGhpbmcgc2ltaWxh
cjogVGhlIEFuZHJvaWQgZm9sa3MKYWxyZWFkeSBhZGRlZCBwcm9jZXNzX21hZHZpc2UoKSBmb3Ig
cmVtb3RlbHkgbWVzc2luZyB3aXRoIHRoZSBWTUFzIG9mCmFub3RoZXIgcHJvY2VzcyB0byBzb21l
IGRlZ3JlZS4K