From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 588E7C25B06 for ; Tue, 2 Aug 2022 00:49:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235446AbiHBAtZ (ORCPT ); Mon, 1 Aug 2022 20:49:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32908 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231290AbiHBAtW (ORCPT ); Mon, 1 Aug 2022 20:49:22 -0400 Received: from mail-pl1-x629.google.com (mail-pl1-x629.google.com [IPv6:2607:f8b0:4864:20::629]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CCD893341F for ; Mon, 1 Aug 2022 17:49:19 -0700 (PDT) Received: by mail-pl1-x629.google.com with SMTP id z19so12011775plb.1 for ; Mon, 01 Aug 2022 17:49:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc; bh=fYrbrS8yv9H/RvUOOyuHQrSi1loIQEHWMZ/8CZZ3RXc=; b=gO0RsnmpomRU40aaGuswU5stzk1dV4NDc3pxJ6zSc9zcDzvqVHJEsaTU46TCCnnEDM VIM2Q44ozglwDHGobO6GUhmj4taro6Si7I5uJlkeFvBIDIYNgCZMO9DNDT/Fr4nlOf8j FKfGYbL8I576CFY482PM+hUdjOGAEqqzJXYVPR8s7R687SE/sN2YRJ0jIEeCCsAk1y6f DVn0KhP6WN2ELlJd8pOY+UaucR7IsHDggLN+juJdfB22mcm/n8UKO1ldaQKA7IYzA1tt upHroORCIR8C9toovIJlWRMZF59BNixyWef3Z9805sHHgrSaCK+JCYEP+TBUe8H+AEPS tz/g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc; bh=fYrbrS8yv9H/RvUOOyuHQrSi1loIQEHWMZ/8CZZ3RXc=; b=txnWUyuQxX4u+egWBgxyPciY+SjemiwuZ3PWOLDaAy7Hj4KM7xgjY5Mhpttf87r8Si Bg9Eg22HeQqF1oHK1WH3Gfq1Av730xuhIaX/SmGDHOcJbT/ue1LZqNomyiZ6gbx8QMxz CMZm3SJOn5SlZiacB9fNKvz8p0x25KkVcb/7zWyIs4irHoLsDJzPpw1vgf2g/Wq5UdYX kbCT5JE5KBfolBCNmyU6zSRIcpohYbG7yhP7lhyGECIO8IYmzzMJta0z276TGgkTCZVu VvTqjiNFxMLJZbmf/nUOi2BGm2Y5euv702DDbcsiMdOQXd3uoR4i1KD4tUgCulO0C3P3 JlkQ== X-Gm-Message-State: ACgBeo0Pek9J6c2ZH5MTSTIXHQYEwMetzvb5JeLC3jCZ3/QNohRaxVGP 9Nwq5Aw/oDdmiQUtIR+ofo+ZyQ== X-Google-Smtp-Source: AA6agR5NFcyJ/Eqf8Y9BodOFj30Va8ngMe4AFfAM05tfz+rj1jrOYkiljXtYcYQIlQH1KwHfCRTqmw== X-Received: by 2002:a17:903:1209:b0:16c:ece7:f68b with SMTP id l9-20020a170903120900b0016cece7f68bmr19190691plh.112.1659401359068; Mon, 01 Aug 2022 17:49:19 -0700 (PDT) Received: from google.com (7.104.168.34.bc.googleusercontent.com. [34.168.104.7]) by smtp.gmail.com with ESMTPSA id o1-20020a170902d4c100b0016c4147e48asm5966869plg.219.2022.08.01.17.49.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 01 Aug 2022 17:49:18 -0700 (PDT) Date: Tue, 2 Aug 2022 00:49:14 +0000 From: Sean Christopherson To: Chao Peng Cc: Wei Wang , "Gupta, Pankaj" , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org, Paolo Bonzini , Jonathan Corbet , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Message-ID: References: <20220719140843.GA84779@chaop.bj.intel.com> <36e671d2-6b95-8e4f-c2ac-fee4b2670c6e@amd.com> <20220720150706.GB124133@chaop.bj.intel.com> <45ae9f57-d595-f202-abb5-26a03a2ca131@linux.intel.com> <20220721092906.GA153288@chaop.bj.intel.com> <20220725130417.GA304216@chaop.bj.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 29, 2022, Sean Christopherson wrote: > On Mon, Jul 25, 2022, Chao Peng wrote: > > On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote: > > > On Thu, Jul 21, 2022, Chao Peng wrote: > > > > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote: > > > > > > > > > > > > > > > On 7/21/22 00:21, Sean Christopherson wrote: > > > > > Maybe you could tag it with cgs for all the confidential guest support > > > > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem() > > > > > > > > > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; > > > > > ... > > > > > kvm_vm_ioctl_set_cgs_mem(, is_private) > > > > > > > > If we plan to widely use such abbr. through KVM (e.g. it's well known), > > > > I'm fine. > > > > > > I'd prefer to stay away from "confidential guest", and away from any VM-scoped > > > name for that matter. User-unmappable memmory has use cases beyond hiding guest > > > state from the host, e.g. userspace could use inaccessible/unmappable memory to > > > harden itself against unintentional access to guest memory. > > > > > > > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610 > > > > But I also don't quite like it, it's so generic and sounds say nothing. > > > > > > > > But I do want a name can cover future usages other than just > > > > private/shared (pKVM for example may have a third state). > > > > > > I don't think there can be a third top-level state. Memory is either private to > > > the guest or it's not. There can be sub-states, e.g. memory could be selectively > > > shared or encrypted with a different key, in which case we'd need metadata to > > > track that state. > > > > > > Though that begs the question of whether or not private_fd is the correct > > > terminology. E.g. if guest memory is backed by a memfd that can't be mapped by > > > userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs > > > that memory into a device or another VM, then arguably that memory is shared, > > > especially the multi-VM scenario. > > > > > > For TDX and SNP "private vs. shared" is likely the correct terminology given the > > > current specs, but for generic KVM it's probably better to align with whatever > > > terminology is used for memfd. "inaccessible_fd" and "user_inaccessible_fd" are > > > a bit odd since the fd itself is accesible. > > > > > > What about "user_unmappable"? E.g. > > > > > > F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY, > > > MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc... > > > > For KVM I also think user_unmappable looks better than 'private', e.g. > > user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sounds more > > appropriate names. For memfd however, I don't feel that strong to change > > it from current 'inaccessible' to 'user_unmappable', one of the reason > > is it's not just about unmappable, but actually also inaccessible > > through direct ioctls like read()/write(). > > Heh, I _knew_ there had to be a catch. I agree that INACCESSIBLE is better for > memfd. Thought about this some more... I think we should avoid UNMAPPABLE even on the KVM side of things for the core memslots functionality and instead be very literal, e.g. KVM_HAS_FD_BASED_MEMSLOTS KVM_MEM_FD_VALID We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to the memslot. Decoupling the two thingis will require a bit of extra work, but the code impact should be quite small, e.g. explicitly query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private. And unless I'm missing something, it won't require an additional memslot flag. The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for fd-based memslots for VM types that don't support private memory, i.e. userspace can't opt out of using the fd-based backing, but that doesn't seem like a deal breaker. Decoupling private memory from fd-based memslots will allow using fd-based memslots for backing VMs even if the memory is user mappable, which opens up potentially interesting use cases. It would also allow testing some parts of fd-based memslots with existing VMs. The big advantage of KVM's hva-based memslots is that KVM doesn't care what's backing a memslot, and so (in thoery) enabling new backing stores for KVM is free. It's not always free, but at this point I think we've eliminated most of the hiccups, e.g. x86's MMU should no longer require additional enlightenment to support huge pages for new backing types. On the flip-side, a big disadvantage of hva-based memslots is that KVM doesn't _know_ what's backing a memslot. This is one of the major reasons, if not _the_ main reason at this point, why KVM binds a VM to a single virtual address space. Running with different hva=>pfn mappings would either be completely unsafe or prohibitively expensive (nearly impossible?) to ensure. With fd-based memslots, KVM essentially binds a memslot directly to the backing store. This allows KVM to do a "deep" comparison of a memslot between two address spaces simply by checking that the backing store is the same. For intra-host/copyless migration (to upgrade the userspace VMM), being able to do a deep comparison would theoretically allow transferring KVM's page tables between VMs instead of forcing the target VM to rebuild the page tables. There are memcg complications (and probably many others) for transferring page tables, but I'm pretty sure it could work. I don't have a concrete use case (this is a recent idea on my end), but since we're already adding fd-based memory, I can't think of a good reason not make it more generic for not much extra cost. And there are definitely classes of VMs for which fd-based memory would Just Work, e.g. large VMs that are never oversubscribed on memory don't need to support reclaim, so the fact that fd-based memslots won't support page aging (among other things) right away is a non-issue.