Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
Date: Wed, 10 May 2023 14:39:18 -0700
From: Sean Christopherson
To: Vishal Annapurve
References: <658018f9-581c-7786-795a-85227c712be0@redhat.com>
            <1ed06a62-05a1-ebe6-7ac4-5b35ba272d13@redhat.com>
            <9efef45f-e9f4-18d1-0120-f0fc0961761c@redhat.com>
            <5869f50f-0858-ab0c-9049-4345abcf5641@redhat.com>
Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
    Joerg Roedel, Maciej S. Szmigiero, Vlastimil Babka, Yu Zhang, Kirill A.
    Shutemov, dhildenb@redhat.com, Quentin Perret, tabba@google.com,
    Michael Roth, wei.w.wang@intel.com, Mike Rapoport, Liam Merwick,
    Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, Hugh Dickins, Christian Brauner

On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson wrote:
> >
> > ...
> > cold.  I poked around a bit to see how we could avoid reinventing all of
> > that infrastructure for fd-only memory, and the best idea I could come up
> > with is basically a rehash of Kirill's very original "KVM protected memory"
> > RFC [3], i.e. allow "mapping" fd-only memory, but ensure that memory is
> > never actually present from hardware's perspective.
> >
>
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already shot down direction here.

Both :-)

> But I would still like to get this discussed here before we move on.
>
> I am wondering if it would make sense to implement
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, inline with Kirill's original proposal now that the file
> implementation is controlled by KVM.
>
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them but all
> accesses to these files result into faults as no memory is allowed to
> be mapped into userspace VMM pagetables.

Never mapping anything into the userspace page table is infeasible.  Technically
it's doable, but it'd effectively require all of the work of an fd-based approach
(and probably significantly more), _and_ it'd require touching core mm code.

VMAs don't provide hva=>pfn information; they're the kernel's way of implementing
the abstraction provided to userspace by mmap(), mprotect(), etc.  Among many
other things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
RWX protections, etc.  But a VMA doesn't track the physical address; that info
is all managed through the userspace page tables.

To make it possible to allow userspace to mmap() but not access memory (without
redoing how the kernel fundamentally manages virtual=>physical mappings), the
simplest approach is to install PTEs into userspace page tables, but never mark
them Present in hardware, i.e. prevent actually accessing the backing memory.
This is exactly what Kirill's series in link [3] below implemented.

Issues that led to us abandoning the "map with special !Present PTEs" approach:

 - Using page tables, i.e. hardware-defined structures, to track gfn=>pfn
   mappings is inefficient and inflexible compared to software-defined
   structures, especially for the expected use cases for CoCo guests.

 - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
   let alone a 1:1 pfn:gfn mapping.

 - Does not work for memory that isn't backed by 'struct page', e.g. if devices
   gain support for exposing encrypted memory regions to guests.

 - Poking into the VMAs to convert memory would likely be less performant due
   to using infrastructure that is much "heavier", e.g. would require taking
   mmap_lock for write.
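For illustration only, a minimal, assumption-laden sketch of the "install a PTE
but leave the hardware Present bit clear" idea (x86 macros, hypothetical helper,
not code from Kirill's series; a real implementation would also have to teach
the fault and swap-entry paths to recognize such PTEs):

  #include <linux/mm.h>

  /*
   * Conceptual sketch: record the pfn and protections in the PTE, but keep
   * the hardware Present bit clear so any userspace access faults instead
   * of reaching the backing memory.
   */
  static void install_not_present_pte(struct vm_area_struct *vma, pte_t *ptep,
                                      unsigned long addr, unsigned long pfn)
  {
          pgprot_t prot = __pgprot(pgprot_val(vma->vm_page_prot) & ~_PAGE_PRESENT);

          /* Caller is assumed to hold the page table lock for @ptep. */
          set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, prot));
  }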
In short, shoehorning this into mmap() requires fighting how the kernel works
at pretty much every step, and in the end, adding e.g. fbind() is a lot easier.

> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
>
> Caveat here is that once private pages are mapped into userspace view.
>
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.

#3 isn't entirely correct.  If a different process (call it "B") maps shared
memory, and then the guest converts that memory from shared to private, the
backing pages for the previously shared mapping will still be mapped by
process B unless userspace ensures process B also unmaps on conversion.

#3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages,
keeping the 1GiB mapping is desirable if the guest converts a few KiB of
memory to shared, and possibly even if the guest converts a few MiB of memory.

> > Code is available here if folks want to take a look before any kind of
> > formal posting:
> >
> >   https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
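For illustration, a rough userspace sketch of steps 1) and 2) from the proposal
quoted above.  The mmap-able guest_mem file and the KVM_MEM_PRIVATE flag are
assumptions taken from the series under discussion, not existing or stable KVM
ABI, and the flag value below is a placeholder; error handling is trimmed:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/kvm.h>

  #ifndef KVM_MEM_PRIVATE
  #define KVM_MEM_PRIVATE (1u << 2)       /* placeholder value, not stable ABI */
  #endif

  /* Hypothetical helper; gmem_fd is a guest_mem/restrictedmem-style fd. */
  static int map_and_register(int vm_fd, int gmem_fd, uint64_t gpa, uint64_t size)
  {
          /*
           * 1) mmap() the guest_mem file; under the proposal, accesses fault
           *    until the corresponding range is converted to shared.
           */
          void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                           gmem_fd, 0);
          if (hva == MAP_FAILED)
                  return -1;

          /* 2) Register the HVA range as a private-capable memslot. */
          struct kvm_userspace_memory_region region = {
                  .slot            = 0,
                  .flags           = KVM_MEM_PRIVATE,
                  .guest_phys_addr = gpa,
                  .memory_size     = size,
                  .userspace_addr  = (uintptr_t)hva,
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }

Step 3)'s attribute conversions would then control which ranges of that mapping
are actually allowed to fault memory in.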