From: David Hildenbrand
Organization: Red Hat
To: Yu Zhang
Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
 Wanpeng Li, Jim Mattson, Joerg Roedel, kvm list,
 Linux Kernel Mailing List, Borislav Petkov, Andrew Morton, Joerg Roedel,
 Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
 Thomas Gleixner, "Peter Zijlstra (Intel)", Ingo Molnar, Varad Gautam,
 Dario Faggioli, the arch/x86 maintainers, linux-mm@kvack.org,
 linux-coco@lists.linux.dev, "Kirill A. Shutemov", "Kirill A. Shutemov",
 Sathyanarayanan Kuppuswamy, Dave Hansen
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Thu, 2 Sep 2021 10:44:02 +0200
Message-ID: <823d9453-892e-508d-b806-1b18c9b9fc13@redhat.com>
In-Reply-To: <20210902083453.aeouc6fob53ydtc2@linux.intel.com>
References: <20210824005248.200037-1-seanjc@google.com>
 <307d385a-a263-276f-28eb-4bc8dd287e32@redhat.com>
 <20210827023150.jotwvom7mlsawjh4@linux.intel.com>
 <8f3630ff-bd6d-4d57-8c67-6637ea2c9560@www.fastmail.com>
 <20210901102437.g5wrgezmrjqn3mvy@linux.intel.com>
 <2b2740ec-fa89-e4c3-d175-824e439874a6@redhat.com>
 <20210902083453.aeouc6fob53ydtc2@linux.intel.com>

On 02.09.21 10:34, Yu Zhang wrote:
> On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote:
>> On 01.09.21 18:07, Andy Lutomirski wrote:
>>> On 9/1/21 3:24 AM, Yu Zhang wrote:
>>>> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
>>>>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
>>>>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> Thanks a lot for this summary. A question about the requirement: do we
>>>>>> or do we not have a plan to support assigned devices for the protected
>>>>>> VM?
>>>>>>
>>>>>> If yes, the fd-based solution may need changes to the VFIO interface as
>>>>>> well (though the fake swap entry solution needs to mess with VFIO too),
>>>>>> because:
>>>>>>
>>>>>> 1> KVM uses VFIO when assigning devices into a VM.
>>>>>>
>>>>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffers,
>>>>>> all guest pages will have to be mapped in the host IOMMU page table to
>>>>>> host pages, which are pinned during the whole life cycle of the VM.
>>>>>>
>>>>>> 3> IOMMU mapping is done during VM creation time by VFIO and the IOMMU
>>>>>> driver, in vfio_dma_do_map().
>>>>>>
>>>>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the
>>>>>> HPA and pin the page.
>>>>>>
>>>>>> But if we are using an fd-based solution, not every GPA can have an HVA,
>>>>>> thus the current VFIO interface to map and pin the GPA (IOVA) won't
>>>>>> work. And I doubt VFIO can be modified to support this easily.
>>>>>>
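For reference, the HVA-based contract being described looks roughly like this
from the userspace side. This is only a sketch: the struct and ioctl are the
existing type1 uapi from <linux/vfio.h>, the helper name is made up, and any
fd+offset replacement would be a new, so far hypothetical, extension.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

/*
 * Map one chunk of guest RAM into the device's IOMMU domain.
 * "vaddr" is the HVA of that chunk in the VMM; the VFIO type1 driver
 * GUPs and pins it, then programs the IOMMU so that DMA to "iova"
 * (the GPA the guest uses) hits those host pages.
 */
static int map_guest_ram(int container_fd, void *vaddr, uint64_t iova,
                         uint64_t size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;  /* the HVA -- the part that has no
                                          equivalent for fd-only memory */
        map.iova  = iova;
        map.size  = size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

Everything hinges on .vaddr being GUP-able, which is exactly what fd-only
guest memory would no longer guarantee.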
>>>>>
>>>>> Do you mean assigning a normal device to a protected VM or a hypothetical
>>>>> protected-MMIO device?
>>>>>
>>>>> If the former, it should work more or less like with a non-protected VM.
>>>>> mmap the VFIO device, set up a memslot, and use it. I'm not sure whether
>>>>> anyone will actually do this, but it should be possible, at least in
>>>>> principle. Maybe someone will want to assign a NIC to a TDX guest. An
>>>>> NVMe device with the understanding that the guest can't trust it wouldn't
>>>>> be entirely crazy either.
>>>>>
>>>>> If the latter, AFAIK there is no spec for how it would work even in
>>>>> principle. Presumably it wouldn't work quite like VFIO -- instead, the
>>>>> kernel could have a protection-virtual-io-fd mechanism, and that fd could
>>>>> be bound to a memslot in whatever way we settle on for binding secure
>>>>> memory to a memslot.
>>>>
>>>> Thanks Andy. I was asking about the first scenario.
>>>>
>>>> Well, I agree it is doable if someone really wants an assigned
>>>> device in a TD guest. As Kevin mentioned in his reply, the HPA can be
>>>> generated by extending VFIO with a new mapping protocol which
>>>> uses fd+offset instead of the HVA.
>>>
>>> I'm confused. I don't see why any new code is needed for this at all.
>>> Every proposal I've seen for handling TDX memory continues to handle TDX
>>> *shared* memory exactly like regular guest memory today. The only
>>> differences are that more hole punching will be needed, which will
>>> require lightweight memslots (to have many of them), memslots with
>>> holes, or mappings backing memslots with holes (which can be done with
>>> munmap() on current kernels).
>>>
>>> So you can literally just mmap a VFIO device and expect it to work,
>>> exactly like it does right now. Whether the guest will be willing to
>>> use the device will depend on the guest security policy (all kinds of
>>> patches about that are flying around), but if the guest tries to use it,
>>> it really should just work.
>>
>> ... but if you end up mapping private memory into the IOMMU of the device
>> and the device ends up accessing that memory, we're in the same position
>> that the host might get capped, just like access from user space, no?
>
> Well, according to the spec:
>
> - If the result of the translation results in a physical address with a TD
>   private key ID, then the IOMMU will abort the transaction and report a
>   VT-d DMA remapping failure.
>
> - If the GPA in the transaction that is input to the IOMMU is private
>   (SHARED bit is 0), then the IOMMU may abort the transaction and report a
>   VT-d DMA remapping failure.
>
> So I guess mapping private GPAs in the IOMMU is not as dangerous as mapping
> them into userspace. Though still wrong.
>
>> Sure, you can map only the complete duplicate shared-memory region into the
>> IOMMU of the device, that would work. Shame vfio mostly always pins all
>> guest memory and you essentially cannot punch holes into the shared memory
>> anymore -- resulting in the worst case in duplicate memory consumption for
>> your VM.
>>
>> So you'd actually want to map only the *currently* shared pieces into the
>> IOMMU and update the mappings on demand. Having worked on something related,
>
> Exactly. On-demand mapping and page pinning for shared memory is necessary.
>
>> I can only say that 64k individual mappings, and not being able to modify
>> existing mappings except completely deleting them to replace with something
>> new (!atomic), can be quite an issue for bigger VMs.
>
> Do you mean atomicity in mapping/unmapping can hardly be guaranteed during
> the shared <-> private transition? May I ask for elaboration? Thanks!

If we expect to really only have little shared memory, and expect a VM
always has no shared memory when booting up (I think this is the case),
I guess this could work.

The issue is if the guest, e.g., makes a contiguous 2 MiB range shared and
later wants to unshare individual pages/parts.

You'll have to DMA map the 2 MiB in page granularity; otherwise you'll
have to DMA unmap the 2 MiB and DMA remap all still-shared pieces. That is
not atomic and can be problematic if the device is accessing some of the
shared parts at that time.

Consequently, that means that large shared regions can be problematic
when mapped, because we'll have to map in page granularity, and we only
have 64k such individual mappings in general: 64k * 4 KiB == 256 MiB.
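In terms of the current type1 uapi, the page-granularity variant would look
roughly like the sketch below. The helper names are made up and it assumes
the shared range still has an HVA on the host side; only the ioctls and
structs are real.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>

#define SHARED_PAGE_SIZE 4096ULL

/*
 * Map a freshly shared GPA range in 4 KiB pieces, so that a single page
 * can be unshared later without unmapping its neighbours.
 */
static int map_shared_range(int container_fd, uint64_t hva, uint64_t gpa,
                            uint64_t size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        };
        uint64_t off;

        for (off = 0; off < size; off += SHARED_PAGE_SIZE) {
                map.vaddr = hva + off;
                map.iova  = gpa + off;
                map.size  = SHARED_PAGE_SIZE;  /* one mapping entry per page */
                if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map))
                        return -1;
        }
        return 0;
}

/* Unsharing one page then only touches that page's own mapping ... */
static int unmap_shared_page(int container_fd, uint64_t gpa)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = gpa,
                .size  = SHARED_PAGE_SIZE,
        };

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

... but each shared page burns one of the roughly 64k mapping entries, which
is where the 256 MiB figure above comes from.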
Not sure if there would be use cases, e.g., with GPGPUs and similar,
where you'd want to share a lot of memory with a device ...

But these are just my thoughts, maybe I am missing something important.

--
Thanks,

David / dhildenb