From: David Hildenbrand
Organization: Red Hat
To: Yu Zhang
Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
 Wanpeng Li, Jim Mattson, Joerg Roedel, kvm list,
 Linux Kernel Mailing List, Borislav Petkov, Andrew Morton, Joerg Roedel,
 Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
 Thomas Gleixner, "Peter Zijlstra (Intel)", Ingo Molnar, Varad Gautam,
 Dario Faggioli, the arch/x86 maintainers, linux-mm@kvack.org,
 linux-coco@lists.linux.dev, "Kirill A. Shutemov", "Kirill A. Shutemov",
 Sathyanarayanan Kuppuswamy, Dave Hansen
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Thu, 2 Sep 2021 10:44:02 +0200
Message-ID: <823d9453-892e-508d-b806-1b18c9b9fc13@redhat.com>
In-Reply-To: <20210902083453.aeouc6fob53ydtc2@linux.intel.com>
References: <20210824005248.200037-1-seanjc@google.com>
 <307d385a-a263-276f-28eb-4bc8dd287e32@redhat.com>
 <20210827023150.jotwvom7mlsawjh4@linux.intel.com>
 <8f3630ff-bd6d-4d57-8c67-6637ea2c9560@www.fastmail.com>
 <20210901102437.g5wrgezmrjqn3mvy@linux.intel.com>
 <2b2740ec-fa89-e4c3-d175-824e439874a6@redhat.com>
 <20210902083453.aeouc6fob53ydtc2@linux.intel.com>

On 02.09.21 10:34, Yu Zhang wrote:
> On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote:
>> On 01.09.21 18:07, Andy Lutomirski wrote:
>>> On 9/1/21 3:24 AM, Yu Zhang wrote:
>>>> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
>>>>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
>>>>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> Thanks a lot for this summary. A question about the requirement: do we
>>>>>> or do we not have a plan to support assigned devices for the protected
>>>>>> VM?
>>>>>>
>>>>>> If yes, the fd-based solution may need changes to the VFIO interface as
>>>>>> well (though the fake swap entry solution needs to mess with VFIO too),
>>>>>> because:
>>>>>>
>>>>>> 1> KVM uses VFIO when assigning devices into a VM.
>>>>>>
>>>>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffers,
>>>>>> all guest pages will have to be mapped in the host IOMMU page table to
>>>>>> host pages, which are pinned during the whole life cycle of the VM.
>>>>>>
>>>>>> 3> IOMMU mapping is done during VM creation time by VFIO and the IOMMU
>>>>>> driver, in vfio_dma_do_map().
>>>>>>
>>>>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the
>>>>>> HPA and pin the page.
>>>>>>
>>>>>> But if we are using an fd-based solution, not every GPA can have an HVA,
>>>>>> thus the current VFIO interface to map and pin the GPA (IOVA) won't
>>>>>> work. And I doubt VFIO can be modified to support this easily.
>>>>>>
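For reference, the HVA-based contract being described looks roughly like this
from the userspace side. This is only a sketch: the struct and ioctl are the
existing type1 uapi from <linux/vfio.h>, the helper name is made up, and any
fd+offset replacement would be a new, so far hypothetical, extension.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

/*
 * Map one chunk of guest RAM into the device's IOMMU domain.
 * "vaddr" is the HVA of that chunk in the VMM; the VFIO type1 driver
 * GUPs and pins it, then programs the IOMMU so that DMA to "iova"
 * (the GPA the guest uses) hits those host pages.
 */
static int map_guest_ram(int container_fd, void *vaddr, uint64_t iova,
                         uint64_t size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;  /* the HVA -- the part that has no
                                          equivalent for fd-only memory */
        map.iova  = iova;
        map.size  = size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

Everything hinges on .vaddr being GUP-able, which is exactly what fd-only
guest memory would no longer guarantee.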
>>>>>
>>>>> Do you mean assigning a normal device to a protected VM or a hypothetical
>>>>> protected-MMIO device?
>>>>>
>>>>> If the former, it should work more or less like with a non-protected VM.
>>>>> mmap the VFIO device, set up a memslot, and use it. I'm not sure whether
>>>>> anyone will actually do this, but it should be possible, at least in
>>>>> principle. Maybe someone will want to assign a NIC to a TDX guest. An
>>>>> NVMe device with the understanding that the guest can't trust it wouldn't
>>>>> be entirely crazy either.
>>>>>
>>>>> If the latter, AFAIK there is no spec for how it would work even in
>>>>> principle. Presumably it wouldn't work quite like VFIO -- instead, the
>>>>> kernel could have a protection-virtual-io-fd mechanism, and that fd could
>>>>> be bound to a memslot in whatever way we settle on for binding secure
>>>>> memory to a memslot.
>>>>
>>>> Thanks Andy. I was asking about the first scenario.
>>>>
>>>> Well, I agree it is doable if someone really wants an assigned
>>>> device in a TD guest. As Kevin mentioned in his reply, the HPA can be
>>>> generated by extending VFIO with a new mapping protocol which
>>>> uses fd+offset instead of the HVA.
>>>
>>> I'm confused. I don't see why any new code is needed for this at all.
>>> Every proposal I've seen for handling TDX memory continues to handle TDX
>>> *shared* memory exactly like regular guest memory today. The only
>>> differences are that more hole punching will be needed, which will
>>> require lightweight memslots (to have many of them), memslots with
>>> holes, or mappings backing memslots with holes (which can be done with
>>> munmap() on current kernels).
>>>
>>> So you can literally just mmap a VFIO device and expect it to work,
>>> exactly like it does right now. Whether the guest will be willing to
>>> use the device will depend on the guest security policy (all kinds of
>>> patches about that are flying around), but if the guest tries to use it,
>>> it really should just work.
>>
>> ... but if you end up mapping private memory into the IOMMU of the device
>> and the device ends up accessing that memory, we're in the same position
>> that the host might get capped, just like access from user space, no?
>
> Well, according to the spec:
>
> - If the result of the translation results in a physical address with a TD
>   private key ID, then the IOMMU will abort the transaction and report a
>   VT-d DMA remapping failure.
>
> - If the GPA in the transaction that is input to the IOMMU is private
>   (SHARED bit is 0), then the IOMMU may abort the transaction and report a
>   VT-d DMA remapping failure.
>
> So I guess mapping private GPAs in the IOMMU is not as dangerous as mapping
> them into userspace. Though still wrong.
>
>> Sure, you can map only the complete duplicate shared-memory region into the
>> IOMMU of the device, that would work. Shame vfio mostly always pins all
>> guest memory and you essentially cannot punch holes into the shared memory
>> anymore -- resulting in the worst case in duplicate memory consumption for
>> your VM.
>>
>> So you'd actually want to map only the *currently* shared pieces into the
>> IOMMU and update the mappings on demand. Having worked on something related,
>
> Exactly. On-demand mapping and page pinning for shared memory is necessary.
>
>> I can only say that 64k individual mappings, and not being able to modify
>> existing mappings except completely deleting them to replace with something
>> new (!atomic), can be quite an issue for bigger VMs.
>
> Do you mean atomicity in mapping/unmapping can hardly be guaranteed during
> the shared <-> private transition? May I ask for elaboration? Thanks!

If we expect to really only have little shared memory, and expect a VM
always has no shared memory when booting up (I think this is the case),
I guess this could work.

The issue is if the guest, e.g., makes a contiguous 2 MiB range shared and
later wants to unshare individual pages/parts.

You'll have to DMA map the 2 MiB in page granularity; otherwise you'll
have to DMA unmap the 2 MiB and DMA remap all still-shared pieces. That is
not atomic and can be problematic if the device is accessing some of the
shared parts at that time.

Consequently, that means that large shared regions can be problematic
when mapped, because we'll have to map in page granularity, and we only
have 64k such individual mappings in general: 64k * 4 KiB == 256 MiB.
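In terms of the current type1 uapi, the page-granularity variant would look
roughly like the sketch below. The helper names are made up and it assumes
the shared range still has an HVA on the host side; only the ioctls and
structs are real.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>

#define SHARED_PAGE_SIZE 4096ULL

/*
 * Map a freshly shared GPA range in 4 KiB pieces, so that a single page
 * can be unshared later without unmapping its neighbours.
 */
static int map_shared_range(int container_fd, uint64_t hva, uint64_t gpa,
                            uint64_t size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        };
        uint64_t off;

        for (off = 0; off < size; off += SHARED_PAGE_SIZE) {
                map.vaddr = hva + off;
                map.iova  = gpa + off;
                map.size  = SHARED_PAGE_SIZE;  /* one mapping entry per page */
                if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map))
                        return -1;
        }
        return 0;
}

/* Unsharing one page then only touches that page's own mapping ... */
static int unmap_shared_page(int container_fd, uint64_t gpa)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = gpa,
                .size  = SHARED_PAGE_SIZE,
        };

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

... but each shared page burns one of the roughly 64k mapping entries, which
is where the 256 MiB figure above comes from.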
Not sure if there would be use cases, e.g., with GPGPUs and similar,
where you'd want to share a lot of memory with a device ...

But these are just my thoughts, maybe I am missing something important.

--
Thanks,

David / dhildenb