Re: [PATCH 1/4] iommu/amd: Introduce Protection-domain flag VFIO

From: "Kalra, Ashish" <ashish.kalra@amd.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
	joro@8bytes.org, robin.murphy@arm.com, thomas.lendacky@amd.com,
	vasant.hegde@amd.com, jon.grimm@amd.com
Subject: Re: [PATCH 1/4] iommu/amd: Introduce Protection-domain flag VFIO
Date: Fri, 20 Jan 2023 11:01:21 -0600	[thread overview]
Message-ID: <1ba09b11-8a07-24dd-a99f-eeacb2f5c96c@amd.com> (raw)
In-Reply-To: <Y8q9ocj2IZB2r6Np@ziepe.ca>

Hello Jason,

On 1/20/2023 10:13 AM, Jason Gunthorpe wrote:
> On Fri, Jan 20, 2023 at 09:12:26AM -0600, Kalra, Ashish wrote:
>> On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
>>> On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
>>>> Hello Jason,
>>>>
>>>> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
>>>>> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>>>>>> Currently, to detect if a domain is enabled with VFIO support, the driver
>>>>>> checks if the domain has devices attached and check if the domain type is
>>>>>> IOMMU_DOMAIN_UNMANAGED.
>>>>>
>>>>> NAK
>>>>>
>>>>> If you need weird HW specific stuff like this then please implement it
>>>>> properly in iommufd, not try and randomly guess what things need from
>>>>> the domain type.
>>>>>
>>>>> All this confidential computing stuff needs a comprehensive solution,
>>>>> not some piecemeal mess. How can you even use a CC guest with VFIO in
>>>>> the upstream kernel? Hmm?
>>>>>
>>>>
>>>> Currently all guest devices are untrusted - whether they are emulated,
>>>> virtio or passthrough. In the current use case of VFIO device-passthrough to
>>>> an SNP guest, the pass-through device will perform DMA to un-encrypted or
>>>> shared guest memory, in the same way as virtio or emulated devices.
>>>>
>>>> This fix is prompted by an issue reported by Nvidia, they are trying to do
>>>> PCIe device passthrough to SNP guest. The memory allocated for DMA is
>>>> through dma_alloc_coherent() in the SNP guest and during DMA I/O an
>>>> RMP_PAGE_FAULT is observed on the host.
>>>>
>>>> These dma_alloc_coherent() calls map into page state change hypercalls into
>>>> the host to change guest page state from encrypted to shared in the RMP
>>>> table.
>>>>
>>>> Following is a link to issue discussed above:
>>>> https://github.com/AMDESE/AMDSEV/issues/109
>>>
>>> Wow you should really write all of this in the commmit message
>>>
>>>> Now, to set individual 4K entries to different shared/private
>>>> mappings in NPT or host page tables for large page entries, the RMP
>>>> and NPT/host page table large page entries are split to 4K pte’s.
>>>
>>> Why are mappings to private pages even in the iommu in the first
>>> place - and how did they even get there?
>>>
>>
>> You seem to be confusing between host/NPT page tables and IOMMU page tables.
> 
> No, I haven't. I'm repeating what was said:
> 
>   during DMA I/O an RMP_PAGE_FAULT is observed on the host.
> 
> So, I'm interested to hear how you can get a RMP_PAGE_FAULT from the
> IOMMU if the IOMMU is only programmed with shared pages that, by (my)
> definition, are accessible to the CPU and should not generate a
> RMP_PAGE_FAULT?

Yes, sorry i got confused with your use of the word private as you 
mention below.

We basically get the RMP #PF from the IOMMU because there is a page size 
mismatch between the RMP table and the IOMMU page table. The RMP table's 
large page entry has been smashed to 4K PTEs to handle page state change 
to shared on 4K mappings, so this change has to be synced up with the 
IOMMU page table, otherwise there is now a page size mismatch between 
RMP table and IOMMU page table which causes the RMP #PF.

Thanks,
Ashish

> 
> I think you are confusing my use of the word private with some AMD
> architecture deatils. When I say private I mean that the host CPU will
> generate a violation if it tries to access the memory.
> 
> I think the conclusion is logical - if the IOMMU is experiencing a
> protection violation it is because the IOMMU was programed with PFNs
> it is not allowed to access - and so why was that even done in the
> first place?
> 
> I suppose what is going on is you program the IOPTEs with PFNs of
> unknown state and when the PFN changes access protections the IOMMU
> can simply use it without needing to synchronize with the access
> protection change. And your problem is that the granularity of access
> protection change does not match the IOPTE granularity in the IOMMU.
> 
> But this seems very wasteful as the IOMMU will be using IOPTEs and
> also will pin the memory when the systems *knows* this memory cannot
> be accessed through the IOMMU. It seems much better to dynamically
> establish IOMMU mappings only when you learn that the memory is
> actually accesisble to the IOMMU.
> 
> Also, I thought the leading plan for CC was to use the memfd approach here:
> 
> https://lore.kernel.org/kvm/20220915142913.2213336-1-chao.p.peng@linux.intel.com/
> 
> Which prevents mmaping the memory to userspace - so how did it get
> into the IOMMU in the first place?
> 
> Jason
>