Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Ram Pai <linuxram@us.ibm.com>
Cc: andmike@us.ibm.com, mst@redhat.com, mdroth@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org, ram.n.pai@gmail.com, cai@lca.pw,
	tglx@linutronix.de, sukadev@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, hch@lst.de,
	bauerman@linux.ibm.com, david@gibson.dropbear.id.au
Subject: Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
Date: Wed, 4 Dec 2019 11:04:04 +1100
Message-ID: <3a17372a-fcee-efbf-0a05-282ffb1adc90@ozlabs.ru>
In-Reply-To: <20191203165204.GA5079@oc0525413822.ibm.com>

On 04/12/2019 03:52, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 03/12/2019 15:05, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 03/12/2019 13:08, Ram Pai wrote:
>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>>>>>> kernel for this purpose. On secure VMs, the hypervisor sees only
>>>>>>> encrypted TCE entries when it accesses this page.  The hypervisor
>>>>>>> needs to know the unencrypted entries, to update the TCE table
>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
>>>>>>> Hence share the page with the hypervisor.
>>>>>>
>>>>>> This unsecures a page at a random place in the guest, which creates
>>>>>> an additional attack surface. It is hard to exploit indeed, but
>>>>>> nevertheless it is there.
>>>>>> A safer option would be not to use the
>>>>>> hcall-multi-tce hypertas option (which translates to
>>>>>> FW_FEATURE_MULTITCE in the guest).
>>>>>
>>>>>
>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
>>>>> automatically when IOMMU option is enabled.
>>>>
>>>> It is advertised by QEMU but the guest does not have to use it.
>>>
>>> Are you suggesting that even a normal guest should not use
>>> hcall-multi-tce, or just a secure guest?
>>
>>
>> Just secure.
> 
> hmm..  how are the TCE entries communicated to the hypervisor, if
> hcall-multi-tce is disabled?

Via H_PUT_TCE, which updates one entry at a time (sets or clears it).
hcall-multi-tce enables H_PUT_TCE_INDIRECT (512 entries at once) and
H_STUFF_TCE (clearing many entries at once, up to 4bln?); these are
simply optimizations.
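
To put rough numbers on the optimization, here is a minimal Python sketch
that counts the hcalls needed to map a run of TCEs with and without
H_PUT_TCE_INDIRECT. The 512-entry batch size is the figure quoted above;
the function names are hypothetical, and this models only the call count,
not the real hcall interface.

```python
# Illustrative model only: it counts hypercalls, it does not issue any.
TCES_PER_INDIRECT_CALL = 512  # figure from the text: 512 entries at once


def hcalls_with_put_tce(npages: int) -> int:
    """H_PUT_TCE sets or clears one entry per call."""
    return npages


def hcalls_with_put_tce_indirect(npages: int) -> int:
    """H_PUT_TCE_INDIRECT passes up to 512 entries per call."""
    # Ceiling division: a partially filled last batch still costs one call.
    return -(-npages // TCES_PER_INDIRECT_CALL)


if __name__ == "__main__":
    for npages in (1, 512, 513, 16384):
        print(npages,
              hcalls_with_put_tce(npages),
              hcalls_with_put_tce_indirect(npages))
```

Assuming 4K IOMMU pages, a 64MB bounce buffer area is 16384 pages, so
mapping it once costs 16384 H_PUT_TCE calls instead of 32 indirect ones.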

> 
>>
>>
>>>
>>>>
>>>>> This happens even
>>>>> on a normal VM when IOMMU is enabled.
>>>>>
>>>>>
>>>>>>
>>>>>> Also what is this for anyway? 
>>>>>
>>>>> This is for sending indirect-TCE entries to the hypervisor.
>>>>> The hypervisor must be able to read those TCE entries, so that it can
>>>>> use those entries to populate the TCE table with the correct mappings.
>>>>>
>>>>>> if I understand things right, you cannot
>>>>>> map any random guest memory, you should only be mapping that 64MB-ish
>>>>>> bounce buffer array but 1) I do not see that happening (I may have
>>>>>> missed it) 2) it should be done once and it takes a little time for
>>>>>> whatever memory size we allow for bounce buffers anyway. Thanks,
>>>>>
>>>>> Any random guest memory can be shared by the guest. 
>>>>
>>>> Yes but we do not want this to be this random. 
>>>
>>> It is not sharing some random page. It is sharing a page that is
>>> ear-marked for communicating TCE entries. Yes the address of the page
>>> can be random, depending on where the allocator decides to allocate it.
>>> The purpose of the page is not random.
>>
>> I was talking about the location.
>>
>>
>>> That page is used for one specific purpose; to communicate the TCE
>>> entries to the hypervisor.  
>>>
>>>> I thought the whole idea
>>>> of swiotlb was to restrict the amount of shared memory to bare minimum,
>>>> what do I miss?
>>>
>>> I think you are making an incorrect connection between this patch and
>>> SWIOTLB.  This patch has nothing to do with SWIOTLB.
>>
>> I can see this and this is the confusing part.
>>
>>
>>>>
>>>>> Maybe you are confusing this with the SWIOTLB bounce buffers used by
>>>>> PCI devices, to transfer data to the hypervisor?
>>>>
>>>> Is not this for pci+swiotlb? 
>>>
>>>
>>> No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
>>> different set of pages allocated and earmarked for bounce buffering.
>>>
>>> This patch is purely to help the hypervisor set up the TCE table, in the
>>> presence of an IOMMU.
>>
>> Then the hypervisor should be able to access the guest pages mapped for
>> DMA and these pages should be made unsecure for this to work. Where/when
>> does this happen?
> 
> This happens in the SWIOTLB code.  The code to do that is already
> upstream.  
>
> The sharing of the pages containing the SWIOTLB bounce buffers is done
> in init_svm() which calls swiotlb_update_mem_attributes() which calls
> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
> uv_share_page().

This does not seem enough: when you enforce iommu_platform=on, QEMU
starts accessing virtio buffers via the IOMMU, so the bounce buffers have
to be mapped explicitly via H_PUT_TCE&co. Where does this happen?
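
To make the question concrete, here is a toy Python model of the two
separate conditions being discussed: a page being shared with the
hypervisor (cf. uv_share_page()) and a page being mapped in the TCE table
(cf. H_PUT_TCE). The class and method names are hypothetical; this is a
sketch of the argument, not kernel or QEMU code.

```python
# Toy model only: all names are hypothetical, nothing here is a real API.
class GuestModel:
    def __init__(self):
        # Guest page frames made unsecure (shared with the hypervisor).
        self.shared_pfns = set()
        # TCE table: I/O page number -> guest page frame number.
        self.tce_table = {}

    def share_page(self, pfn):
        """Stands in for sharing a page, as uv_share_page() does."""
        self.shared_pfns.add(pfn)

    def put_tce(self, io_pfn, pfn):
        """Stands in for installing one TCE, as H_PUT_TCE does."""
        self.tce_table[io_pfn] = pfn

    def device_can_dma(self, io_pfn):
        """With the IOMMU enforced, DMA needs a TCE *and* a shared page."""
        pfn = self.tce_table.get(io_pfn)
        return pfn is not None and pfn in self.shared_pfns


if __name__ == "__main__":
    guest = GuestModel()
    guest.share_page(100)            # bounce buffer page is shared...
    print(guest.device_can_dma(0))   # ...but not yet mapped
    guest.put_tce(0, 100)            # now mapped as well
    print(guest.device_can_dma(0))
```

In this model, sharing the bounce buffers alone leaves the device unable to
reach them; each one also needs a TCE entry.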

> 
> The code that bounces the contents of an I/O buffer through the
> SWIOTLB buffers is in swiotlb_bounce().
> 
>>
>>
>>>> The cover letter suggests it is for
>>>> virtio-scsi-_pci_ with iommu_platform=on which makes it a
>>>> normal pci device just like emulated XHCI. Thanks,
>>>
>>> Well, I guess the cover letter is probably confusing. There are two
>>> patches, which together enable virtio on secure guests, in the presence
>>> of an IOMMU.
>>>
>>> The second patch enables virtio, in the presence of an IOMMU, to use
>>> the DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to
>>> virtio devices.
>>
>> The second patch does nothing in relation to the problem being solved.
> 
> The second patch registers dma_iommu_ops with the PCI-system.  Doing so
> enables I/O to take the dma_iommu_ops path, which internally 
> leads it through the SWIOTLB path. Without that, the I/O fails to reach
> its destination.

This is not what the commit log says. What DMA ops were used before 2/2?
I thought it was NULL, which should have turned into direct, which then
would switch to swiotlb, but since the recent DMA reworks it is even
harder to tell what happens with DMA setup. Thanks,

>>> However, that by itself won't work if the TCE entries are not correctly
>>> set up in the TCE tables.  The first patch, i.e. this patch, helps
>>> accomplish that.
>>>> Hope this clears up the confusion.

-- 
Alexey