From: Jason Wang <jasowang@redhat.com>
To: "Liu, Yi L" <yi.l.liu@intel.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	Peter Xu <peterx@redhat.com>
Cc: "Lan, Tianyu" <tianyu.lan@intel.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"mst@redhat.com" <mst@redhat.com>,
	"jan.kiszka@siemens.com" <jan.kiszka@siemens.com>,
	"bd.aviv@gmail.com" <bd.aviv@gmail.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>
Subject: Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio devices
Date: Thu, 19 Jan 2017 15:02:37 +0800	[thread overview]
Message-ID: <4103ea92-e713-9114-5be9-7e698fc14631@redhat.com> (raw)
In-Reply-To: <A2975661238FB949B60364EF0F2C257438FF9300@shsmsx102.ccr.corp.intel.com>



On 01/19/2017 14:44, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org]
>> On Behalf Of Tian, Kevin
>> Sent: Wednesday, January 18, 2017 5:39 PM
>> To: Peter Xu <peterx@redhat.com>; Jason Wang <jasowang@redhat.com>
>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Raj, Ashok <ashok.raj@intel.com>;
>> mst@redhat.com; jan.kiszka@siemens.com; bd.aviv@gmail.com; qemu-
>> devel@nongnu.org; alex.williamson@redhat.com
>> Subject: Re: [Qemu-devel] [PATCH RFC v3 14/14] intel_iommu: enable vfio
>> devices
>>
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Wednesday, January 18, 2017 4:46 PM
>>>
>>> On Wed, Jan 18, 2017 at 04:36:05PM +0800, Jason Wang wrote:
>>>>
>>>>> On 01/18/2017 16:11, Peter Xu wrote:
>>>>> On Wed, Jan 18, 2017 at 11:10:53AM +0800, Jason Wang wrote:
>>>>>> On 01/17/2017 22:45, Peter Xu wrote:
>>>>>>> On Mon, Jan 16, 2017 at 05:54:55PM +0800, Jason Wang wrote:
>>>>>>>> On 01/16/2017 17:18, Peter Xu wrote:
>>>>>>>>>>>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>>>>>>>                                         hwaddr addr, uint8_t am)
>>>>>>>>>>>   {
>>>>>>>>>>> @@ -1222,6 +1251,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>>>>>>>>>>>       info.addr = addr;
>>>>>>>>>>>       info.mask = ~((1 << am) - 1);
>>>>>>>>>>>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
>>>>>>>>>>> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
>>>>>>>>>> Is the case of a GLOBAL or DSI flush missed, or do we not care about it at all?
>>>>>>>>> IMHO we don't. For device assignment, since we are having CM=1
>>>>>>>>> here, we should have explicit page invalidations even if guest
>>>>>>>>> sends global/domain invalidations.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -- peterx
>>>>>>>> Is this required by the spec?
>>>>>>> I think not. IMO the spec is very coarse grained on describing
>>>>>>> cache mode...
>>>>>>>
>>>>>>>> Btw, it looks to me that both DSI and GLOBAL are indeed explicit
>>>>>>>> flushes.
>>>>>>> Actually when cache mode is on, it is unclear to me on how we
>>>>>>> should treat domain/global invalidations, at least from the spec
>>>>>>> (as mentioned earlier). My understanding is that they are not
>>>>>>> "explicit flushes", which IMHO should only mean page selective
>>>>>>> IOTLB invalidations.
>>>>>> Probably not, at least from a performance point of view. DSI and
>>>>>> global should be more efficient in some cases.
>>>>> I agree with you that DSI/GLOBAL flushes are more efficient in some
>>>>> way. But IMHO that does not mean these invalidations are "explicit
>>>>> invalidations", and I doubt whether cache mode has to cooperate with them.
>>>> Well, the spec does not forbid DSI/GLOBAL with CM and the driver
>>>> code has used them for almost ten years. I can hardly believe it's wrong.
>>> I think we have misunderstanding here. :)
>>>
>>> I never thought we should not send DSI/GLOBAL invalidations with cache
>>> mode. I just think we should not do anything special even if we have
>>> cache mode on when we receive these signals.
>>>
>>> In the spec, "explicit invalidation" is mentioned in the cache mode
>>> chapter:
>>>
>>>      The Caching Mode (CM) field in Capability Register indicates if
>>>      the hardware implementation caches not-present or erroneous
>>>      translation-structure entries. When the CM field is reported as
>>>      Set, any software updates to any remapping structures (including
>>>      updates to not-present entries or present entries whose
>>>      programming resulted in translation faults) requires explicit
>>>      invalidation of the caches.
>>>
>>> And I thought we were discussing "what is explicit invalidation"
>>> mentioned above.
>> Check 6.5.3.1 Implicit Invalidation on Page Requests
>>
>> 	In addition to the explicit invalidation through invalidation commands
>> 	(see Section 6.5.1 and Section 6.5.2) or through deferred invalidation
>> 	messages (see Section 6.5.4), identified above, Page Requests from
>> 	endpoint devices invalidate entries in the IOTLBs and paging-structure
>> 	caches.
>>
>> My impression is that the above indirectly defines the invalidation commands
>> (PSI/DSI/GLOBAL) as explicit invalidation, because they are explicitly issued by
>> the driver. Then section 6.5.3.1 further describes implicit invalidations caused
>> by other VT-d operations.
>>
>> I will check with VT-d spec owner to clarify.
>>
>>>>> But here I should add one more thing besides PSI - context entry
>>>>> invalidation should be one of "the explicit invalidations" as well,
>>>>> which we need to handle just like PSI when cache mode is on.
>>>>>
>>>>>>>> Just had a quick look through the driver code and found something
>>>>>>>> interesting in intel_iommu_flush_iotlb_psi():
>>>>>>>>
>>>>>>>> ...
>>>>>>>>      /*
>>>>>>>>       * Fallback to domain selective flush if no PSI support or the size is
>>>>>>>>       * too big.
>>>>>>>>       * PSI requires page size to be 2 ^ x, and the base address is
>>>>>>>>       * naturally aligned to the size
>>>>>>>>       */
>>>>>>>>      if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
>>>>>>>>          iommu->flush.flush_iotlb(iommu, did, 0, 0,
>>>>>>>>                          DMA_TLB_DSI_FLUSH);
>>>>>>>>      else
>>>>>>>>          iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
>>>>>>>>                          DMA_TLB_PSI_FLUSH);
>>>>>>>> ...
>>>>>>> I think this is interesting... and I doubt its correctness while
>>>>>>> with cache mode enabled.
>>>>>>>
>>>>>>> If so (sending domain invalidation instead of a big range of page
>>>>>>> invalidations), how should we capture which pages are unmapped in
>>>>>>> the emulated IOMMU?
>>>>>> We don't need to track individual pages here, since all pages for
>>>>>> a specific domain were unmapped, I believe?
>>>>> IMHO this might not be the correct behavior.
>>>>>
>>>>> If we receive one domain specific invalidation, I agree that we
>>>>> should invalidate the IOTLB cache for all the devices inside the domain.
>>>>> However, when cache mode is on, we should be depending on the PSIs
>>>>> to unmap each page (unless we want to unmap the whole address
>>>>> space, in this case it's very possible that the guest is just
>>>>> unmapping a range, not the entire space). If we convert several
>>>>> PSIs into one big DSI, IMHO we will leave those pages
>>>>> mapped/unmapped while we should unmap/map them.
>>>> Confused, do you have an example for this? (I fail to understand why
>>>> DSI can't work; at least an implementation can convert a DSI to several
>>>> PSIs internally).
>>> That's how I understand it. It might be wrong. Btw, could you
>>> elaborate a bit on how can we convert a DSI into several PSIs?
>>>
>>> Thanks,
>> If my understanding above is correct, there is nothing wrong with the above
>> IOMMU driver code - actually it makes sense on bare metal when CM is
>> disabled.
>>
>> But yes, DSI/GLOBAL is far less efficient than PSI when CM is enabled.
>> We rely on cache invalidations to indirectly capture remapping structure changes.
>> PSI provides accurate info, while DSI/GLOBAL doesn't. To emulate the correct
>> behavior of DSI/GLOBAL, we have to pretend that the whole address space
>> (iova=0, size=agaw) needs to be unmapped (for GLOBAL it further means
>> multiple address spaces).
>>
>> Though not efficient, that doesn't mean it's wrong, since the guest driver follows the spec.
>> We can ask for a Linux IOMMU driver change (CC Ashok) to not use the above
>> optimization when cache mode is enabled, but we still need to emulate correct
>> DSI/GLOBAL behavior to follow the spec, because:
>>
>> - even when the driver fix is in place, old versions still have this logic;
>>
>> - there are still scenarios where the guest IOMMU driver does want to invalidate the
>> whole address space, e.g. when changing a context entry. Asking the guest driver to
>> use PSI for such a purpose is another bad thing.
> Hi Kevin/Peter/Jason,
>
> I agree we should think about DSI/GLOBAL. Here, I guess there may be a chance to ignore
> the DSI/GLOBAL flush if the following assumption is correct.
>
> It seems that all DSI/GLOBAL flushes would always come after a context entry invalidation.

Well, it looks like at least for DSI, a flush could also happen when the
size is too big?

>
> With this assumption, I remember Peter added memory_replay in context invalidation.
> This memory_replay would walk the guest second-level page table and do the maps, so the
> second-level page table in the host should have the latest mapping info. The guest
> IOMMU driver would issue a DSI/GLOBAL flush after changing the context. Since the
> mapping info has been updated in the host, there is no need to handle this DSI/GLOBAL flush.
>
> So gentlemen, please help judge whether the assumption is correct. If it is, then Peter's patch
> may just work without special processing of DSI/GLOBAL flushes.
>   
> Regards,
> Yi L

Even if this may be the usual case, I think we'd better not make the
code depend on (usual) guest behavior.

Thanks
