From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)"  <longpeng2@huawei.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: "Tian, Kevin" <kevin.tian@intel.com>,
	chenjiashang <chenjiashang@huawei.com>,
	David Woodhouse <dwmw2@infradead.org>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	"will@kernel.org" <will@kernel.org>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	Joerg Roedel <joro@8bytes.org>
Subject: RE: A problem of Intel IOMMU hardware ？
Date: Sun, 21 Mar 2021 23:51:26 +0000
Message-ID: <ac1e9b4699c4438f80ab771e5fbb4ee9@huawei.com>
In-Reply-To: <55E334BA-C6D2-4892-9207-32654FBF4360@gmail.com>

Hi Nadav,

> -----Original Message-----
> From: Nadav Amit [mailto:nadav.amit@gmail.com]
> Sent: Friday, March 19, 2021 12:46 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>
> Cc: Tian, Kevin <kevin.tian@intel.com>; chenjiashang
> <chenjiashang@huawei.com>; David Woodhouse <dwmw2@infradead.org>;
> iommu@lists.linux-foundation.org; LKML <linux-kernel@vger.kernel.org>;
> alex.williamson@redhat.com; Gonglei (Arei) <arei.gonglei@huawei.com>;
> will@kernel.org
> Subject: Re: A problem of Intel IOMMU hardware ？
> 
> 
> 
> > On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure Service
> > Product Dept.) <longpeng2@huawei.com> wrote:
> >
> >
> >
> >> -----Original Message-----
> >> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> >> Sent: Thursday, March 18, 2021 4:56 PM
> >> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >> <longpeng2@huawei.com>; Nadav Amit <nadav.amit@gmail.com>
> >> Cc: chenjiashang <chenjiashang@huawei.com>; David Woodhouse
> >> <dwmw2@infradead.org>; iommu@lists.linux-foundation.org; LKML
> >> <linux-kernel@vger.kernel.org>; alex.williamson@redhat.com; Gonglei
> >> (Arei) <arei.gonglei@huawei.com>; will@kernel.org
> >> Subject: RE: A problem of Intel IOMMU hardware ？
> >>
> >>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >>> <longpeng2@huawei.com>
> >>>
> >>>> -----Original Message-----
> >>>> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> >>>> Sent: Thursday, March 18, 2021 4:27 PM
> >>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >>>> <longpeng2@huawei.com>; Nadav Amit <nadav.amit@gmail.com>
> >>>> Cc: chenjiashang <chenjiashang@huawei.com>; David Woodhouse
> >>>> <dwmw2@infradead.org>; iommu@lists.linux-foundation.org; LKML
> >>>> <linux-kernel@vger.kernel.org>; alex.williamson@redhat.com; Gonglei
> >>>> (Arei) <arei.gonglei@huawei.com>; will@kernel.org
> >>>> Subject: RE: A problem of Intel IOMMU hardware ？
> >>>>
> >>>>> From: iommu <iommu-bounces@lists.linux-foundation.org> On Behalf
> >>>>> Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >>>>>
> >>>>>> 2. Consider ensuring that the problem is not somehow related to
> >>>>>> queued invalidations. Try to use __iommu_flush_iotlb() instead of
> >>>>>> qi_flush_iotlb().
> >>>>>>
> >>>>>
> >>>>> I tried to force to use __iommu_flush_iotlb(), but maybe something
> >>>>> wrong, the system crashed, so I prefer to lower the priority of
> >>>>> this operation.
> >>>>>
> >>>>
> >>>> The VT-d spec clearly says that register-based invalidation can be
> >>>> used only when queued-invalidations are not enabled. Intel-IOMMU
> >>>> driver doesn't provide an option to disable queued-invalidation
> >>>> though, when the hardware is capable. If you really want to try,
> >>>> tweak the code in intel_iommu_init_qi.
> >>>>
> >>>
> >>> Hi Kevin,
> >>>
> >>> Thanks to point out this. Do you have any ideas about this problem ?
> >>> I tried to descript the problem much clear in my reply to Alex, hope
> >>> you could have a look if you're interested.
> >>>
> >>
> >> btw I saw you used 4.18 kernel in this test. What about latest kernel?
> >>
> >
> > Not test yet. It's hard to upgrade kernel in our environment.
> >
> >> Also one way to separate sw/hw bug is to trace the low level
> >> interface (e.g.,
> >> qi_flush_iotlb) which actually sends invalidation descriptors to the
> >> IOMMU hardware. Check the window between b) and c) and see whether
> >> the software does the right thing as expected there.
> >>
> >
> > We add some log in iommu driver these days, the software seems fine.
> > But we didn't look inside the qi_submit_sync yet, I'll try it tonight.
> 
> So here is my guess:
> 
> Intel probably used as a basis for the IOTLB an implementation of some other
> (regular) TLB design.
> 
> Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):
> 
> “Software wishing to prevent this uncertainty should not write to a
> paging-structure entry in a way that would change, for any linear address, both the
> page size and either the page frame, access rights, or other attributes.”
> 
> 
> Now the aforementioned uncertainty is a bit different (multiple
> *valid* translations of a single address). Yet, perhaps this is yet another thing that
> might happen.
> 
> From a brief look on the handling of MMU (not IOMMU) hugepages in Linux, indeed
> the PMD is first cleared and flushed before a new valid PMD is set. This is possible
> for MMUs since they allow the software to handle spurious page-faults gracefully.
> This is not the case for the IOMMU though (without PRI).
> 

But in my case, flush_iotlb is called after the range (0x0, 0xa0000) is unmapped.
I have no idea why this invalidation isn't effective. I haven't looked inside the qi
code yet, but there are no complaints from the driver.

Could you please point me to the MMU code you mentioned above? In the MMU code, is it
possible that all the entries of a PTE-level page table are not-present while the PMD
entry that points to that table is still present?

*Page table after (0x0, 0xa0000) is unmapped:
PML4: 0x      1a34fbb003
  PDPE: 0x      1a34fbb003
   PDE: 0x      1a34fbf003
    PTE: 0x               0

*Page table after (0x0, 0xc0000000) is mapped:
PML4: 0x      1a34fbb003
  PDPE: 0x      1a34fbb003
   PDE: 0x       15ec00883
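
For anyone reading the two dumps, here is a rough sketch of how the low bits of
the entries can be decoded. It assumes the VT-d second-level paging-entry layout
(bit 0 read, bit 1 write, bit 7 page size, bit 11 snoop), which matches the
DMA_PTE_* bits used by the Linux driver. The helper name decode() and the level
numbering are made up for illustration only.

```python
# Bit positions assumed from the VT-d second-level paging-entry layout.
READ = 1 << 0
WRITE = 1 << 1
LARGE_PAGE = 1 << 7   # PS bit: entry maps a superpage instead of a table
SNP = 1 << 11         # snoop behavior
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF


def decode(entry, level):
    """Decode one entry. level: 1 = PTE, 2 = PDE, 3 = PDPE, 4 = PML4.

    Hypothetical helper for reading the dumps above, not driver code.
    """
    present = bool(entry & (READ | WRITE))
    large = present and bool(entry & LARGE_PAGE) and level in (2, 3)
    leaf = present and (level == 1 or large)
    # A superpage at level N covers 4 KiB << (9 * (N - 1)).
    shift = 12 + 9 * (level - 1) if large else 12
    addr = entry & ADDR_MASK & ~((1 << shift) - 1)
    return {
        "present": present,
        "leaf": leaf,
        "large": large,
        "snoop": bool(entry & SNP),
        "addr": addr,
    }


old_pde = decode(0x1A34FBF003, 2)   # PDE after (0x0, 0xa0000) is unmapped
new_pde = decode(0x15EC00883, 2)    # PDE after (0x0, 0xc0000000) is mapped
old_pte = decode(0x0, 1)            # PTE after the unmap

for name, d in (("old PDE", old_pde), ("new PDE", new_pde), ("old PTE", old_pte)):
    print(name, {k: (hex(v) if k == "addr" else v) for k, v in d.items()})
```

Read this way, the PDE after the unmap is still a present, non-leaf entry that
points to a page table at 0x1a34fbf000 whose PTE is clear, while the PDE after
the map is a 2M leaf entry (PS bit set) for the frame at 0x15ec00000.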

> Not sure this explains everything though. If that is the problem, then during a
> mapping that changes page-sizes, a TLB flush is needed, similarly to the one
> Longpeng did manually.
>
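
To make the hypothesis above concrete, here is a toy model of it. It is only a
sketch and says nothing about how real VT-d hardware is built: the class
ToyIommu, its pde_cache and the flush() method are all made up. It assumes that
the walker keeps cached copies of PDEs and that a faulting walk can re-fill that
cache even after the unmap-time flush.

```python
class ToyIommu:
    """Toy page walker with a PDE cache (illustration only, not real hardware)."""

    def __init__(self):
        self.pde = {}        # PDE index -> ("table", {pte_index: frame}) or ("huge", base)
        self.pde_cache = {}  # copies of PDEs the walker has already seen

    def translate(self, iova):
        idx = iova >> 21
        entry = self.pde_cache.get(idx)
        if entry is None:
            entry = self.pde.get(idx)
            if entry is None:
                return None              # PDE not present: fault
            self.pde_cache[idx] = entry  # the walk caches the PDE it used
        kind, payload = entry
        if kind == "huge":
            return payload + (iova & 0x1FFFFF)
        frame = payload.get((iova >> 12) & 0x1FF)
        return None if frame is None else frame + (iova & 0xFFF)

    def flush(self):
        self.pde_cache.clear()


iommu = ToyIommu()
table = {0: 0x5000}
iommu.pde[0] = ("table", table)          # 4K mapping behind a non-leaf PDE
before_unmap = iommu.translate(0x10)

table.clear()                            # unmap the 4K page, PDE stays present
iommu.flush()                            # flush after unmap, as in my case
after_unmap = iommu.translate(0x10)      # faults, but re-caches the old PDE

iommu.pde[0] = ("huge", 0x15EC00000)     # map again with a 2M page, no flush
stale = iommu.translate(0x10)            # walker still follows the cached PDE

iommu.flush()                            # flush when the page size changes
fixed = iommu.translate(0x10)

print(before_unmap, after_unmap, stale, fixed)
```

In this toy, the access after the superpage is installed still faults because
the cached PDE points to the now-empty page table, and it only succeeds once the
cache is flushed at mapping time. Whether the hardware really behaves like this
is exactly the open question.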