All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)"  <longpeng2@huawei.com>
To: Lu Baolu <baolu.lu@linux.intel.com>,
	"dwmw2@infradead.org" <dwmw2@infradead.org>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"will@kernel.org" <will@kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>
Cc: "iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	chenjiashang <chenjiashang@huawei.com>
Subject: RE: A problem of Intel IOMMU hardware ?
Date: Wed, 17 Mar 2021 09:40:37 +0000	[thread overview]
Message-ID: <a6315a46fadd4171bc69b5e53e3ad67b@huawei.com> (raw)
In-Reply-To: <692186fd-42b8-4054-ead2-f6c6b1bf5b2d@linux.intel.com>

Hi Baolu,

> -----Original Message-----
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Wednesday, March 17, 2021 1:17 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>; dwmw2@infradead.org; joro@8bytes.org;
> will@kernel.org; alex.williamson@redhat.com
> Cc: baolu.lu@linux.intel.com; iommu@lists.linux-foundation.org; LKML
> <linux-kernel@vger.kernel.org>; Gonglei (Arei) <arei.gonglei@huawei.com>;
> chenjiashang <chenjiashang@huawei.com>
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> Hi Longpeng,
> 
> On 3/17/21 11:16 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> wrote:
> > Hi guys,
> >
> > We find the Intel iommu cache (i.e. iotlb) maybe works wrong in a
> > special situation, it would cause DMA fails or get wrong data.
> >
> > The reproducer (based on Alex's vfio testsuite[1]) is in attachment,
> > it can reproduce the problem with high probability (~50%).
> >
> > The machine we used is:
> > processor	: 47
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 85
> > model name	: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
> > stepping	: 4
> > microcode	: 0x2000069
> >
> > And the iommu capability reported is:
> > ver 1:0 cap 8d2078c106f0466 ecap f020df (caching mode = 0 ,
> > page-selective invalidation = 1)
> >
> > (The problem is also on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz'
> > and
> > 'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz')
> >
> > We run the reproducer on Linux 4.18 and it works as follow:
> >
> > Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page
> > mapping)
> 
> I don't understand 2M-hugetlb here means exactly. The IOMMU hardware
> supports both 2M and 1G super page. The mapping physical memory is 4G.
> Why couldn't it use 1G super page?
> 

We use hugetlbfs(support 1G and 2M, but we choose 2M in our case) to request
the memory in userspace: 
vaddr = (unsigned long)mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS | *MAP_HUGETLB*, 0, 0);

Yep, IOMMU support both 2M and 1G superpage, we just haven't to test the 1G case
yet, because our productions use 2M hugetlbfs page.

> > Step 2. DMA Map 4G memory
> > Step 3.
> >      while (1) {
> >          {UNMAP, 0x0, 0xa0000}, ------------------------------------ (a)
> >          {UNMAP, 0xc0000, 0xbff40000},
> 
> Have these two ranges been mapped before? Does the IOMMU driver complains
> when you trying to unmap a range which has never been mapped? The IOMMU
> driver implicitly assumes that mapping and unmapping are paired.
> 

Of course yes, please Step-2, we DMA mapped all the memory(4G) before the while loop.
The driver never complained during MAP and UNMAP operations.

> >          {MAP,   0x0, 0xc0000000}, --------------------------------- (b)
> >                  use GDB to pause at here, and then DMA read IOVA=0,
> 
> IOVA 0 seems to be a special one. Have you verified with other addresses than
> IOVA 0?
> 

Yes, we also test IOVA=0x1000, it has problem too.

But one of the differeces between (0x0, 0xa0000) and (0x0, 0xc0000000) is the former
can only use 4K mapping in DMA pagetable but the latter uses 2M mapping. Is it possible
the hardware cache management works something wrong in this case?

> >                  sometimes DMA success (as expected),
> >                  but sometimes DMA error (report not-present).
> >          {UNMAP, 0x0, 0xc0000000}, --------------------------------- (c)
> >          {MAP,   0x0, 0xa0000},
> >          {MAP,   0xc0000, 0xbff40000},
> >      }
> >
> > The DMA read operations sholud success between (b) and (c), it should
> > NOT report not-present at least!
> >
> > After analysis the problem, we think maybe it's caused by the Intel iommu iotlb.
> > It seems the DMA Remapping hardware still uses the IOTLB or other caches of
> (a).
> >
> > When do DMA unmap at (a), the iotlb will be flush:
> >      intel_iommu_unmap
> >          domain_unmap
> >              iommu_flush_iotlb_psi
> >
> > When do DMA map at (b), no need to flush the iotlb according to the
> > capability of this iommu:
> >      intel_iommu_map
> >          domain_pfn_mapping
> >              domain_mapping
> >                  __mapping_notify_one
> >                      if (cap_caching_mode(iommu->cap)) // FALSE
> >                          iommu_flush_iotlb_psi
> 
> That's true. The iotlb flushing is not needed in case of PTE been changed from
> non-present to present unless caching mode.
> 

Yes, I also think the driver code is correct. But it's so confused that the problem
is disappear if we force it to flush here.

> > But the problem will disappear if we FORCE flush here. So we suspect
> > the iommu hardware.
> >
> > Do you have any suggestion ?
> 
> Best regards,
> baolu

WARNING: multiple messages have this Message-ID (diff)
From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)" <longpeng2@huawei.com>
To: Lu Baolu <baolu.lu@linux.intel.com>,
	"dwmw2@infradead.org" <dwmw2@infradead.org>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"will@kernel.org" <will@kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>
Cc: chenjiashang <chenjiashang@huawei.com>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	"Gonglei \(Arei\)" <arei.gonglei@huawei.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: RE: A problem of Intel IOMMU hardware ?
Date: Wed, 17 Mar 2021 09:40:37 +0000	[thread overview]
Message-ID: <a6315a46fadd4171bc69b5e53e3ad67b@huawei.com> (raw)
In-Reply-To: <692186fd-42b8-4054-ead2-f6c6b1bf5b2d@linux.intel.com>

Hi Baolu,

> -----Original Message-----
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Wednesday, March 17, 2021 1:17 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>; dwmw2@infradead.org; joro@8bytes.org;
> will@kernel.org; alex.williamson@redhat.com
> Cc: baolu.lu@linux.intel.com; iommu@lists.linux-foundation.org; LKML
> <linux-kernel@vger.kernel.org>; Gonglei (Arei) <arei.gonglei@huawei.com>;
> chenjiashang <chenjiashang@huawei.com>
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> Hi Longpeng,
> 
> On 3/17/21 11:16 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> wrote:
> > Hi guys,
> >
> > We find the Intel iommu cache (i.e. iotlb) maybe works wrong in a
> > special situation, it would cause DMA fails or get wrong data.
> >
> > The reproducer (based on Alex's vfio testsuite[1]) is in attachment,
> > it can reproduce the problem with high probability (~50%).
> >
> > The machine we used is:
> > processor	: 47
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 85
> > model name	: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
> > stepping	: 4
> > microcode	: 0x2000069
> >
> > And the iommu capability reported is:
> > ver 1:0 cap 8d2078c106f0466 ecap f020df (caching mode = 0 ,
> > page-selective invalidation = 1)
> >
> > (The problem is also on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz'
> > and
> > 'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz')
> >
> > We run the reproducer on Linux 4.18 and it works as follow:
> >
> > Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page
> > mapping)
> 
> I don't understand 2M-hugetlb here means exactly. The IOMMU hardware
> supports both 2M and 1G super page. The mapping physical memory is 4G.
> Why couldn't it use 1G super page?
> 

We use hugetlbfs(support 1G and 2M, but we choose 2M in our case) to request
the memory in userspace: 
vaddr = (unsigned long)mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS | *MAP_HUGETLB*, 0, 0);

Yep, IOMMU support both 2M and 1G superpage, we just haven't to test the 1G case
yet, because our productions use 2M hugetlbfs page.

> > Step 2. DMA Map 4G memory
> > Step 3.
> >      while (1) {
> >          {UNMAP, 0x0, 0xa0000}, ------------------------------------ (a)
> >          {UNMAP, 0xc0000, 0xbff40000},
> 
> Have these two ranges been mapped before? Does the IOMMU driver complains
> when you trying to unmap a range which has never been mapped? The IOMMU
> driver implicitly assumes that mapping and unmapping are paired.
> 

Of course yes, please Step-2, we DMA mapped all the memory(4G) before the while loop.
The driver never complained during MAP and UNMAP operations.

> >          {MAP,   0x0, 0xc0000000}, --------------------------------- (b)
> >                  use GDB to pause at here, and then DMA read IOVA=0,
> 
> IOVA 0 seems to be a special one. Have you verified with other addresses than
> IOVA 0?
> 

Yes, we also test IOVA=0x1000, it has problem too.

But one of the differeces between (0x0, 0xa0000) and (0x0, 0xc0000000) is the former
can only use 4K mapping in DMA pagetable but the latter uses 2M mapping. Is it possible
the hardware cache management works something wrong in this case?

> >                  sometimes DMA success (as expected),
> >                  but sometimes DMA error (report not-present).
> >          {UNMAP, 0x0, 0xc0000000}, --------------------------------- (c)
> >          {MAP,   0x0, 0xa0000},
> >          {MAP,   0xc0000, 0xbff40000},
> >      }
> >
> > The DMA read operations sholud success between (b) and (c), it should
> > NOT report not-present at least!
> >
> > After analysis the problem, we think maybe it's caused by the Intel iommu iotlb.
> > It seems the DMA Remapping hardware still uses the IOTLB or other caches of
> (a).
> >
> > When do DMA unmap at (a), the iotlb will be flush:
> >      intel_iommu_unmap
> >          domain_unmap
> >              iommu_flush_iotlb_psi
> >
> > When do DMA map at (b), no need to flush the iotlb according to the
> > capability of this iommu:
> >      intel_iommu_map
> >          domain_pfn_mapping
> >              domain_mapping
> >                  __mapping_notify_one
> >                      if (cap_caching_mode(iommu->cap)) // FALSE
> >                          iommu_flush_iotlb_psi
> 
> That's true. The iotlb flushing is not needed in case of PTE been changed from
> non-present to present unless caching mode.
> 

Yes, I also think the driver code is correct. But it's so confused that the problem
is disappear if we force it to flush here.

> > But the problem will disappear if we FORCE flush here. So we suspect
> > the iommu hardware.
> >
> > Do you have any suggestion ?
> 
> Best regards,
> baolu
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

  reply	other threads:[~2021-03-17  9:41 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-17  3:16 A problem of Intel IOMMU hardware ? Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-17  3:16 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-17  5:16 ` Lu Baolu
2021-03-17  5:16   ` Lu Baolu
2021-03-17  9:40   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.) [this message]
2021-03-17  9:40     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-17 15:18   ` Alex Williamson
2021-03-17 15:18     ` Alex Williamson
2021-03-18  2:58     ` Lu Baolu
2021-03-18  2:58       ` Lu Baolu
2021-03-18  4:46       ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  4:46         ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  7:48         ` Nadav Amit
2021-03-18  7:48           ` Nadav Amit
2021-03-17  5:46 ` Nadav Amit
2021-03-17  5:46   ` Nadav Amit
2021-03-17  9:35   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-17  9:35     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-17 18:12     ` Nadav Amit
2021-03-17 18:12       ` Nadav Amit
2021-03-18  3:03       ` Lu Baolu
2021-03-18  3:03         ` Lu Baolu
2021-03-18  8:20       ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  8:20         ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  8:27         ` Tian, Kevin
2021-03-18  8:27           ` Tian, Kevin
2021-03-18  8:38           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  8:38             ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  8:43             ` Tian, Kevin
2021-03-18  8:43               ` Tian, Kevin
2021-03-18  8:54               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  8:54                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  8:56             ` Tian, Kevin
2021-03-18  8:56               ` Tian, Kevin
2021-03-18  9:25               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18  9:25                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-18 16:46                 ` Nadav Amit
2021-03-18 16:46                   ` Nadav Amit
2021-03-21 23:51                   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-21 23:51                     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-22  0:27                   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-22  0:27                     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-03-27  2:31                   ` Lu Baolu
2021-03-27  2:31                     ` Lu Baolu
2021-03-27  4:36                     ` Nadav Amit
2021-03-27  4:36                       ` Nadav Amit
2021-03-27  5:27                       ` Lu Baolu
2021-03-27  5:27                         ` Lu Baolu
2021-03-19  0:15               ` Lu Baolu
2021-03-19  0:15                 ` Lu Baolu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a6315a46fadd4171bc69b5e53e3ad67b@huawei.com \
    --to=longpeng2@huawei.com \
    --cc=alex.williamson@redhat.com \
    --cc=arei.gonglei@huawei.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=chenjiashang@huawei.com \
    --cc=dwmw2@infradead.org \
    --cc=iommu@lists.linux-foundation.org \
    --cc=joro@8bytes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.