[RFC] iommu/amd: fix a race in fetch_pte()
diff mbox series

Message ID 20200407021246.10941-1-cai@lca.pw
State New
Headers show
Series
  • [RFC] iommu/amd: fix a race in fetch_pte()
Related show

Commit Message

Qian Cai April 7, 2020, 2:12 a.m. UTC
fetch_pte() could race with increase_address_space() because it held no
lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
see a stale domain->pt_root and a new increased domain->mode from
increase_address_space(). As the result, it could trigger invalid
accesses later on. Fix it by using a pair of smp_[w|r]mb in those
places.

 kernel BUG at drivers/iommu/amd_iommu.c:1704!
 BUG_ON(unmapped && !is_power_of_2(unmapped));
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
 RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
 Call Trace:
  <IRQ>
  __iommu_unmap+0x106/0x320
  iommu_unmap_fast+0xe/0x10
  __iommu_dma_unmap+0xdc/0x1a0
  iommu_dma_unmap_sg+0xae/0xd0
  scsi_dma_unmap+0xe7/0x150
  pqi_raid_io_complete+0x37/0x60 [smartpqi]
  pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
  __handle_irq_event_percpu+0x78/0x4f0
  handle_irq_event_percpu+0x70/0x100
  handle_irq_event+0x5a/0x8b
  handle_edge_irq+0x10c/0x370
  do_IRQ+0x9e/0x1e0
  common_interrupt+0xf/0xf
  </IRQ>

Signed-off-by: Qian Cai <cai@lca.pw>
---
 drivers/iommu/amd_iommu.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Qian Cai April 7, 2020, 3:36 p.m. UTC | #1
> On Apr 6, 2020, at 10:12 PM, Qian Cai <cai@lca.pw> wrote:
> 
> fetch_pte() could race with increase_address_space() because it held no
> lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
> see a stale domain->pt_root and a new increased domain->mode from
> increase_address_space(). As the result, it could trigger invalid
> accesses later on. Fix it by using a pair of smp_[w|r]mb in those
> places.

After further testing, the change along is insufficient. What I am chasing right
now is the swap device will go offline after heavy memory pressure below. The
symptom is similar to what we have in the commit,

754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)

Apparently, it is no possible to take the domain->lock in fetch_pte() because it
could sleep.

Thoughts?

[ 7292.727377][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd80000 flags=0x0010]
[ 7292.740571][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81000 flags=0x0010]
[ 7292.753268][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81900 flags=0x0010]
[ 7292.766078][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81d00 flags=0x0010]
[ 7292.778447][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82000 flags=0x0010]
[ 7292.790724][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82400 flags=0x0010]
[ 7292.803195][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82800 flags=0x0010]
[ 7292.815664][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82c00 flags=0x0010]
[ 7292.828089][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83000 flags=0x0010]
[ 7292.840468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83400 flags=0x0010]
[ 7292.852795][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83800 flags=0x0010]
[ 7292.864566][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83c00 flags=0x0010]
[ 7326.583908][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803
[ 7326.592386][    C8] smartpqi 0000:23:00.0: controller offline
[ 7326.663728][   C66] sd 0:1:0:0: [sda] tag#467 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=6s
[ 7326.664321][ T2738] smartpqi 0000:23:00.0: resetting scsi 0:1:0:0
[ 7326.673849][   C66] sd 0:1:0:0: [sda] tag#467 CDB: opcode=0x28 28 00 02 ee 2e 60 00 00 08 00
[ 7326.680118][ T2738] smartpqi 0000:23:00.0: reset of scsi 0:1:0:0: FAILED
[ 7326.688612][   C66] blk_update_request: I/O error, dev sda, sector 49163872 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 7326.695456][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.706632][   C66] Read-error on swap-device (254:1:45833824)
[ 7326.714030][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.723382][T45523] sd 0:1:0:0: rejecting I/O to offline device
[ 7326.727352][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.727379][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery
[ 7326.727442][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery

> 
> kernel BUG at drivers/iommu/amd_iommu.c:1704!
> BUG_ON(unmapped && !is_power_of_2(unmapped));
> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
> Call Trace:
>  <IRQ>
>  __iommu_unmap+0x106/0x320
>  iommu_unmap_fast+0xe/0x10
>  __iommu_dma_unmap+0xdc/0x1a0
>  iommu_dma_unmap_sg+0xae/0xd0
>  scsi_dma_unmap+0xe7/0x150
>  pqi_raid_io_complete+0x37/0x60 [smartpqi]
>  pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
>  __handle_irq_event_percpu+0x78/0x4f0
>  handle_irq_event_percpu+0x70/0x100
>  handle_irq_event+0x5a/0x8b
>  handle_edge_irq+0x10c/0x370
>  do_IRQ+0x9e/0x1e0
>  common_interrupt+0xf/0xf
>  </IRQ>
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
> drivers/iommu/amd_iommu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..22328a23335f 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain,
> 	*pte             = PM_LEVEL_PDE(domain->mode,
> 					iommu_virt_to_phys(domain->pt_root));
> 	domain->pt_root  = pte;
> +	/*
> +	 * Make sure fetch_pte() will see the new domain->pt_root before it
> +	 * snapshots domain->mode.
> +	 */
> +	smp_wmb();
> 	domain->mode    += 1;
> 
> 	ret = true;
> @@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
> 		*updated = increase_address_space(domain, address, gfp) || *updated;
> 
> 	level   = domain->mode - 1;
> +	/* To pair with smp_wmb() in increase_address_space(). */
> +	smp_rmb();
> 	pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> 	address = PAGE_SIZE_ALIGN(address, page_size);
> 	end_lvl = PAGE_SIZE_LEVEL(page_size);
> @@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
> 		return NULL;
> 
> 	level	   =  domain->mode - 1;
> +	/* To pair with smp_wmb() in increase_address_space(). */
> +	smp_rmb();
> 	pte	   = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
> 
> -- 
> 2.21.0 (Apple Git-122.2)
>
Joerg Roedel April 8, 2020, 2:19 p.m. UTC | #2
Hi Qian,

On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
> After further testing, the change along is insufficient. What I am chasing right
> now is the swap device will go offline after heavy memory pressure below. The
> symptom is similar to what we have in the commit,
> 
> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
> 
> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
> could sleep.

Thanks a lot for finding and tracking down another race in the AMD IOMMU
page-table code.  The domain->lock is a spin-lock and taking it can't
sleep. But fetch_pte() is a fast-path and must not take any locks.

I think the best fix is to update the pt_root and mode of the domain
atomically by storing the mode in the lower 12 bits of pt_root. This way
they are stored together and can be read/write atomically.

Regards,

	Joerg
Qian Cai April 14, 2020, 1:36 a.m. UTC | #3
> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> Hi Qian,
> 
> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>> After further testing, the change along is insufficient. What I am chasing right
>> now is the swap device will go offline after heavy memory pressure below. The
>> symptom is similar to what we have in the commit,
>> 
>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>> 
>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>> could sleep.
> 
> Thanks a lot for finding and tracking down another race in the AMD IOMMU
> page-table code.  The domain->lock is a spin-lock and taking it can't
> sleep. But fetch_pte() is a fast-path and must not take any locks.
> 
> I think the best fix is to update the pt_root and mode of the domain
> atomically by storing the mode in the lower 12 bits of pt_root. This way
> they are stored together and can be read/write atomically.

Like this?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..b36c6b07cbfd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode,
 
 static void free_pagetable(struct protection_domain *domain)
 {
-	unsigned long root = (unsigned long)domain->pt_root;
+	int level = iommu_get_mode(domain->pt_root);
+	unsigned long root = iommu_get_root(domain->pt_root);
 	struct page *freelist = NULL;
 
-	BUG_ON(domain->mode < PAGE_MODE_NONE ||
-	       domain->mode > PAGE_MODE_6_LEVEL);
+	BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
 
-	freelist = free_sub_pt(root, domain->mode, freelist);
+	freelist = free_sub_pt(root, level, freelist);
 
 	free_page_list(freelist);
 }
@@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain,
 				   unsigned long address,
 				   gfp_t gfp)
 {
+	int level;
 	unsigned long flags;
 	bool ret = false;
 	u64 *pte;
 
 	spin_lock_irqsave(&domain->lock, flags);
 
-	if (address <= PM_LEVEL_SIZE(domain->mode) ||
-	    WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+	level = iommu_get_mode(domain->pt_root);
+
+	if (address <= PM_LEVEL_SIZE(level) ||
+	    WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
 		goto out;
 
 	pte = (void *)get_zeroed_page(gfp);
 	if (!pte)
 		goto out;
 
-	*pte             = PM_LEVEL_PDE(domain->mode,
-					iommu_virt_to_phys(domain->pt_root));
-	domain->pt_root  = pte;
-	domain->mode    += 1;
+	*pte = PM_LEVEL_PDE(level,
+		iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
+
+	WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
 
 	ret = true;
 
@@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
 		      bool *updated)
 {
 	int level, end_lvl;
-	u64 *pte, *page;
+	u64 *pte, *page, *pt_root, *root;
 
 	BUG_ON(!is_power_of_2(page_size));
 
-	while (address > PM_LEVEL_SIZE(domain->mode))
+	while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
 		*updated = increase_address_space(domain, address, gfp) || *updated;
 
-	level   = domain->mode - 1;
-	pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+	pt_root = READ_ONCE(domain->pt_root);
+	root    = (void *)iommu_get_root(pt_root);
+	level   = iommu_get_mode(pt_root) - 1;
+	pte     = &root[PM_LEVEL_INDEX(level, address)];
 	address = PAGE_SIZE_ALIGN(address, page_size);
 	end_lvl = PAGE_SIZE_LEVEL(page_size);
 
@@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
 		      unsigned long address,
 		      unsigned long *page_size)
 {
-	int level;
 	u64 *pte;
+	u64 *pt_root = READ_ONCE(domain->pt_root);
+	u64 *root    = (void *)iommu_get_root(pt_root);
+	int level    = iommu_get_mode(pt_root);
 
 	*page_size = 0;
 
-	if (address > PM_LEVEL_SIZE(domain->mode))
+	if (address > PM_LEVEL_SIZE(level))
 		return NULL;
 
-	level	   =  domain->mode - 1;
-	pte	   = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+	level--;
+	pte	   = &root[PM_LEVEL_INDEX(level, address)];
 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
 
 	while (level > 0) {
@@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void)
 	if (protection_domain_init(domain))
 		goto free_domain;
 
-	domain->mode = PAGE_MODE_3_LEVEL;
 	domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
 	domain->flags = PD_DMA_OPS_MASK;
 	if (!domain->pt_root)
 		goto free_domain;
 
+	domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL);
+
 	if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
 		goto free_domain;
 
@@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain,
 	u64 flags = 0;
 	u32 old_domid;
 
-	if (domain->mode != PAGE_MODE_NONE)
-		pte_root = iommu_virt_to_phys(domain->pt_root);
+	if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
+		pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root));
 
-	pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
+	pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK)
 		    << DEV_ENTRY_MODE_SHIFT;
 	pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
 
@@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
 		if (!pdomain)
 			return NULL;
 
-		pdomain->mode    = PAGE_MODE_3_LEVEL;
 		pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
 		if (!pdomain->pt_root) {
 			protection_domain_free(pdomain);
 			return NULL;
 		}
 
+		pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
+						  PAGE_MODE_3_LEVEL);
 		pdomain->domain.geometry.aperture_start = 0;
 		pdomain->domain.geometry.aperture_end   = ~0ULL;
 		pdomain->domain.geometry.force_aperture = true;
@@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
 		if (!pdomain)
 			return NULL;
 
-		pdomain->mode = PAGE_MODE_NONE;
+		pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
+						  PAGE_MODE_NONE);
 		break;
 	default:
 		return NULL;
@@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
 		dma_ops_domain_free(domain);
 		break;
 	default:
-		if (domain->mode != PAGE_MODE_NONE)
+		if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
 			free_pagetable(domain);
 
 		if (domain->flags & PD_IOMMUV2_MASK)
@@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 	int prot = 0;
 	int ret;
 
-	if (domain->mode == PAGE_MODE_NONE)
+	if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
 		return -EINVAL;
 
 	if (iommu_prot & IOMMU_READ)
@@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 {
 	struct protection_domain *domain = to_pdomain(dom);
 
-	if (domain->mode == PAGE_MODE_NONE)
+	if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
 		return 0;
 
 	return iommu_unmap_page(domain, iova, page_size);
@@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
 	unsigned long offset_mask, pte_pgsize;
 	u64 *pte, __pte;
 
-	if (domain->mode == PAGE_MODE_NONE)
+	if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
 		return iova;
 
 	pte = fetch_pte(domain, iova, &pte_pgsize);
@@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom)
 	spin_lock_irqsave(&domain->lock, flags);
 
 	/* Update data structure */
-	domain->mode    = PAGE_MODE_NONE;
+	domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE);
 
 	/* Make changes visible to IOMMUs */
 	update_domain(domain);
@@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
 {
 	u64 *pte;
 
-	if (domain->mode != PAGE_MODE_NONE)
+	if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
 		return -EINVAL;
 
 	pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
@@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid)
 {
 	u64 *pte;
 
-	if (domain->mode != PAGE_MODE_NONE)
+	if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
 		return -EINVAL;
 
 	pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h
index 92c2ba6468a0..34d4dd66cf9b 100644
--- a/drivers/iommu/amd_iommu_proto.h
+++ b/drivers/iommu/amd_iommu_proto.h
@@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid,
 				  int status, int tag);
 
+static inline int iommu_get_mode(void *pt_root)
+{
+	return (unsigned long)pt_root & ~PAGE_MASK;
+}
+
+static inline unsigned long iommu_get_root(void *pt_root)
+{
+	return (unsigned long)pt_root & PAGE_MASK;
+}
+
+static inline void *iommu_set_mode(void *pt_root, int mode)
+{
+	return (void *)(iommu_get_root(pt_root) + mode);
+}
+
 static inline bool is_rd890_iommu(struct pci_dev *pdev)
 {
 	return (pdev->vendor == PCI_VENDOR_ID_ATI) &&
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index ca8c4522045b..221adefa56a0 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -468,8 +468,8 @@ struct protection_domain {
 				       iommu core code */
 	spinlock_t lock;	/* mostly used to lock the page table*/
 	u16 id;			/* the domain id written to the device table */
-	int mode;		/* paging mode (0-6 levels) */
-	u64 *pt_root;		/* page table root pointer */
+	u64 *pt_root;		/* page table root pointer and paging mode (0-6
+				   levels) */
 	int glx;		/* Number of levels for GCR3 table */
 	u64 *gcr3_tbl;		/* Guest CR3 table */
 	unsigned long flags;	/* flags to find out type of domain */
Qian Cai April 17, 2020, 1:42 a.m. UTC | #4
> On Apr 13, 2020, at 9:36 PM, Qian Cai <cai@lca.pw> wrote:
> 
> 
> 
>> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> 
>> Hi Qian,
>> 
>> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>>> After further testing, the change along is insufficient. What I am chasing right
>>> now is the swap device will go offline after heavy memory pressure below. The
>>> symptom is similar to what we have in the commit,
>>> 
>>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>> 
>>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>>> could sleep.
>> 
>> Thanks a lot for finding and tracking down another race in the AMD IOMMU
>> page-table code.  The domain->lock is a spin-lock and taking it can't
>> sleep. But fetch_pte() is a fast-path and must not take any locks.
>> 
>> I think the best fix is to update the pt_root and mode of the domain
>> atomically by storing the mode in the lower 12 bits of pt_root. This way
>> they are stored together and can be read/write atomically.
> 
> Like this?

So, this is still not enough that would still trigger storage driver offline under
memory pressure for a bit longer. It looks to me that in fetch_pte() there are
could still racy?

	level	   =  domain->mode - 1;
	pte	   = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
							<— increase_address_space();
	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
	
	while (level > 0) {
		*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then in iommu_unmap_page(),

	while (unmapped < page_size) {
		pte = fetch_pte(dom, bus_addr, &unmap_size);
		…
		bus_addr  = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be unsync with dom->mode when it enter fetch_pte() again.
Could that be a problem?

> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 20cce366e951..b36c6b07cbfd 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode,
> 
> static void free_pagetable(struct protection_domain *domain)
> {
> -	unsigned long root = (unsigned long)domain->pt_root;
> +	int level = iommu_get_mode(domain->pt_root);
> +	unsigned long root = iommu_get_root(domain->pt_root);
> 	struct page *freelist = NULL;
> 
> -	BUG_ON(domain->mode < PAGE_MODE_NONE ||
> -	       domain->mode > PAGE_MODE_6_LEVEL);
> +	BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
> 
> -	freelist = free_sub_pt(root, domain->mode, freelist);
> +	freelist = free_sub_pt(root, level, freelist);
> 
> 	free_page_list(freelist);
> }
> @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain,
> 				   unsigned long address,
> 				   gfp_t gfp)
> {
> +	int level;
> 	unsigned long flags;
> 	bool ret = false;
> 	u64 *pte;
> 
> 	spin_lock_irqsave(&domain->lock, flags);
> 
> -	if (address <= PM_LEVEL_SIZE(domain->mode) ||
> -	    WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
> +	level = iommu_get_mode(domain->pt_root);
> +
> +	if (address <= PM_LEVEL_SIZE(level) ||
> +	    WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
> 		goto out;
> 
> 	pte = (void *)get_zeroed_page(gfp);
> 	if (!pte)
> 		goto out;
> 
> -	*pte             = PM_LEVEL_PDE(domain->mode,
> -					iommu_virt_to_phys(domain->pt_root));
> -	domain->pt_root  = pte;
> -	domain->mode    += 1;
> +	*pte = PM_LEVEL_PDE(level,
> +		iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
> +
> +	WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
> 
> 	ret = true;
> 
> @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
> 		      bool *updated)
> {
> 	int level, end_lvl;
> -	u64 *pte, *page;
> +	u64 *pte, *page, *pt_root, *root;
> 
> 	BUG_ON(!is_power_of_2(page_size));
> 
> -	while (address > PM_LEVEL_SIZE(domain->mode))
> +	while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
> 		*updated = increase_address_space(domain, address, gfp) || *updated;
> 
> -	level   = domain->mode - 1;
> -	pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> +	pt_root = READ_ONCE(domain->pt_root);
> +	root    = (void *)iommu_get_root(pt_root);
> +	level   = iommu_get_mode(pt_root) - 1;
> +	pte     = &root[PM_LEVEL_INDEX(level, address)];
> 	address = PAGE_SIZE_ALIGN(address, page_size);
> 	end_lvl = PAGE_SIZE_LEVEL(page_size);
> 
> @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
> 		      unsigned long address,
> 		      unsigned long *page_size)
> {
> -	int level;
> 	u64 *pte;
> +	u64 *pt_root = READ_ONCE(domain->pt_root);
> +	u64 *root    = (void *)iommu_get_root(pt_root);
> +	int level    = iommu_get_mode(pt_root);
> 
> 	*page_size = 0;
> 
> -	if (address > PM_LEVEL_SIZE(domain->mode))
> +	if (address > PM_LEVEL_SIZE(level))
> 		return NULL;
> 
> -	level	   =  domain->mode - 1;
> -	pte	   = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
> +	level--;
> +	pte	   = &root[PM_LEVEL_INDEX(level, address)];
> 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
> 
> 	while (level > 0) {
> @@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void)
> 	if (protection_domain_init(domain))
> 		goto free_domain;
> 
> -	domain->mode = PAGE_MODE_3_LEVEL;
> 	domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
> 	domain->flags = PD_DMA_OPS_MASK;
> 	if (!domain->pt_root)
> 		goto free_domain;
> 
> +	domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL);
> +
> 	if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
> 		goto free_domain;
> 
> @@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain,
> 	u64 flags = 0;
> 	u32 old_domid;
> 
> -	if (domain->mode != PAGE_MODE_NONE)
> -		pte_root = iommu_virt_to_phys(domain->pt_root);
> +	if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> +		pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root));
> 
> -	pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
> +	pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK)
> 		    << DEV_ENTRY_MODE_SHIFT;
> 	pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
> 
> @@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> 		if (!pdomain)
> 			return NULL;
> 
> -		pdomain->mode    = PAGE_MODE_3_LEVEL;
> 		pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
> 		if (!pdomain->pt_root) {
> 			protection_domain_free(pdomain);
> 			return NULL;
> 		}
> 
> +		pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
> +						  PAGE_MODE_3_LEVEL);
> 		pdomain->domain.geometry.aperture_start = 0;
> 		pdomain->domain.geometry.aperture_end   = ~0ULL;
> 		pdomain->domain.geometry.force_aperture = true;
> @@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> 		if (!pdomain)
> 			return NULL;
> 
> -		pdomain->mode = PAGE_MODE_NONE;
> +		pdomain->pt_root = iommu_set_mode(pdomain->pt_root,
> +						  PAGE_MODE_NONE);
> 		break;
> 	default:
> 		return NULL;
> @@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
> 		dma_ops_domain_free(domain);
> 		break;
> 	default:
> -		if (domain->mode != PAGE_MODE_NONE)
> +		if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> 			free_pagetable(domain);
> 
> 		if (domain->flags & PD_IOMMUV2_MASK)
> @@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
> 	int prot = 0;
> 	int ret;
> 
> -	if (domain->mode == PAGE_MODE_NONE)
> +	if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> 		return -EINVAL;
> 
> 	if (iommu_prot & IOMMU_READ)
> @@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
> {
> 	struct protection_domain *domain = to_pdomain(dom);
> 
> -	if (domain->mode == PAGE_MODE_NONE)
> +	if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> 		return 0;
> 
> 	return iommu_unmap_page(domain, iova, page_size);
> @@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
> 	unsigned long offset_mask, pte_pgsize;
> 	u64 *pte, __pte;
> 
> -	if (domain->mode == PAGE_MODE_NONE)
> +	if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE)
> 		return iova;
> 
> 	pte = fetch_pte(domain, iova, &pte_pgsize);
> @@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom)
> 	spin_lock_irqsave(&domain->lock, flags);
> 
> 	/* Update data structure */
> -	domain->mode    = PAGE_MODE_NONE;
> +	domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE);
> 
> 	/* Make changes visible to IOMMUs */
> 	update_domain(domain);
> @@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
> {
> 	u64 *pte;
> 
> -	if (domain->mode != PAGE_MODE_NONE)
> +	if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> 		return -EINVAL;
> 
> 	pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
> @@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid)
> {
> 	u64 *pte;
> 
> -	if (domain->mode != PAGE_MODE_NONE)
> +	if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE)
> 		return -EINVAL;
> 
> 	pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
> diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h
> index 92c2ba6468a0..34d4dd66cf9b 100644
> --- a/drivers/iommu/amd_iommu_proto.h
> +++ b/drivers/iommu/amd_iommu_proto.h
> @@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
> extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid,
> 				  int status, int tag);
> 
> +static inline int iommu_get_mode(void *pt_root)
> +{
> +	return (unsigned long)pt_root & ~PAGE_MASK;
> +}
> +
> +static inline unsigned long iommu_get_root(void *pt_root)
> +{
> +	return (unsigned long)pt_root & PAGE_MASK;
> +}
> +
> +static inline void *iommu_set_mode(void *pt_root, int mode)
> +{
> +	return (void *)(iommu_get_root(pt_root) + mode);
> +}
> +
> static inline bool is_rd890_iommu(struct pci_dev *pdev)
> {
> 	return (pdev->vendor == PCI_VENDOR_ID_ATI) &&
> diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
> index ca8c4522045b..221adefa56a0 100644
> --- a/drivers/iommu/amd_iommu_types.h
> +++ b/drivers/iommu/amd_iommu_types.h
> @@ -468,8 +468,8 @@ struct protection_domain {
> 				       iommu core code */
> 	spinlock_t lock;	/* mostly used to lock the page table*/
> 	u16 id;			/* the domain id written to the device table */
> -	int mode;		/* paging mode (0-6 levels) */
> -	u64 *pt_root;		/* page table root pointer */
> +	u64 *pt_root;		/* page table root pointer and paging mode (0-6
> +				   levels) */
> 	int glx;		/* Number of levels for GCR3 table */
> 	u64 *gcr3_tbl;		/* Guest CR3 table */
> 	unsigned long flags;	/* flags to find out type of domain */
> -- 
> 2.21.0 (Apple Git-122.2)
Joerg Roedel April 18, 2020, 12:10 p.m. UTC | #5
On Thu, Apr 16, 2020 at 09:42:41PM -0400, Qian Cai wrote:
> So, this is still not enough that would still trigger storage driver offline under
> memory pressure for a bit longer. It looks to me that in fetch_pte() there are
> could still racy?

Yes, your patch still looks racy. You need to atomically read
domain->pt_root to a stack variable and derive the pt_root pointer and
the mode from that variable instead of domain->pt_root directly. If you
read the domain->pt_root twice there could still be an update between
the two reads.
Probably the lock in increase_address_space() can also be avoided if
pt_root is updated using cmpxchg()?

Regards,

	Joerg
Qian Cai April 18, 2020, 1:01 p.m. UTC | #6
> On Apr 18, 2020, at 8:10 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> Yes, your patch still looks racy. You need to atomically read
> domain->pt_root to a stack variable and derive the pt_root pointer and
> the mode from that variable instead of domain->pt_root directly. If you
> read the domain->pt_root twice there could still be an update between
> the two reads.
> Probably the lock in increase_address_space() can also be avoided if
> pt_root is updated using cmpxchg()?

Hard to tell without testing further. I’ll leave that optimization in the future, and focus on fixing those races first.
Joerg Roedel April 18, 2020, 6:34 p.m. UTC | #7
On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
> Hard to tell without testing further. I’ll leave that optimization in
> the future, and focus on fixing those races first.

Yeah right, we should fix the existing races first before introducing
new ones ;)

Btw, THANKS A LOT for tracking down all these race condition bugs, I am
not even remotely able to trigger them with the hardware I have around.

I did some hacking and the attached diff shows how I think this race
condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
but did no further testing. Can you test it please?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..28229a38af4d 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -151,6 +151,26 @@ static struct protection_domain *to_pdomain(struct iommu_domain *dom)
 	return container_of(dom, struct protection_domain, domain);
 }
 
+static void amd_iommu_domain_get_pgtable(struct protection_domain *domain,
+					 struct domain_pgtable *pgtable)
+{
+	u64 pt_root = atomic64_read(&domain->pt_root);
+
+	pgtable->root = (u64 *)(pt_root & PAGE_MASK);
+	pgtable->mode = pt_root & 7; /* lowest 3 bits encode pgtable mode */
+}
+
+static u64 amd_iommu_domain_encode_pgtable(u64 *root, int mode)
+{
+	u64 pt_root;
+
+	/* lowest 3 bits encode pgtable mode */
+	pt_root = mode & 7;
+	pt_root |= (u64)root;
+
+	return pt_root;
+}
+
 static struct iommu_dev_data *alloc_dev_data(u16 devid)
 {
 	struct iommu_dev_data *dev_data;
@@ -1397,13 +1417,18 @@ static struct page *free_sub_pt(unsigned long root, int mode,
 
 static void free_pagetable(struct protection_domain *domain)
 {
-	unsigned long root = (unsigned long)domain->pt_root;
+	struct domain_pgtable pgtable;
 	struct page *freelist = NULL;
+	unsigned long root;
+
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+	atomic64_set(&domain->pt_root, 0);
 
-	BUG_ON(domain->mode < PAGE_MODE_NONE ||
-	       domain->mode > PAGE_MODE_6_LEVEL);
+	BUG_ON(pgtable.mode < PAGE_MODE_NONE ||
+	       pgtable.mode > PAGE_MODE_6_LEVEL);
 
-	freelist = free_sub_pt(root, domain->mode, freelist);
+	root = (unsigned long)pgtable.root;
+	freelist = free_sub_pt(root, pgtable.mode, freelist);
 
 	free_page_list(freelist);
 }
@@ -1417,24 +1442,28 @@ static bool increase_address_space(struct protection_domain *domain,
 				   unsigned long address,
 				   gfp_t gfp)
 {
+	struct domain_pgtable pgtable;
 	unsigned long flags;
 	bool ret = false;
-	u64 *pte;
+	u64 *pte, root;
 
 	spin_lock_irqsave(&domain->lock, flags);
 
-	if (address <= PM_LEVEL_SIZE(domain->mode) ||
-	    WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+	if (address <= PM_LEVEL_SIZE(pgtable.mode) ||
+	    WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL))
 		goto out;
 
 	pte = (void *)get_zeroed_page(gfp);
 	if (!pte)
 		goto out;
 
-	*pte             = PM_LEVEL_PDE(domain->mode,
-					iommu_virt_to_phys(domain->pt_root));
-	domain->pt_root  = pte;
-	domain->mode    += 1;
+	*pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));
+
+	root = amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1);
+
+	atomic64_set(&domain->pt_root, root);
 
 	ret = true;
 
@@ -1451,16 +1480,22 @@ static u64 *alloc_pte(struct protection_domain *domain,
 		      gfp_t gfp,
 		      bool *updated)
 {
+	struct domain_pgtable pgtable;
 	int level, end_lvl;
 	u64 *pte, *page;
 
 	BUG_ON(!is_power_of_2(page_size));
 
-	while (address > PM_LEVEL_SIZE(domain->mode))
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+	while (address > PM_LEVEL_SIZE(pgtable.mode)) {
 		*updated = increase_address_space(domain, address, gfp) || *updated;
+		amd_iommu_domain_get_pgtable(domain, &pgtable);
+	}
+
 
-	level   = domain->mode - 1;
-	pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+	level   = pgtable.mode - 1;
+	pte     = &pgtable.root[PM_LEVEL_INDEX(level, address)];
 	address = PAGE_SIZE_ALIGN(address, page_size);
 	end_lvl = PAGE_SIZE_LEVEL(page_size);
 
@@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
 		      unsigned long address,
 		      unsigned long *page_size)
 {
+	struct domain_pgtable pgtable;
 	int level;
 	u64 *pte;
 
 	*page_size = 0;
 
-	if (address > PM_LEVEL_SIZE(domain->mode))
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+	if (address > PM_LEVEL_SIZE(pgtable.mode))
 		return NULL;
 
-	level	   =  domain->mode - 1;
-	pte	   = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
+	level	   =  pgtable.mode - 1;
+	pte	   = &pgtable.root[PM_LEVEL_INDEX(level, address)];
 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
 
 	while (level > 0) {
@@ -1806,6 +1844,7 @@ static void dma_ops_domain_free(struct protection_domain *domain)
 static struct protection_domain *dma_ops_domain_alloc(void)
 {
 	struct protection_domain *domain;
+	u64 *pt_root, root;
 
 	domain = kzalloc(sizeof(struct protection_domain), GFP_KERNEL);
 	if (!domain)
@@ -1814,12 +1853,14 @@ static struct protection_domain *dma_ops_domain_alloc(void)
 	if (protection_domain_init(domain))
 		goto free_domain;
 
-	domain->mode = PAGE_MODE_3_LEVEL;
-	domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
-	domain->flags = PD_DMA_OPS_MASK;
-	if (!domain->pt_root)
+	pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+	if (!pt_root)
 		goto free_domain;
 
+	root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL);
+	atomic64_set(&domain->pt_root, root);
+	domain->flags = PD_DMA_OPS_MASK;
+
 	if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM)
 		goto free_domain;
 
@@ -1843,14 +1884,17 @@ static bool dma_ops_domain(struct protection_domain *domain)
 static void set_dte_entry(u16 devid, struct protection_domain *domain,
 			  bool ats, bool ppr)
 {
+	struct domain_pgtable pgtable;
 	u64 pte_root = 0;
 	u64 flags = 0;
 	u32 old_domid;
 
-	if (domain->mode != PAGE_MODE_NONE)
-		pte_root = iommu_virt_to_phys(domain->pt_root);
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+	if (pgtable.mode != PAGE_MODE_NONE)
+		pte_root = iommu_virt_to_phys(pgtable.root);
 
-	pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
+	pte_root |= (pgtable.mode & DEV_ENTRY_MODE_MASK)
 		    << DEV_ENTRY_MODE_SHIFT;
 	pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV;
 
@@ -2375,6 +2419,7 @@ static struct protection_domain *protection_domain_alloc(void)
 static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
 {
 	struct protection_domain *pdomain;
+	u64 *pt_root, root;
 
 	switch (type) {
 	case IOMMU_DOMAIN_UNMANAGED:
@@ -2382,13 +2427,15 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
 		if (!pdomain)
 			return NULL;
 
-		pdomain->mode    = PAGE_MODE_3_LEVEL;
-		pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
-		if (!pdomain->pt_root) {
+		pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+		if (!pt_root) {
 			protection_domain_free(pdomain);
 			return NULL;
 		}
 
+		root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL);
+		atomic64_set(&pdomain->pt_root, root);
+
 		pdomain->domain.geometry.aperture_start = 0;
 		pdomain->domain.geometry.aperture_end   = ~0ULL;
 		pdomain->domain.geometry.force_aperture = true;
@@ -2406,7 +2453,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
 		if (!pdomain)
 			return NULL;
 
-		pdomain->mode = PAGE_MODE_NONE;
+		atomic64_set(&pdomain->pt_root, PAGE_MODE_NONE);
 		break;
 	default:
 		return NULL;
@@ -2418,6 +2465,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
 static void amd_iommu_domain_free(struct iommu_domain *dom)
 {
 	struct protection_domain *domain;
+	struct domain_pgtable pgtable;
 
 	domain = to_pdomain(dom);
 
@@ -2435,7 +2483,9 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
 		dma_ops_domain_free(domain);
 		break;
 	default:
-		if (domain->mode != PAGE_MODE_NONE)
+		amd_iommu_domain_get_pgtable(domain, &pgtable);
+
+		if (pgtable.mode != PAGE_MODE_NONE)
 			free_pagetable(domain);
 
 		if (domain->flags & PD_IOMMUV2_MASK)
@@ -2518,10 +2568,12 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 			 gfp_t gfp)
 {
 	struct protection_domain *domain = to_pdomain(dom);
+	struct domain_pgtable pgtable;
 	int prot = 0;
 	int ret;
 
-	if (domain->mode == PAGE_MODE_NONE)
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+	if (pgtable.mode == PAGE_MODE_NONE)
 		return -EINVAL;
 
 	if (iommu_prot & IOMMU_READ)
@@ -2541,8 +2593,10 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 			      struct iommu_iotlb_gather *gather)
 {
 	struct protection_domain *domain = to_pdomain(dom);
+	struct domain_pgtable pgtable;
 
-	if (domain->mode == PAGE_MODE_NONE)
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+	if (pgtable.mode == PAGE_MODE_NONE)
 		return 0;
 
 	return iommu_unmap_page(domain, iova, page_size);
@@ -2553,9 +2607,11 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
 {
 	struct protection_domain *domain = to_pdomain(dom);
 	unsigned long offset_mask, pte_pgsize;
+	struct domain_pgtable pgtable;
 	u64 *pte, __pte;
 
-	if (domain->mode == PAGE_MODE_NONE)
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+	if (pgtable.mode == PAGE_MODE_NONE)
 		return iova;
 
 	pte = fetch_pte(domain, iova, &pte_pgsize);
@@ -2708,16 +2764,26 @@ EXPORT_SYMBOL(amd_iommu_unregister_ppr_notifier);
 void amd_iommu_domain_direct_map(struct iommu_domain *dom)
 {
 	struct protection_domain *domain = to_pdomain(dom);
+	struct domain_pgtable pgtable;
 	unsigned long flags;
+	u64 pt_root;
 
 	spin_lock_irqsave(&domain->lock, flags);
 
+	/* First save pgtable configuration*/
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+
 	/* Update data structure */
-	domain->mode    = PAGE_MODE_NONE;
+	pt_root = amd_iommu_domain_encode_pgtable(NULL, PAGE_MODE_NONE);
+	atomic64_set(&domain->pt_root, pt_root);
 
 	/* Make changes visible to IOMMUs */
 	update_domain(domain);
 
+	/* Restore old pgtable in domain->ptroot to free page-table */
+	pt_root = amd_iommu_domain_encode_pgtable(pgtable.root, pgtable.mode);
+	atomic64_set(&domain->pt_root, pt_root);
+
 	/* Page-table is not visible to IOMMU anymore, so free it */
 	free_pagetable(domain);
 
@@ -2908,9 +2974,11 @@ static u64 *__get_gcr3_pte(u64 *root, int level, int pasid, bool alloc)
 static int __set_gcr3(struct protection_domain *domain, int pasid,
 		      unsigned long cr3)
 {
+	struct domain_pgtable pgtable;
 	u64 *pte;
 
-	if (domain->mode != PAGE_MODE_NONE)
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+	if (pgtable.mode != PAGE_MODE_NONE)
 		return -EINVAL;
 
 	pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true);
@@ -2924,9 +2992,11 @@ static int __set_gcr3(struct protection_domain *domain, int pasid,
 
 static int __clear_gcr3(struct protection_domain *domain, int pasid)
 {
+	struct domain_pgtable pgtable;
 	u64 *pte;
 
-	if (domain->mode != PAGE_MODE_NONE)
+	amd_iommu_domain_get_pgtable(domain, &pgtable);
+	if (pgtable.mode != PAGE_MODE_NONE)
 		return -EINVAL;
 
 	pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false);
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index ca8c4522045b..7a8fdec138bd 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -468,8 +468,7 @@ struct protection_domain {
 				       iommu core code */
 	spinlock_t lock;	/* mostly used to lock the page table*/
 	u16 id;			/* the domain id written to the device table */
-	int mode;		/* paging mode (0-6 levels) */
-	u64 *pt_root;		/* page table root pointer */
+	atomic64_t pt_root;	/* pgtable root and pgtable mode */
 	int glx;		/* Number of levels for GCR3 table */
 	u64 *gcr3_tbl;		/* Guest CR3 table */
 	unsigned long flags;	/* flags to find out type of domain */
@@ -477,6 +476,12 @@ struct protection_domain {
 	unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */
 };
 
+/* For decocded pt_root */
+struct domain_pgtable {
+	int mode;
+	u64 *root;
+};
+
 /*
  * Structure where we save information about one hardware AMD IOMMU in the
  * system.
Qian Cai April 20, 2020, 2:07 a.m. UTC | #8
> On Apr 18, 2020, at 2:34 PM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization in
>> the future, and focus on fixing those races first.
> 
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
> 
> Btw, THANKS A LOT for tracking down all these race condition bugs, I am
> not even remotely able to trigger them with the hardware I have around.
> 
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
> but did no further testing. Can you test it please?

Sure, give it a few days to see if it could survive.
Qian Cai April 20, 2020, 1:26 p.m. UTC | #9
> On Apr 18, 2020, at 2:34 PM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote:
>> Hard to tell without testing further. I’ll leave that optimization in
>> the future, and focus on fixing those races first.
> 
> Yeah right, we should fix the existing races first before introducing
> new ones ;)
> 
> Btw, THANKS A LOT for tracking down all these race condition bugs, I am
> not even remotely able to trigger them with the hardware I have around.
> 
> I did some hacking and the attached diff shows how I think this race
> condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1,
> but did no further testing. Can you test it please?

No dice. There could be some other races. For example,

> @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain,
> 		      unsigned long address,
> 		      unsigned long *page_size)
...
> 	amd_iommu_domain_get_pgtable(domain, &pgtable);
> 
> 	if (address > PM_LEVEL_SIZE(pgtable.mode))
> 		return NULL;
> 
> 	level	   =  pgtable.mode - 1;
> 	pte	   = &pgtable.root[PM_LEVEL_INDEX(level, address)];

<— increase_address_space()

> 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
> 

	while (level > 0) {
		*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then in iommu_unmap_page(),

	while (unmapped < page_size) {
		pte = fetch_pte(dom, bus_addr, &unmap_size);
		…
		bus_addr  = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be unsync with dom->mode when it enter fetch_pte() again.
Could that be a problem?


[ 5159.274829][ T4057] LTP: starting oom02
[ 5160.382787][   C52] perf: interrupt took too long (7443 > 6208), lowering kernel.perf_event_max_sample_rate to 26800
[ 5167.951785][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc0000 flags=0x0010]
[ 5167.964540][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1000 flags=0x0010]
[ 5167.977442][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1900 flags=0x0010]
[ 5167.989901][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1d00 flags=0x0010]
[ 5168.002291][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2000 flags=0x0010]
[ 5168.014665][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2400 flags=0x0010]
[ 5168.027132][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2800 flags=0x0010]
[ 5168.039566][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2c00 flags=0x0010]
[ 5168.051956][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3000 flags=0x0010]
[ 5168.064310][  T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3400 flags=0x0010]
[ 5168.076652][  T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3800 flags=0x0010]
[ 5168.088290][  T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3c00 flags=0x0010]
[ 5183.695829][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803
[ 5183.704390][    C8] smartpqi 0000:23:00.0: controller offline
[ 5183.756594][  C101] blk_update_request: I/O error, dev sda, sector 22306304 op 0x1:(WRITE) flags 0x8000000 phys_seg 4 prio class 0
[ 5183.756628][   C34] sd 0:1:0:0: [sda] tag#655 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756759][   C56] blk_update_request: I/O error, dev sda, sector 58480128 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756810][   C79] sd 0:1:0:0: [sda] tag#234 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756816][  C121] sd 0:1:0:0: [sda] tag#104 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756837][   T53] sd 0:1:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.756882][  C121] sd 0:1:0:0: [sda] tag#104 CDB: opcode=0x2a 2a 00 00 4d d4 00 00 02 00 00
[ 5183.756892][   C79] sd 0:1:0:0: [sda] tag#234 CDB: opcode=0x2a 2a 00 02 03 e4 00 00 02 00 00
[ 5183.756909][  C121] blk_update_request: I/O error, dev sda, sector 5100544 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756920][   C79] blk_update_request: I/O error, dev sda, sector 33809408 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756939][   T53] sd 0:1:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 02 4b f8 00 00 02 00 00
[ 5183.756967][   T53] blk_update_request: I/O error, dev sda, sector 38533120 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.756989][   C49] blk_update_request: I/O error, dev sda, sector 30181376 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757045][   C51] sd 0:1:0:0: [sda] tag#452 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757107][   C51] sd 0:1:0:0: [sda] tag#452 CDB: opcode=0x2a 2a 00 02 95 06 00 00 02 00 00
[ 5183.757125][   C51] blk_update_request: I/O error, dev sda, sector 43320832 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757199][   C82] blk_update_request: I/O error, dev sda, sector 10187776 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757209][  C109] blk_update_request: I/O error, dev sda, sector 29812736 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757215][   C77] sd 0:1:0:0: [sda] tag#849 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757222][  C110] sd 0:1:0:0: [sda] tag#558 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757237][   C92] blk_update_request: I/O error, dev sda, sector 6410240 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0
[ 5183.757244][   C91] sd 0:1:0:0: [sda] tag#73 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757251][   C68] sd 0:1:0:0: [sda] tag#416 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757458][   C77] sd 0:1:0:0: [sda] tag#849 CDB: opcode=0x2a 2a 00 02 78 a4 00 00 02 00 00
[ 5183.757467][  C110] sd 0:1:0:0: [sda] tag#558 CDB: opcode=0x2a 2a 00 03 58 94 00 00 02 00 00
[ 5183.757515][  C122] sd 0:1:0:0: [sda] tag#747 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s
[ 5183.757525][   C68] sd 0:1:0:0: [sda] tag#416 CDB: opcode=0x2a 2a 00 01 0e 32 00 00 02 00 00
[ 5183.757536][   C91] sd 0:1:0:0: [sda] tag#73 CDB: opcode=0x2a 2a 00 01 a2 86 00 00 02 00 00
[ 5183.757727][  C122] sd 0:1:0:0: [sda] tag#747 CDB: opcode=0x2a 2a 00 02 a7 24 00 00 02 00 00
[ 5183.758530][   T53] Write-error on swap-device (254:1:64823296)
[ 5183.758758][   T53] Write-error on swap-device (254:1:35201024)
[ 5183.758811][  C105] Write-error on swap-device (254:1:52690944)
[ 5183.758959][   C82] Write-error on swap-device (254:1:6856704)
Joerg Roedel April 29, 2020, 8:47 a.m. UTC | #10
Hi Qian,

On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
> 
> No dice. There could be some other races. For example,

Okay, I think I know what is happening. The increase_address_space()
function increases the address space, but does not update the
DTE and does not flush the old DTE from the caches. But this needs to
happen before domain->pt_root is updated, because otherwise another CPU
can come along and map something into the increased address-space which
is not yet accessible by the device because the DTE is not updated yet.

Regards,

	Joerg
Joerg Roedel April 29, 2020, 11:20 a.m. UTC | #11
On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
> No dice. There could be some other races. For example,

Can you please test this branch:

	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes

It has the previous fix in it and a couple more to make sure the
device-table is updated and flushed before increase_address_space()
updates domain->pt_root.

Thanks,

	Joerg
Qian Cai April 30, 2020, 1:04 a.m. UTC | #12
> On Apr 29, 2020, at 7:20 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
> 
> Can you please test this branch:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> 
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.

It looks promising as it survives for a day’s stress testing. I’ll keep it running for a few days to be sure.
Qian Cai May 3, 2020, 1:04 p.m. UTC | #13
> On Apr 29, 2020, at 7:20 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote:
>> No dice. There could be some other races. For example,
> 
> Can you please test this branch:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> 
> It has the previous fix in it and a couple more to make sure the
> device-table is updated and flushed before increase_address_space()
> updates domain->pt_root.

I believe this closed the existing races as it had survived for many days. Great work!
Joerg Roedel May 3, 2020, 6:39 p.m. UTC | #14
Hi Qian,

On Sun, May 03, 2020 at 09:04:03AM -0400, Qian Cai wrote:
> > On Apr 29, 2020, at 7:20 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > Can you please test this branch:
> > 
> > 	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes
> > 
> > It has the previous fix in it and a couple more to make sure the
> > device-table is updated and flushed before increase_address_space()
> > updates domain->pt_root.
> 
> I believe this closed the existing races as it had survived for many
> days. Great work!

Thanks a lot for testing these changes! Can I add your Tested-by when I
send them to the mailing list tomorrow?

Regards,

	Joerg
Qian Cai May 3, 2020, 7:12 p.m. UTC | #15
> On May 3, 2020, at 2:39 PM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> Can I add your Tested-by when I
> send them to the mailing list tomorrow?

Sure. Feel free to add,

Tested-by: Qian Cai <cai@lca.pw>

Patch
diff mbox series

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..22328a23335f 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1434,6 +1434,11 @@  static bool increase_address_space(struct protection_domain *domain,
 	*pte             = PM_LEVEL_PDE(domain->mode,
 					iommu_virt_to_phys(domain->pt_root));
 	domain->pt_root  = pte;
+	/*
+	 * Make sure fetch_pte() will see the new domain->pt_root before it
+	 * snapshots domain->mode.
+	 */
+	smp_wmb();
 	domain->mode    += 1;
 
 	ret = true;
@@ -1460,6 +1465,8 @@  static u64 *alloc_pte(struct protection_domain *domain,
 		*updated = increase_address_space(domain, address, gfp) || *updated;
 
 	level   = domain->mode - 1;
+	/* To pair with smp_wmb() in increase_address_space(). */
+	smp_rmb();
 	pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
 	address = PAGE_SIZE_ALIGN(address, page_size);
 	end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1545,6 +1552,8 @@  static u64 *fetch_pte(struct protection_domain *domain,
 		return NULL;
 
 	level	   =  domain->mode - 1;
+	/* To pair with smp_wmb() in increase_address_space(). */
+	smp_rmb();
 	pte	   = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);