* [RFC PATCH] iommu/amd: fix a race in fetch_pte()
@ 2020-04-07  2:12 Qian Cai
  2020-04-07 15:36 ` Qian Cai
  0 siblings, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-04-07  2:12 UTC (permalink / raw)
  To: joro; +Cc: iommu, linux-kernel, Qian Cai

fetch_pte() can race with increase_address_space() because it is called
from iommu_unmap_page() without holding any lock. The CPU that runs
fetch_pte() can observe a stale domain->pt_root together with the new,
increased domain->mode from increase_address_space(). As a result, it can
trigger invalid accesses later on. Fix it by using a pair of smp_[w|r]mb()
in those places.

 kernel BUG at drivers/iommu/amd_iommu.c:1704!
 BUG_ON(unmapped && !is_power_of_2(unmapped));
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
 RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0
 Call Trace:
  <IRQ>
  __iommu_unmap+0x106/0x320
  iommu_unmap_fast+0xe/0x10
  __iommu_dma_unmap+0xdc/0x1a0
  iommu_dma_unmap_sg+0xae/0xd0
  scsi_dma_unmap+0xe7/0x150
  pqi_raid_io_complete+0x37/0x60 [smartpqi]
  pqi_irq_handler+0x1fc/0x13f0 [smartpqi]
  __handle_irq_event_percpu+0x78/0x4f0
  handle_irq_event_percpu+0x70/0x100
  handle_irq_event+0x5a/0x8b
  handle_edge_irq+0x10c/0x370
  do_IRQ+0x9e/0x1e0
  common_interrupt+0xf/0xf
  </IRQ>

Signed-off-by: Qian Cai <cai@lca.pw>
---
 drivers/iommu/amd_iommu.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..22328a23335f 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain,
 	*pte = PM_LEVEL_PDE(domain->mode,
 			    iommu_virt_to_phys(domain->pt_root));
 	domain->pt_root = pte;
+	/*
+	 * Make sure fetch_pte() will see the new domain->pt_root before it
+	 * snapshots domain->mode.
+	 */
+	smp_wmb();
 	domain->mode += 1;
 
 	ret = true;
@@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain,
 		*updated = increase_address_space(domain, address, gfp) || *updated;
 
 	level   = domain->mode - 1;
+	/* To pair with smp_wmb() in increase_address_space(). */
+	smp_rmb();
 	pte     = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
 	address = PAGE_SIZE_ALIGN(address, page_size);
 	end_lvl = PAGE_SIZE_LEVEL(page_size);
@@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain,
 		return NULL;
 
 	level      = domain->mode - 1;
+	/* To pair with smp_wmb() in increase_address_space(). */
+	smp_rmb();
 	pte        = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
 	*page_size = PTE_LEVEL_PAGE_SIZE(level);
-- 
2.21.0 (Apple Git-122.2)

^ permalink raw reply related	[flat|nested] 16+ messages in thread
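The smp_wmb()/smp_rmb() pairing in the patch above is the classic "publish"
pattern: the writer makes the new root visible before bumping the mode, so a
reader that sees the new mode is guaranteed to also see the new root. Below is
a minimal userspace C sketch of that ordering using C11 fences as stand-ins
for the kernel barriers; the `demo_domain` structure and function names are
invented for illustration, not the kernel's actual types, and this illustrates
the intended pairing only (the thread later shows the barriers alone were not
a sufficient fix).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-in for struct protection_domain (illustrative only). */
struct demo_domain {
	uint64_t *pt_root;	/* page-table root pointer */
	int mode;		/* number of paging levels */
};

/*
 * Writer side, as in increase_address_space(): publish the new root
 * first, then the new mode. The release fence plays the role of
 * smp_wmb() in the kernel patch.
 */
static void demo_grow(struct demo_domain *d, uint64_t *new_root)
{
	new_root[0] = (uint64_t)(uintptr_t)d->pt_root; /* link the old root */
	d->pt_root = new_root;
	atomic_thread_fence(memory_order_release);     /* ~ smp_wmb() */
	d->mode += 1;
}

/*
 * Reader side, as in fetch_pte(): snapshot mode, then pair with the
 * writer's barrier before reading pt_root. The acquire fence plays the
 * role of smp_rmb(): if we saw the new mode, we must see the new root.
 */
static int demo_snapshot(struct demo_domain *d, uint64_t **root)
{
	int mode = d->mode;
	atomic_thread_fence(memory_order_acquire);     /* ~ smp_rmb() */
	*root = d->pt_root;
	return mode;
}
```

A strict C11 version would use atomic loads/stores for the shared fields as
well; plain accesses are kept here to mirror the kernel code's shape.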
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
  2020-04-07  2:12 [RFC PATCH] iommu/amd: fix a race in fetch_pte() Qian Cai
@ 2020-04-07 15:36 ` Qian Cai
  2020-04-08 14:19   ` Joerg Roedel
  0 siblings, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-04-07 15:36 UTC (permalink / raw)
  To: joro; +Cc: iommu, linux-kernel

> On Apr 6, 2020, at 10:12 PM, Qian Cai <cai@lca.pw> wrote:
> 
> fetch_pte() could race with increase_address_space() because it held no
> lock from iommu_unmap_page(). On the CPU that runs fetch_pte() it could
> see a stale domain->pt_root and a new increased domain->mode from
> increase_address_space(). As the result, it could trigger invalid
> accesses later on. Fix it by using a pair of smp_[w|r]mb in those
> places.

After further testing, this change alone is insufficient. What I am chasing
right now is that the swap device goes offline after the heavy memory
pressure below. The symptom is similar to what we had in the commit,

754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)

Apparently, it is not possible to take the domain->lock in fetch_pte()
because it could sleep. Thoughts?
[ 7292.727377][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd80000 flags=0x0010] [ 7292.740571][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81000 flags=0x0010] [ 7292.753268][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81900 flags=0x0010] [ 7292.766078][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd81d00 flags=0x0010] [ 7292.778447][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82000 flags=0x0010] [ 7292.790724][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82400 flags=0x0010] [ 7292.803195][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82800 flags=0x0010] [ 7292.815664][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd82c00 flags=0x0010] [ 7292.828089][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83000 flags=0x0010] [ 7292.840468][ T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffd83400 flags=0x0010] [ 7292.852795][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83800 flags=0x0010] [ 7292.864566][ T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffd83c00 flags=0x0010] [ 7326.583908][ C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 [ 7326.592386][ C8] smartpqi 0000:23:00.0: controller offline [ 7326.663728][ C66] sd 0:1:0:0: [sda] tag#467 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=6s [ 7326.664321][ T2738] smartpqi 0000:23:00.0: resetting scsi 0:1:0:0 [ 7326.673849][ C66] sd 0:1:0:0: [sda] tag#467 CDB: 
opcode=0x28 28 00 02 ee 2e 60 00 00 08 00 [ 7326.680118][ T2738] smartpqi 0000:23:00.0: reset of scsi 0:1:0:0: FAILED [ 7326.688612][ C66] blk_update_request: I/O error, dev sda, sector 49163872 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 7326.695456][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.706632][ C66] Read-error on swap-device (254:1:45833824) [ 7326.714030][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.723382][T45523] sd 0:1:0:0: rejecting I/O to offline device [ 7326.727352][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.727379][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery [ 7326.727442][ T2738] sd 0:1:0:0: Device offlined - not ready after error recovery > > kernel BUG at drivers/iommu/amd_iommu.c:1704! > BUG_ON(unmapped && !is_power_of_2(unmapped)); > Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 > RIP: 0010:amd_iommu_unmap+0x1b2/0x1d0 > Call Trace: > <IRQ> > __iommu_unmap+0x106/0x320 > iommu_unmap_fast+0xe/0x10 > __iommu_dma_unmap+0xdc/0x1a0 > iommu_dma_unmap_sg+0xae/0xd0 > scsi_dma_unmap+0xe7/0x150 > pqi_raid_io_complete+0x37/0x60 [smartpqi] > pqi_irq_handler+0x1fc/0x13f0 [smartpqi] > __handle_irq_event_percpu+0x78/0x4f0 > handle_irq_event_percpu+0x70/0x100 > handle_irq_event+0x5a/0x8b > handle_edge_irq+0x10c/0x370 > do_IRQ+0x9e/0x1e0 > common_interrupt+0xf/0xf > </IRQ> > > Signed-off-by: Qian Cai <cai@lca.pw> > --- > drivers/iommu/amd_iommu.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 20cce366e951..22328a23335f 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -1434,6 +1434,11 @@ static bool increase_address_space(struct protection_domain *domain, > *pte = PM_LEVEL_PDE(domain->mode, > iommu_virt_to_phys(domain->pt_root)); > domain->pt_root = pte; > + /* > + * Make sure fetch_pte() will 
see the new domain->pt_root before it > + * snapshots domain->mode. > + */ > + smp_wmb(); > domain->mode += 1; > > ret = true; > @@ -1460,6 +1465,8 @@ static u64 *alloc_pte(struct protection_domain *domain, > *updated = increase_address_space(domain, address, gfp) || *updated; > > level = domain->mode - 1; > + /* To pair with smp_wmb() in increase_address_space(). */ > + smp_rmb(); > pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; > address = PAGE_SIZE_ALIGN(address, page_size); > end_lvl = PAGE_SIZE_LEVEL(page_size); > @@ -1545,6 +1552,8 @@ static u64 *fetch_pte(struct protection_domain *domain, > return NULL; > > level = domain->mode - 1; > + /* To pair with smp_wmb() in increase_address_space(). */ > + smp_rmb(); > pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; > *page_size = PTE_LEVEL_PAGE_SIZE(level); > > -- > 2.21.0 (Apple Git-122.2) > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-07 15:36 ` Qian Cai @ 2020-04-08 14:19 ` Joerg Roedel 2020-04-14 1:36 ` Qian Cai 0 siblings, 1 reply; 16+ messages in thread From: Joerg Roedel @ 2020-04-08 14:19 UTC (permalink / raw) To: Qian Cai; +Cc: iommu, linux-kernel Hi Qian, On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote: > After further testing, the change along is insufficient. What I am chasing right > now is the swap device will go offline after heavy memory pressure below. The > symptom is similar to what we have in the commit, > > 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”) > > Apparently, it is no possible to take the domain->lock in fetch_pte() because it > could sleep. Thanks a lot for finding and tracking down another race in the AMD IOMMU page-table code. The domain->lock is a spin-lock and taking it can't sleep. But fetch_pte() is a fast-path and must not take any locks. I think the best fix is to update the pt_root and mode of the domain atomically by storing the mode in the lower 12 bits of pt_root. This way they are stored together and can be read/write atomically. Regards, Joerg ^ permalink raw reply [flat|nested] 16+ messages in thread
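Joerg's suggestion relies on pt_root being page-aligned, so its low 12 bits
are always zero and can carry the paging mode; one aligned word can then be
read or written atomically. A userspace C sketch of that encoding follows;
the DEMO_* macros and pack/unpack helper names are made up for illustration
(the eventual kernel helpers may differ):

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/*
 * Pack a page-aligned root pointer and a paging mode (0-6) into one
 * 64-bit word. Three low bits suffice for modes 0-6, well inside the
 * 12 alignment bits available.
 */
static uint64_t demo_pack_pt_root(uint64_t *root, int mode)
{
	return ((uint64_t)(uintptr_t)root & DEMO_PAGE_MASK) | (uint64_t)mode;
}

/* Recover the root pointer by masking off the mode bits. */
static uint64_t *demo_unpack_root(uint64_t v)
{
	return (uint64_t *)(uintptr_t)(v & DEMO_PAGE_MASK);
}

/* Recover the paging mode from the low bits. */
static int demo_unpack_mode(uint64_t v)
{
	return (int)(v & 7);
}
```

Because both values live in a single word, a reader that loads it once gets a
mutually consistent root/mode pair, with no window for one to be stale.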
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-08 14:19 ` Joerg Roedel @ 2020-04-14 1:36 ` Qian Cai 2020-04-17 1:42 ` Qian Cai 0 siblings, 1 reply; 16+ messages in thread From: Qian Cai @ 2020-04-14 1:36 UTC (permalink / raw) To: Joerg Roedel; +Cc: iommu, linux-kernel > On Apr 8, 2020, at 10:19 AM, Joerg Roedel <joro@8bytes.org> wrote: > > Hi Qian, > > On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote: >> After further testing, the change along is insufficient. What I am chasing right >> now is the swap device will go offline after heavy memory pressure below. The >> symptom is similar to what we have in the commit, >> >> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”) >> >> Apparently, it is no possible to take the domain->lock in fetch_pte() because it >> could sleep. > > Thanks a lot for finding and tracking down another race in the AMD IOMMU > page-table code. The domain->lock is a spin-lock and taking it can't > sleep. But fetch_pte() is a fast-path and must not take any locks. > > I think the best fix is to update the pt_root and mode of the domain > atomically by storing the mode in the lower 12 bits of pt_root. This way > they are stored together and can be read/write atomically. Like this? 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 20cce366e951..b36c6b07cbfd 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode, static void free_pagetable(struct protection_domain *domain) { - unsigned long root = (unsigned long)domain->pt_root; + int level = iommu_get_mode(domain->pt_root); + unsigned long root = iommu_get_root(domain->pt_root); struct page *freelist = NULL; - BUG_ON(domain->mode < PAGE_MODE_NONE || - domain->mode > PAGE_MODE_6_LEVEL); + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL); - freelist = free_sub_pt(root, domain->mode, freelist); + freelist = free_sub_pt(root, level, freelist); free_page_list(freelist); } @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain, unsigned long address, gfp_t gfp) { + int level; unsigned long flags; bool ret = false; u64 *pte; spin_lock_irqsave(&domain->lock, flags); - if (address <= PM_LEVEL_SIZE(domain->mode) || - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL)) + level = iommu_get_mode(domain->pt_root); + + if (address <= PM_LEVEL_SIZE(level) || + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL)) goto out; pte = (void *)get_zeroed_page(gfp); if (!pte) goto out; - *pte = PM_LEVEL_PDE(domain->mode, - iommu_virt_to_phys(domain->pt_root)); - domain->pt_root = pte; - domain->mode += 1; + *pte = PM_LEVEL_PDE(level, + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root))); + + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1); ret = true; @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain, bool *updated) { int level, end_lvl; - u64 *pte, *page; + u64 *pte, *page, *pt_root, *root; BUG_ON(!is_power_of_2(page_size)); - while (address > PM_LEVEL_SIZE(domain->mode)) + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root))) *updated = increase_address_space(domain, address, gfp) || *updated; - 
level = domain->mode - 1; - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; + pt_root = READ_ONCE(domain->pt_root); + root = (void *)iommu_get_root(pt_root); + level = iommu_get_mode(pt_root) - 1; + pte = &root[PM_LEVEL_INDEX(level, address)]; address = PAGE_SIZE_ALIGN(address, page_size); end_lvl = PAGE_SIZE_LEVEL(page_size); @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain, unsigned long address, unsigned long *page_size) { - int level; u64 *pte; + u64 *pt_root = READ_ONCE(domain->pt_root); + u64 *root = (void *)iommu_get_root(pt_root); + int level = iommu_get_mode(pt_root); *page_size = 0; - if (address > PM_LEVEL_SIZE(domain->mode)) + if (address > PM_LEVEL_SIZE(level)) return NULL; - level = domain->mode - 1; - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; + level--; + pte = &root[PM_LEVEL_INDEX(level, address)]; *page_size = PTE_LEVEL_PAGE_SIZE(level); while (level > 0) { @@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void) if (protection_domain_init(domain)) goto free_domain; - domain->mode = PAGE_MODE_3_LEVEL; domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); domain->flags = PD_DMA_OPS_MASK; if (!domain->pt_root) goto free_domain; + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL); + if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM) goto free_domain; @@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain, u64 flags = 0; u32 old_domid; - if (domain->mode != PAGE_MODE_NONE) - pte_root = iommu_virt_to_phys(domain->pt_root); + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) + pte_root = iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)); - pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK) + pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK) << DEV_ENTRY_MODE_SHIFT; pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV; @@ -2382,13 +2390,14 @@ static struct iommu_domain 
*amd_iommu_domain_alloc(unsigned type) if (!pdomain) return NULL; - pdomain->mode = PAGE_MODE_3_LEVEL; pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); if (!pdomain->pt_root) { protection_domain_free(pdomain); return NULL; } + pdomain->pt_root = iommu_set_mode(pdomain->pt_root, + PAGE_MODE_3_LEVEL); pdomain->domain.geometry.aperture_start = 0; pdomain->domain.geometry.aperture_end = ~0ULL; pdomain->domain.geometry.force_aperture = true; @@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) if (!pdomain) return NULL; - pdomain->mode = PAGE_MODE_NONE; + pdomain->pt_root = iommu_set_mode(pdomain->pt_root, + PAGE_MODE_NONE); break; default: return NULL; @@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom) dma_ops_domain_free(domain); break; default: - if (domain->mode != PAGE_MODE_NONE) + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) free_pagetable(domain); if (domain->flags & PD_IOMMUV2_MASK) @@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova, int prot = 0; int ret; - if (domain->mode == PAGE_MODE_NONE) + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE) return -EINVAL; if (iommu_prot & IOMMU_READ) @@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova, { struct protection_domain *domain = to_pdomain(dom); - if (domain->mode == PAGE_MODE_NONE) + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE) return 0; return iommu_unmap_page(domain, iova, page_size); @@ -2555,7 +2565,7 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom, unsigned long offset_mask, pte_pgsize; u64 *pte, __pte; - if (domain->mode == PAGE_MODE_NONE) + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE) return iova; pte = fetch_pte(domain, iova, &pte_pgsize); @@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom) spin_lock_irqsave(&domain->lock, flags); /* Update data structure */ - 
domain->mode = PAGE_MODE_NONE; + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE); /* Make changes visible to IOMMUs */ update_domain(domain); @@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid, { u64 *pte; - if (domain->mode != PAGE_MODE_NONE) + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) return -EINVAL; pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true); @@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid) { u64 *pte; - if (domain->mode != PAGE_MODE_NONE) + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) return -EINVAL; pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false); diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h index 92c2ba6468a0..34d4dd66cf9b 100644 --- a/drivers/iommu/amd_iommu_proto.h +++ b/drivers/iommu/amd_iommu_proto.h @@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu) extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid, int status, int tag); +static inline int iommu_get_mode(void *pt_root) +{ + return (unsigned long)pt_root & ~PAGE_MASK; +} + +static inline unsigned long iommu_get_root(void *pt_root) +{ + return (unsigned long)pt_root & PAGE_MASK; +} + +static inline void *iommu_set_mode(void *pt_root, int mode) +{ + return (void *)(iommu_get_root(pt_root) + mode); +} + static inline bool is_rd890_iommu(struct pci_dev *pdev) { return (pdev->vendor == PCI_VENDOR_ID_ATI) && diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h index ca8c4522045b..221adefa56a0 100644 --- a/drivers/iommu/amd_iommu_types.h +++ b/drivers/iommu/amd_iommu_types.h @@ -468,8 +468,8 @@ struct protection_domain { iommu core code */ spinlock_t lock; /* mostly used to lock the page table*/ u16 id; /* the domain id written to the device table */ - int mode; /* paging mode (0-6 levels) */ - u64 *pt_root; /* page table root pointer */ + 
u64 *pt_root; /* page table root pointer and paging mode (0-6 + levels) */ int glx; /* Number of levels for GCR3 table */ u64 *gcr3_tbl; /* Guest CR3 table */ unsigned long flags; /* flags to find out type of domain */ -- 2.21.0 (Apple Git-122.2) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
  2020-04-14  1:36 ` Qian Cai
@ 2020-04-17  1:42   ` Qian Cai
  2020-04-18 12:10     ` Joerg Roedel
  0 siblings, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-04-17  1:42 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

> On Apr 13, 2020, at 9:36 PM, Qian Cai <cai@lca.pw> wrote:
> 
>> On Apr 8, 2020, at 10:19 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> 
>> Hi Qian,
>> 
>> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>>> After further testing, the change along is insufficient. What I am chasing right
>>> now is the swap device will go offline after heavy memory pressure below. The
>>> symptom is similar to what we have in the commit,
>>> 
>>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>>> 
>>> Apparently, it is no possible to take the domain->lock in fetch_pte() because it
>>> could sleep.
>> 
>> Thanks a lot for finding and tracking down another race in the AMD IOMMU
>> page-table code. The domain->lock is a spin-lock and taking it can't
>> sleep. But fetch_pte() is a fast-path and must not take any locks.
>> 
>> I think the best fix is to update the pt_root and mode of the domain
>> atomically by storing the mode in the lower 12 bits of pt_root. This way
>> they are stored together and can be read/write atomically.
> 
> Like this?

So, this is still not enough; it would still trigger storage-driver offline
under memory pressure, just a bit later. It looks to me that fetch_pte()
could still be racy here:

	level = domain->mode - 1;
	pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
	<-- increase_address_space();
	*page_size = PTE_LEVEL_PAGE_SIZE(level);

	while (level > 0) {
		*page_size = PTE_LEVEL_PAGE_SIZE(level);

Then, in iommu_unmap_page(),

	while (unmapped < page_size) {
		pte = fetch_pte(dom, bus_addr, &unmap_size);
		…
		bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size;

bus_addr would be out of sync with dom->mode when it enters fetch_pte()
again.
Could that be a problem? > > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 20cce366e951..b36c6b07cbfd 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int mode, > > static void free_pagetable(struct protection_domain *domain) > { > - unsigned long root = (unsigned long)domain->pt_root; > + int level = iommu_get_mode(domain->pt_root); > + unsigned long root = iommu_get_root(domain->pt_root); > struct page *freelist = NULL; > > - BUG_ON(domain->mode < PAGE_MODE_NONE || > - domain->mode > PAGE_MODE_6_LEVEL); > + BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL); > > - freelist = free_sub_pt(root, domain->mode, freelist); > + freelist = free_sub_pt(root, level, freelist); > > free_page_list(freelist); > } > @@ -1417,24 +1417,27 @@ static bool increase_address_space(struct protection_domain *domain, > unsigned long address, > gfp_t gfp) > { > + int level; > unsigned long flags; > bool ret = false; > u64 *pte; > > spin_lock_irqsave(&domain->lock, flags); > > - if (address <= PM_LEVEL_SIZE(domain->mode) || > - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL)) > + level = iommu_get_mode(domain->pt_root); > + > + if (address <= PM_LEVEL_SIZE(level) || > + WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL)) > goto out; > > pte = (void *)get_zeroed_page(gfp); > if (!pte) > goto out; > > - *pte = PM_LEVEL_PDE(domain->mode, > - iommu_virt_to_phys(domain->pt_root)); > - domain->pt_root = pte; > - domain->mode += 1; > + *pte = PM_LEVEL_PDE(level, > + iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root))); > + > + WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1); > > ret = true; > > @@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain, > bool *updated) > { > int level, end_lvl; > - u64 *pte, *page; > + u64 *pte, *page, *pt_root, *root; > > BUG_ON(!is_power_of_2(page_size)); > > - while (address > 
PM_LEVEL_SIZE(domain->mode)) > + while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root))) > *updated = increase_address_space(domain, address, gfp) || *updated; > > - level = domain->mode - 1; > - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; > + pt_root = READ_ONCE(domain->pt_root); > + root = (void *)iommu_get_root(pt_root); > + level = iommu_get_mode(pt_root) - 1; > + pte = &root[PM_LEVEL_INDEX(level, address)]; > address = PAGE_SIZE_ALIGN(address, page_size); > end_lvl = PAGE_SIZE_LEVEL(page_size); > > @@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain, > unsigned long address, > unsigned long *page_size) > { > - int level; > u64 *pte; > + u64 *pt_root = READ_ONCE(domain->pt_root); > + u64 *root = (void *)iommu_get_root(pt_root); > + int level = iommu_get_mode(pt_root); > > *page_size = 0; > > - if (address > PM_LEVEL_SIZE(domain->mode)) > + if (address > PM_LEVEL_SIZE(level)) > return NULL; > > - level = domain->mode - 1; > - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; > + level--; > + pte = &root[PM_LEVEL_INDEX(level, address)]; > *page_size = PTE_LEVEL_PAGE_SIZE(level); > > while (level > 0) { > @@ -1814,12 +1821,13 @@ static struct protection_domain *dma_ops_domain_alloc(void) > if (protection_domain_init(domain)) > goto free_domain; > > - domain->mode = PAGE_MODE_3_LEVEL; > domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); > domain->flags = PD_DMA_OPS_MASK; > if (!domain->pt_root) > goto free_domain; > > + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_3_LEVEL); > + > if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM) > goto free_domain; > > @@ -1847,10 +1855,10 @@ static void set_dte_entry(u16 devid, struct protection_domain *domain, > u64 flags = 0; > u32 old_domid; > > - if (domain->mode != PAGE_MODE_NONE) > - pte_root = iommu_virt_to_phys(domain->pt_root); > + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) > + pte_root = iommu_virt_to_phys((void 
*)iommu_get_root(domain->pt_root)); > > - pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK) > + pte_root |= ((unsigned long)domain->pt_root & DEV_ENTRY_MODE_MASK) > << DEV_ENTRY_MODE_SHIFT; > pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV; > > @@ -2382,13 +2390,14 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) > if (!pdomain) > return NULL; > > - pdomain->mode = PAGE_MODE_3_LEVEL; > pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); > if (!pdomain->pt_root) { > protection_domain_free(pdomain); > return NULL; > } > > + pdomain->pt_root = iommu_set_mode(pdomain->pt_root, > + PAGE_MODE_3_LEVEL); > pdomain->domain.geometry.aperture_start = 0; > pdomain->domain.geometry.aperture_end = ~0ULL; > pdomain->domain.geometry.force_aperture = true; > @@ -2406,7 +2415,8 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) > if (!pdomain) > return NULL; > > - pdomain->mode = PAGE_MODE_NONE; > + pdomain->pt_root = iommu_set_mode(pdomain->pt_root, > + PAGE_MODE_NONE); > break; > default: > return NULL; > @@ -2435,7 +2445,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom) > dma_ops_domain_free(domain); > break; > default: > - if (domain->mode != PAGE_MODE_NONE) > + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) > free_pagetable(domain); > > if (domain->flags & PD_IOMMUV2_MASK) > @@ -2521,7 +2531,7 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova, > int prot = 0; > int ret; > > - if (domain->mode == PAGE_MODE_NONE) > + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE) > return -EINVAL; > > if (iommu_prot & IOMMU_READ) > @@ -2542,7 +2552,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova, > { > struct protection_domain *domain = to_pdomain(dom); > > - if (domain->mode == PAGE_MODE_NONE) > + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE) > return 0; > > return iommu_unmap_page(domain, iova, page_size); > @@ -2555,7 +2565,7 @@ static 
phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom, > unsigned long offset_mask, pte_pgsize; > u64 *pte, __pte; > > - if (domain->mode == PAGE_MODE_NONE) > + if (iommu_get_mode(domain->pt_root) == PAGE_MODE_NONE) > return iova; > > pte = fetch_pte(domain, iova, &pte_pgsize); > @@ -2713,7 +2723,7 @@ void amd_iommu_domain_direct_map(struct iommu_domain *dom) > spin_lock_irqsave(&domain->lock, flags); > > /* Update data structure */ > - domain->mode = PAGE_MODE_NONE; > + domain->pt_root = iommu_set_mode(domain->pt_root, PAGE_MODE_NONE); > > /* Make changes visible to IOMMUs */ > update_domain(domain); > @@ -2910,7 +2920,7 @@ static int __set_gcr3(struct protection_domain *domain, int pasid, > { > u64 *pte; > > - if (domain->mode != PAGE_MODE_NONE) > + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) > return -EINVAL; > > pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true); > @@ -2926,7 +2936,7 @@ static int __clear_gcr3(struct protection_domain *domain, int pasid) > { > u64 *pte; > > - if (domain->mode != PAGE_MODE_NONE) > + if (iommu_get_mode(domain->pt_root) != PAGE_MODE_NONE) > return -EINVAL; > > pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false); > diff --git a/drivers/iommu/amd_iommu_proto.h b/drivers/iommu/amd_iommu_proto.h > index 92c2ba6468a0..34d4dd66cf9b 100644 > --- a/drivers/iommu/amd_iommu_proto.h > +++ b/drivers/iommu/amd_iommu_proto.h > @@ -67,6 +67,21 @@ static inline int amd_iommu_create_irq_domain(struct amd_iommu *iommu) > extern int amd_iommu_complete_ppr(struct pci_dev *pdev, int pasid, > int status, int tag); > > +static inline int iommu_get_mode(void *pt_root) > +{ > + return (unsigned long)pt_root & ~PAGE_MASK; > +} > + > +static inline unsigned long iommu_get_root(void *pt_root) > +{ > + return (unsigned long)pt_root & PAGE_MASK; > +} > + > +static inline void *iommu_set_mode(void *pt_root, int mode) > +{ > + return (void *)(iommu_get_root(pt_root) + mode); > +} > + > static inline bool 
is_rd890_iommu(struct pci_dev *pdev) > { > return (pdev->vendor == PCI_VENDOR_ID_ATI) && > diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h > index ca8c4522045b..221adefa56a0 100644 > --- a/drivers/iommu/amd_iommu_types.h > +++ b/drivers/iommu/amd_iommu_types.h > @@ -468,8 +468,8 @@ struct protection_domain { > iommu core code */ > spinlock_t lock; /* mostly used to lock the page table*/ > u16 id; /* the domain id written to the device table */ > - int mode; /* paging mode (0-6 levels) */ > - u64 *pt_root; /* page table root pointer */ > + u64 *pt_root; /* page table root pointer and paging mode (0-6 > + levels) */ > int glx; /* Number of levels for GCR3 table */ > u64 *gcr3_tbl; /* Guest CR3 table */ > unsigned long flags; /* flags to find out type of domain */ > -- > 2.21.0 (Apple Git-122.2) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-17 1:42 ` Qian Cai @ 2020-04-18 12:10 ` Joerg Roedel 2020-04-18 13:01 ` Qian Cai 0 siblings, 1 reply; 16+ messages in thread From: Joerg Roedel @ 2020-04-18 12:10 UTC (permalink / raw) To: Qian Cai; +Cc: iommu, linux-kernel On Thu, Apr 16, 2020 at 09:42:41PM -0400, Qian Cai wrote: > So, this is still not enough that would still trigger storage driver offline under > memory pressure for a bit longer. It looks to me that in fetch_pte() there are > could still racy? Yes, your patch still looks racy. You need to atomically read domain->pt_root to a stack variable and derive the pt_root pointer and the mode from that variable instead of domain->pt_root directly. If you read the domain->pt_root twice there could still be an update between the two reads. Probably the lock in increase_address_space() can also be avoided if pt_root is updated using cmpxchg()? Regards, Joerg ^ permalink raw reply [flat|nested] 16+ messages in thread
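Joerg makes two points above: the reader must load pt_root exactly once into
a local variable and derive both the pointer and the mode from that single
snapshot, and the writer could in principle drop the spin-lock by publishing
the new packed word with cmpxchg() and retrying on failure. Below is a
hedged userspace C sketch of both ideas using C11 atomics; the demo_* names
are invented, and this is only an illustration of the cmpxchg() idea floated
here, not the fix that was ultimately merged (which kept domain->lock).

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* One atomic word holding root pointer (page-aligned) + mode (low 3 bits). */
static _Atomic uint64_t demo_pt_root;

/* Reader: one load, then derive both values from the same snapshot. */
static int demo_get_pgtable(uint64_t **root)
{
	uint64_t v = atomic_load(&demo_pt_root);	/* single read */

	*root = (uint64_t *)(uintptr_t)(v & ~4095ULL);
	return (int)(v & 7);
}

/*
 * Writer: lock-free analogue of increase_address_space(). If another
 * thread grew the table between our load and the compare-exchange, the
 * CAS fails and we retry against the fresh value.
 */
static int demo_increase_address_space(void)
{
	for (;;) {
		uint64_t old = atomic_load(&demo_pt_root);
		int mode = (int)(old & 7);
		uint64_t *pte;
		uint64_t newv;

		if (mode >= 6)			/* PAGE_MODE_6_LEVEL: full */
			return 0;

		pte = aligned_alloc(4096, 4096);
		if (!pte)
			return 0;
		pte[0] = old & ~7ULL;		/* stand-in for PM_LEVEL_PDE() */

		newv = ((uint64_t)(uintptr_t)pte) | (uint64_t)(mode + 1);
		if (atomic_compare_exchange_strong(&demo_pt_root, &old, newv))
			return 1;
		free(pte);			/* lost the race; retry */
	}
}
```

Note the reader never touches demo_pt_root twice, which is exactly the bug
Joerg points out in the patch under discussion.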
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()
  2020-04-18 12:10 ` Joerg Roedel
@ 2020-04-18 13:01   ` Qian Cai
  2020-04-18 18:34     ` Joerg Roedel
  0 siblings, 1 reply; 16+ messages in thread
From: Qian Cai @ 2020-04-18 13:01 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

> On Apr 18, 2020, at 8:10 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> Yes, your patch still looks racy. You need to atomically read
> domain->pt_root to a stack variable and derive the pt_root pointer and
> the mode from that variable instead of domain->pt_root directly. If you
> read the domain->pt_root twice there could still be an update between
> the two reads.
> Probably the lock in increase_address_space() can also be avoided if
> pt_root is updated using cmpxchg()?

Hard to tell without testing further. I'll leave that optimization for the
future and focus on fixing those races first.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-18 13:01 ` Qian Cai @ 2020-04-18 18:34 ` Joerg Roedel 2020-04-20 2:07 ` Qian Cai 2020-04-20 13:26 ` Qian Cai 0 siblings, 2 replies; 16+ messages in thread From: Joerg Roedel @ 2020-04-18 18:34 UTC (permalink / raw) To: Qian Cai; +Cc: iommu, linux-kernel On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote: > Hard to tell without testing further. I’ll leave that optimization in > the future, and focus on fixing those races first. Yeah right, we should fix the existing races first before introducing new ones ;) Btw, THANKS A LOT for tracking down all these race condition bugs, I am not even remotely able to trigger them with the hardware I have around. I did some hacking and the attached diff shows how I think this race condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1, but did no further testing. Can you test it please? diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 20cce366e951..28229a38af4d 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -151,6 +151,26 @@ static struct protection_domain *to_pdomain(struct iommu_domain *dom) return container_of(dom, struct protection_domain, domain); } +static void amd_iommu_domain_get_pgtable(struct protection_domain *domain, + struct domain_pgtable *pgtable) +{ + u64 pt_root = atomic64_read(&domain->pt_root); + + pgtable->root = (u64 *)(pt_root & PAGE_MASK); + pgtable->mode = pt_root & 7; /* lowest 3 bits encode pgtable mode */ +} + +static u64 amd_iommu_domain_encode_pgtable(u64 *root, int mode) +{ + u64 pt_root; + + /* lowest 3 bits encode pgtable mode */ + pt_root = mode & 7; + pt_root |= (u64)root; + + return pt_root; +} + static struct iommu_dev_data *alloc_dev_data(u16 devid) { struct iommu_dev_data *dev_data; @@ -1397,13 +1417,18 @@ static struct page *free_sub_pt(unsigned long root, int mode, static void free_pagetable(struct protection_domain *domain) { - unsigned long root = 
(unsigned long)domain->pt_root; + struct domain_pgtable pgtable; struct page *freelist = NULL; + unsigned long root; + + amd_iommu_domain_get_pgtable(domain, &pgtable); + atomic64_set(&domain->pt_root, 0); - BUG_ON(domain->mode < PAGE_MODE_NONE || - domain->mode > PAGE_MODE_6_LEVEL); + BUG_ON(pgtable.mode < PAGE_MODE_NONE || + pgtable.mode > PAGE_MODE_6_LEVEL); - freelist = free_sub_pt(root, domain->mode, freelist); + root = (unsigned long)pgtable.root; + freelist = free_sub_pt(root, pgtable.mode, freelist); free_page_list(freelist); } @@ -1417,24 +1442,28 @@ static bool increase_address_space(struct protection_domain *domain, unsigned long address, gfp_t gfp) { + struct domain_pgtable pgtable; unsigned long flags; bool ret = false; - u64 *pte; + u64 *pte, root; spin_lock_irqsave(&domain->lock, flags); - if (address <= PM_LEVEL_SIZE(domain->mode) || - WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL)) + amd_iommu_domain_get_pgtable(domain, &pgtable); + + if (address <= PM_LEVEL_SIZE(pgtable.mode) || + WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL)) goto out; pte = (void *)get_zeroed_page(gfp); if (!pte) goto out; - *pte = PM_LEVEL_PDE(domain->mode, - iommu_virt_to_phys(domain->pt_root)); - domain->pt_root = pte; - domain->mode += 1; + *pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root)); + + root = amd_iommu_domain_encode_pgtable(pte, pgtable.mode + 1); + + atomic64_set(&domain->pt_root, root); ret = true; @@ -1451,16 +1480,22 @@ static u64 *alloc_pte(struct protection_domain *domain, gfp_t gfp, bool *updated) { + struct domain_pgtable pgtable; int level, end_lvl; u64 *pte, *page; BUG_ON(!is_power_of_2(page_size)); - while (address > PM_LEVEL_SIZE(domain->mode)) + amd_iommu_domain_get_pgtable(domain, &pgtable); + + while (address > PM_LEVEL_SIZE(pgtable.mode)) { *updated = increase_address_space(domain, address, gfp) || *updated; + amd_iommu_domain_get_pgtable(domain, &pgtable); + } + - level = domain->mode - 1; - pte = 
&domain->pt_root[PM_LEVEL_INDEX(level, address)]; + level = pgtable.mode - 1; + pte = &pgtable.root[PM_LEVEL_INDEX(level, address)]; address = PAGE_SIZE_ALIGN(address, page_size); end_lvl = PAGE_SIZE_LEVEL(page_size); @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain, unsigned long address, unsigned long *page_size) { + struct domain_pgtable pgtable; int level; u64 *pte; *page_size = 0; - if (address > PM_LEVEL_SIZE(domain->mode)) + amd_iommu_domain_get_pgtable(domain, &pgtable); + + if (address > PM_LEVEL_SIZE(pgtable.mode)) return NULL; - level = domain->mode - 1; - pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)]; + level = pgtable.mode - 1; + pte = &pgtable.root[PM_LEVEL_INDEX(level, address)]; *page_size = PTE_LEVEL_PAGE_SIZE(level); while (level > 0) { @@ -1806,6 +1844,7 @@ static void dma_ops_domain_free(struct protection_domain *domain) static struct protection_domain *dma_ops_domain_alloc(void) { struct protection_domain *domain; + u64 *pt_root, root; domain = kzalloc(sizeof(struct protection_domain), GFP_KERNEL); if (!domain) @@ -1814,12 +1853,14 @@ static struct protection_domain *dma_ops_domain_alloc(void) if (protection_domain_init(domain)) goto free_domain; - domain->mode = PAGE_MODE_3_LEVEL; - domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); - domain->flags = PD_DMA_OPS_MASK; - if (!domain->pt_root) + pt_root = (void *)get_zeroed_page(GFP_KERNEL); + if (!pt_root) goto free_domain; + root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL); + atomic64_set(&domain->pt_root, root); + domain->flags = PD_DMA_OPS_MASK; + if (iommu_get_dma_cookie(&domain->domain) == -ENOMEM) goto free_domain; @@ -1843,14 +1884,17 @@ static bool dma_ops_domain(struct protection_domain *domain) static void set_dte_entry(u16 devid, struct protection_domain *domain, bool ats, bool ppr) { + struct domain_pgtable pgtable; u64 pte_root = 0; u64 flags = 0; u32 old_domid; - if (domain->mode != PAGE_MODE_NONE) - pte_root = 
iommu_virt_to_phys(domain->pt_root); + amd_iommu_domain_get_pgtable(domain, &pgtable); + + if (pgtable.mode != PAGE_MODE_NONE) + pte_root = iommu_virt_to_phys(pgtable.root); - pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK) + pte_root |= (pgtable.mode & DEV_ENTRY_MODE_MASK) << DEV_ENTRY_MODE_SHIFT; pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V | DTE_FLAG_TV; @@ -2375,6 +2419,7 @@ static struct protection_domain *protection_domain_alloc(void) static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) { struct protection_domain *pdomain; + u64 *pt_root, root; switch (type) { case IOMMU_DOMAIN_UNMANAGED: @@ -2382,13 +2427,15 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) if (!pdomain) return NULL; - pdomain->mode = PAGE_MODE_3_LEVEL; - pdomain->pt_root = (void *)get_zeroed_page(GFP_KERNEL); - if (!pdomain->pt_root) { + pt_root = (void *)get_zeroed_page(GFP_KERNEL); + if (!pt_root) { protection_domain_free(pdomain); return NULL; } + root = amd_iommu_domain_encode_pgtable(pt_root, PAGE_MODE_3_LEVEL); + atomic64_set(&pdomain->pt_root, root); + pdomain->domain.geometry.aperture_start = 0; pdomain->domain.geometry.aperture_end = ~0ULL; pdomain->domain.geometry.force_aperture = true; @@ -2406,7 +2453,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) if (!pdomain) return NULL; - pdomain->mode = PAGE_MODE_NONE; + atomic64_set(&pdomain->pt_root, PAGE_MODE_NONE); break; default: return NULL; @@ -2418,6 +2465,7 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned type) static void amd_iommu_domain_free(struct iommu_domain *dom) { struct protection_domain *domain; + struct domain_pgtable pgtable; domain = to_pdomain(dom); @@ -2435,7 +2483,9 @@ static void amd_iommu_domain_free(struct iommu_domain *dom) dma_ops_domain_free(domain); break; default: - if (domain->mode != PAGE_MODE_NONE) + amd_iommu_domain_get_pgtable(domain, &pgtable); + + if (pgtable.mode != PAGE_MODE_NONE) free_pagetable(domain); if (domain->flags & 
PD_IOMMUV2_MASK) @@ -2518,10 +2568,12 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova, gfp_t gfp) { struct protection_domain *domain = to_pdomain(dom); + struct domain_pgtable pgtable; int prot = 0; int ret; - if (domain->mode == PAGE_MODE_NONE) + amd_iommu_domain_get_pgtable(domain, &pgtable); + if (pgtable.mode == PAGE_MODE_NONE) return -EINVAL; if (iommu_prot & IOMMU_READ) @@ -2541,8 +2593,10 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova, struct iommu_iotlb_gather *gather) { struct protection_domain *domain = to_pdomain(dom); + struct domain_pgtable pgtable; - if (domain->mode == PAGE_MODE_NONE) + amd_iommu_domain_get_pgtable(domain, &pgtable); + if (pgtable.mode == PAGE_MODE_NONE) return 0; return iommu_unmap_page(domain, iova, page_size); @@ -2553,9 +2607,11 @@ static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom, { struct protection_domain *domain = to_pdomain(dom); unsigned long offset_mask, pte_pgsize; + struct domain_pgtable pgtable; u64 *pte, __pte; - if (domain->mode == PAGE_MODE_NONE) + amd_iommu_domain_get_pgtable(domain, &pgtable); + if (pgtable.mode == PAGE_MODE_NONE) return iova; pte = fetch_pte(domain, iova, &pte_pgsize); @@ -2708,16 +2764,26 @@ EXPORT_SYMBOL(amd_iommu_unregister_ppr_notifier); void amd_iommu_domain_direct_map(struct iommu_domain *dom) { struct protection_domain *domain = to_pdomain(dom); + struct domain_pgtable pgtable; unsigned long flags; + u64 pt_root; spin_lock_irqsave(&domain->lock, flags); + /* First save pgtable configuration */ + amd_iommu_domain_get_pgtable(domain, &pgtable); + /* Update data structure */ - domain->mode = PAGE_MODE_NONE; + pt_root = amd_iommu_domain_encode_pgtable(NULL, PAGE_MODE_NONE); + atomic64_set(&domain->pt_root, pt_root); /* Make changes visible to IOMMUs */ update_domain(domain); + /* Restore old pgtable in domain->pt_root to free page-table */ + pt_root = amd_iommu_domain_encode_pgtable(pgtable.root, pgtable.mode); +
atomic64_set(&domain->pt_root, pt_root); + /* Page-table is not visible to IOMMU anymore, so free it */ free_pagetable(domain); @@ -2908,9 +2974,11 @@ static u64 *__get_gcr3_pte(u64 *root, int level, int pasid, bool alloc) static int __set_gcr3(struct protection_domain *domain, int pasid, unsigned long cr3) { + struct domain_pgtable pgtable; u64 *pte; - if (domain->mode != PAGE_MODE_NONE) + amd_iommu_domain_get_pgtable(domain, &pgtable); + if (pgtable.mode != PAGE_MODE_NONE) return -EINVAL; pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, true); @@ -2924,9 +2992,11 @@ static int __set_gcr3(struct protection_domain *domain, int pasid, static int __clear_gcr3(struct protection_domain *domain, int pasid) { + struct domain_pgtable pgtable; u64 *pte; - if (domain->mode != PAGE_MODE_NONE) + amd_iommu_domain_get_pgtable(domain, &pgtable); + if (pgtable.mode != PAGE_MODE_NONE) return -EINVAL; pte = __get_gcr3_pte(domain->gcr3_tbl, domain->glx, pasid, false); diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h index ca8c4522045b..7a8fdec138bd 100644 --- a/drivers/iommu/amd_iommu_types.h +++ b/drivers/iommu/amd_iommu_types.h @@ -468,8 +468,7 @@ struct protection_domain { iommu core code */ spinlock_t lock; /* mostly used to lock the page table*/ u16 id; /* the domain id written to the device table */ - int mode; /* paging mode (0-6 levels) */ - u64 *pt_root; /* page table root pointer */ + atomic64_t pt_root; /* pgtable root and pgtable mode */ int glx; /* Number of levels for GCR3 table */ u64 *gcr3_tbl; /* Guest CR3 table */ unsigned long flags; /* flags to find out type of domain */ @@ -477,6 +476,12 @@ struct protection_domain { unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */ }; +/* For decoded pt_root */ +struct domain_pgtable { + int mode; + u64 *root; +}; + /* * Structure where we save information about one hardware AMD IOMMU in the * system. ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-18 18:34 ` Joerg Roedel @ 2020-04-20 2:07 ` Qian Cai 2020-04-20 13:26 ` Qian Cai 1 sibling, 0 replies; 16+ messages in thread From: Qian Cai @ 2020-04-20 2:07 UTC (permalink / raw) To: Joerg Roedel; +Cc: iommu, LKML > On Apr 18, 2020, at 2:34 PM, Joerg Roedel <joro@8bytes.org> wrote: > > On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote: >> Hard to tell without testing further. I’ll leave that optimization in >> the future, and focus on fixing those races first. > > Yeah right, we should fix the existing races first before introducing > new ones ;) > > Btw, THANKS A LOT for tracking down all these race condition bugs, I am > not even remotely able to trigger them with the hardware I have around. > > I did some hacking and the attached diff shows how I think this race > condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1, > but did no further testing. Can you test it please? Sure, give it a few days to see if it could survive. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-18 18:34 ` Joerg Roedel 2020-04-20 2:07 ` Qian Cai @ 2020-04-20 13:26 ` Qian Cai 2020-04-29 8:47 ` Joerg Roedel 2020-04-29 11:20 ` Joerg Roedel 1 sibling, 2 replies; 16+ messages in thread From: Qian Cai @ 2020-04-20 13:26 UTC (permalink / raw) To: Joerg Roedel; +Cc: iommu, linux-kernel > On Apr 18, 2020, at 2:34 PM, Joerg Roedel <joro@8bytes.org> wrote: > > On Sat, Apr 18, 2020 at 09:01:35AM -0400, Qian Cai wrote: >> Hard to tell without testing further. I’ll leave that optimization in >> the future, and focus on fixing those races first. > > Yeah right, we should fix the existing races first before introducing > new ones ;) > > Btw, THANKS A LOT for tracking down all these race condition bugs, I am > not even remotely able to trigger them with the hardware I have around. > > I did some hacking and the attached diff shows how I think this race > condition needs to be fixed. I boot-tested this fix on-top of v5.7-rc1, > but did no further testing. Can you test it please? No dice. There could be some other races. For example, > @@ -1536,16 +1571,19 @@ static u64 *fetch_pte(struct protection_domain *domain, > unsigned long address, > unsigned long *page_size) ... > amd_iommu_domain_get_pgtable(domain, &pgtable); > > if (address > PM_LEVEL_SIZE(pgtable.mode)) > return NULL; > > level = pgtable.mode - 1; > pte = &pgtable.root[PM_LEVEL_INDEX(level, address)]; <— increase_address_space() > *page_size = PTE_LEVEL_PAGE_SIZE(level); > while (level > 0) { *page_size = PTE_LEVEL_PAGE_SIZE(level); Then in iommu_unmap_page(), while (unmapped < page_size) { pte = fetch_pte(dom, bus_addr, &unmap_size); … bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size; bus_addr would be out of sync with dom->mode when it enters fetch_pte() again. Could that be a problem? 
[ 5159.274829][ T4057] LTP: starting oom02 [ 5160.382787][ C52] perf: interrupt took too long (7443 > 6208), lowering kernel.perf_event_max_sample_rate to 26800 [ 5167.951785][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc0000 flags=0x0010] [ 5167.964540][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1000 flags=0x0010] [ 5167.977442][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1900 flags=0x0010] [ 5167.989901][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc1d00 flags=0x0010] [ 5168.002291][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2000 flags=0x0010] [ 5168.014665][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2400 flags=0x0010] [ 5168.027132][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2800 flags=0x0010] [ 5168.039566][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc2c00 flags=0x0010] [ 5168.051956][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3000 flags=0x0010] [ 5168.064310][ T812] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xfffffffffffc3400 flags=0x0010] [ 5168.076652][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3800 flags=0x0010] [ 5168.088290][ T812] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xfffffffffffc3c00 flags=0x0010] [ 5183.695829][ C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 [ 5183.704390][ C8] smartpqi 0000:23:00.0: controller offline [ 5183.756594][ C101] blk_update_request: I/O error, dev sda, sector 
22306304 op 0x1:(WRITE) flags 0x8000000 phys_seg 4 prio class 0 [ 5183.756628][ C34] sd 0:1:0:0: [sda] tag#655 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756759][ C56] blk_update_request: I/O error, dev sda, sector 58480128 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756810][ C79] sd 0:1:0:0: [sda] tag#234 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756816][ C121] sd 0:1:0:0: [sda] tag#104 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756837][ T53] sd 0:1:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.756882][ C121] sd 0:1:0:0: [sda] tag#104 CDB: opcode=0x2a 2a 00 00 4d d4 00 00 02 00 00 [ 5183.756892][ C79] sd 0:1:0:0: [sda] tag#234 CDB: opcode=0x2a 2a 00 02 03 e4 00 00 02 00 00 [ 5183.756909][ C121] blk_update_request: I/O error, dev sda, sector 5100544 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756920][ C79] blk_update_request: I/O error, dev sda, sector 33809408 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756939][ T53] sd 0:1:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 02 4b f8 00 00 02 00 00 [ 5183.756967][ T53] blk_update_request: I/O error, dev sda, sector 38533120 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.756989][ C49] blk_update_request: I/O error, dev sda, sector 30181376 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.757045][ C51] sd 0:1:0:0: [sda] tag#452 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.757107][ C51] sd 0:1:0:0: [sda] tag#452 CDB: opcode=0x2a 2a 00 02 95 06 00 00 02 00 00 [ 5183.757125][ C51] blk_update_request: I/O error, dev sda, sector 43320832 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.757199][ C82] blk_update_request: I/O error, dev sda, sector 10187776 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.757209][ C109] blk_update_request: I/O error, dev sda, sector 
29812736 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.757215][ C77] sd 0:1:0:0: [sda] tag#849 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.757222][ C110] sd 0:1:0:0: [sda] tag#558 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.757237][ C92] blk_update_request: I/O error, dev sda, sector 6410240 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 [ 5183.757244][ C91] sd 0:1:0:0: [sda] tag#73 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.757251][ C68] sd 0:1:0:0: [sda] tag#416 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.757458][ C77] sd 0:1:0:0: [sda] tag#849 CDB: opcode=0x2a 2a 00 02 78 a4 00 00 02 00 00 [ 5183.757467][ C110] sd 0:1:0:0: [sda] tag#558 CDB: opcode=0x2a 2a 00 03 58 94 00 00 02 00 00 [ 5183.757515][ C122] sd 0:1:0:0: [sda] tag#747 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=15s [ 5183.757525][ C68] sd 0:1:0:0: [sda] tag#416 CDB: opcode=0x2a 2a 00 01 0e 32 00 00 02 00 00 [ 5183.757536][ C91] sd 0:1:0:0: [sda] tag#73 CDB: opcode=0x2a 2a 00 01 a2 86 00 00 02 00 00 [ 5183.757727][ C122] sd 0:1:0:0: [sda] tag#747 CDB: opcode=0x2a 2a 00 02 a7 24 00 00 02 00 00 [ 5183.758530][ T53] Write-error on swap-device (254:1:64823296) [ 5183.758758][ T53] Write-error on swap-device (254:1:35201024) [ 5183.758811][ C105] Write-error on swap-device (254:1:52690944) [ 5183.758959][ C82] Write-error on swap-device (254:1:6856704) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-20 13:26 ` Qian Cai @ 2020-04-29 8:47 ` Joerg Roedel 2020-04-29 11:20 ` Joerg Roedel 1 sibling, 0 replies; 16+ messages in thread From: Joerg Roedel @ 2020-04-29 8:47 UTC (permalink / raw) To: Qian Cai; +Cc: iommu, linux-kernel Hi Qian, On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: > > No dice. There could be some other races. For example, Okay, I think I know what is happening. The increase_address_space() function increases the address space, but does not update the DTE and does not flush the old DTE from the caches. But this needs to happen before domain->pt_root is updated, because otherwise another CPU can come along and map something into the increased address-space which is not yet accessible by the device because the DTE is not updated yet. Regards, Joerg ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-20 13:26 ` Qian Cai 2020-04-29 8:47 ` Joerg Roedel @ 2020-04-29 11:20 ` Joerg Roedel 2020-04-30 1:04 ` Qian Cai 2020-05-03 13:04 ` Qian Cai 1 sibling, 2 replies; 16+ messages in thread From: Joerg Roedel @ 2020-04-29 11:20 UTC (permalink / raw) To: Qian Cai; +Cc: iommu, linux-kernel On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: > No dice. There could be some other races. For example, Can you please test this branch: https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes It has the previous fix in it and a couple more to make sure the device-table is updated and flushed before increase_address_space() updates domain->pt_root. Thanks, Joerg ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-29 11:20 ` Joerg Roedel @ 2020-04-30 1:04 ` Qian Cai 0 siblings, 0 replies; 16+ messages in thread From: Qian Cai @ 2020-04-30 1:04 UTC (permalink / raw) To: Joerg Roedel; +Cc: iommu, linux-kernel > On Apr 29, 2020, at 7:20 AM, Joerg Roedel <joro@8bytes.org> wrote: > > On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: >> No dice. There could be some other races. For example, > > Can you please test this branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes > > It has the previous fix in it and a couple more to make sure the > device-table is updated and flushed before increase_address_space() > updates domain->pt_root. It looks promising, as it has survived a day’s stress testing. I’ll keep it running for a few days to be sure. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-04-29 11:20 ` Joerg Roedel 2020-04-30 1:04 ` Qian Cai @ 2020-05-03 13:04 ` Qian Cai 2020-05-03 18:39 ` Joerg Roedel 1 sibling, 1 reply; 16+ messages in thread From: Qian Cai @ 2020-05-03 13:04 UTC (permalink / raw) To: Joerg Roedel; +Cc: iommu, linux-kernel > On Apr 29, 2020, at 7:20 AM, Joerg Roedel <joro@8bytes.org> wrote: > > On Mon, Apr 20, 2020 at 09:26:12AM -0400, Qian Cai wrote: >> No dice. There could be some other races. For example, > > Can you please test this branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes > > It has the previous fix in it and a couple more to make sure the > device-table is updated and flushed before increase_address_space() > updates domain->pt_root. I believe this closed the existing races as it had survived for many days. Great work! ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-05-03 13:04 ` Qian Cai @ 2020-05-03 18:39 ` Joerg Roedel 2020-05-03 19:12 ` Qian Cai 0 siblings, 1 reply; 16+ messages in thread From: Joerg Roedel @ 2020-05-03 18:39 UTC (permalink / raw) To: Qian Cai; +Cc: iommu, linux-kernel Hi Qian, On Sun, May 03, 2020 at 09:04:03AM -0400, Qian Cai wrote: > > On Apr 29, 2020, at 7:20 AM, Joerg Roedel <joro@8bytes.org> wrote: > > Can you please test this branch: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=amd-iommu-fixes > > > > It has the previous fix in it and a couple more to make sure the > > device-table is updated and flushed before increase_address_space() > > updates domain->pt_root. > > I believe this closed the existing races as it had survived for many > days. Great work! Thanks a lot for testing these changes! Can I add your Tested-by when I send them to the mailing list tomorrow? Regards, Joerg ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte() 2020-05-03 18:39 ` Joerg Roedel @ 2020-05-03 19:12 ` Qian Cai 0 siblings, 0 replies; 16+ messages in thread From: Qian Cai @ 2020-05-03 19:12 UTC (permalink / raw) To: Joerg Roedel; +Cc: iommu, linux-kernel > On May 3, 2020, at 2:39 PM, Joerg Roedel <joro@8bytes.org> wrote: > > Can I add your Tested-by when I > send them to the mailing list tomorrow? Sure. Feel free to add, Tested-by: Qian Cai <cai@lca.pw> ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2020-05-03 19:12 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-04-07 2:12 [RFC PATCH] iommu/amd: fix a race in fetch_pte() Qian Cai 2020-04-07 15:36 ` Qian Cai 2020-04-08 14:19 ` Joerg Roedel 2020-04-14 1:36 ` Qian Cai 2020-04-17 1:42 ` Qian Cai 2020-04-18 12:10 ` Joerg Roedel 2020-04-18 13:01 ` Qian Cai 2020-04-18 18:34 ` Joerg Roedel 2020-04-20 2:07 ` Qian Cai 2020-04-20 13:26 ` Qian Cai 2020-04-29 8:47 ` Joerg Roedel 2020-04-29 11:20 ` Joerg Roedel 2020-04-30 1:04 ` Qian Cai 2020-05-03 13:04 ` Qian Cai 2020-05-03 18:39 ` Joerg Roedel 2020-05-03 19:12 ` Qian Cai