* [RFC PATCH] iommu/amd: fix a race in increase_address_space()
@ 2019-09-04 21:24 Qian Cai
  2019-09-05 11:43 ` Joerg Roedel
  0 siblings, 1 reply; 4+ messages in thread
From: Qian Cai @ 2019-09-04 21:24 UTC (permalink / raw)
  To: jroedel; +Cc: hch, iommu, don.brace, esc.storagedev, linux-kernel, Qian Cai

When the system is under memory pressure, it can cause disks behind the
"smartpqi" driver to go offline, as shown in the log below. The UBSAN
report indicates that "domain->mode" becomes 7. Further investigation
shows that the only place that increases "domain->mode" is
increase_address_space(), but that function has this check before
actually increasing it:

	if (domain->mode == PAGE_MODE_6_LEVEL)
		/* address space already 64 bit large */
		return false;

This gives a clue that there must be a race between multiple concurrent
threads in increase_address_space().
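
For reference, a shift exponent of 66 lines up with "domain->mode"
having raced past PAGE_MODE_6_LEVEL: with level = domain->mode - 1, the
walk's per-level shift amounts reach 66. A minimal user-space sketch of
that arithmetic (the macro is paraphrased from my reading of
amd_iommu_types.h, so double-check it against the tree):

#include <stdio.h>

/* Paraphrased from amd_iommu_types.h -- verify against the tree. */
#define PM_LEVEL_SHIFT(x)	(((x) * 9) + 3)

int main(void)
{
	/*
	 * Both per-level shift amounts land at 66, past the width of a
	 * 64-bit type, matching the "shift exponent 66" UBSAN reports.
	 */
	printf("page-size shift at level 6 (mode 7): %d\n", 12 + 9 * 6);        /* 66 */
	printf("index shift at level 7 (mode 8):     %d\n", PM_LEVEL_SHIFT(7)); /* 66 */

	return 0;
}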

In amd_iommu_map() and amd_iommu_unmap(), mutex_lock() and
mutex_unlock() are taken around iommu_map_page(). There are several
other places that can call iommu_map_page() with interrupts enabled and
without that locking. One for sure is alloc_coherent(), which calls
__map_single(); a comment there says a lock needs to be held, but none
is taken. I am not sure about map_page() and map_sg(), where my
investigation so far indicates that many of the call sites are in
atomic context.
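
To make the interleaving concrete, here is a stripped-down user-space
sketch of the unlocked check-then-increment (my simplification, not the
driver code itself):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_MODE_6_LEVEL 6	/* documented maximum number of levels */

/* Stripped-down stand-in for the driver's protection_domain. */
struct protection_domain {
	int mode;		/* current number of page-table levels */
};

int main(void)
{
	struct protection_domain d = { .mode = 5 };
	bool cpu_a_grows, cpu_b_grows;

	/*
	 * Interleaving that the unlocked check-then-increment permits:
	 * both CPUs evaluate the PAGE_MODE_6_LEVEL check before either
	 * one has written the new mode.
	 */
	cpu_a_grows = (d.mode != PAGE_MODE_6_LEVEL);	/* CPU A: check passes */
	cpu_b_grows = (d.mode != PAGE_MODE_6_LEVEL);	/* CPU B: check passes */

	if (cpu_a_grows)
		d.mode += 1;	/* CPU A: mode -> 6 */
	if (cpu_b_grows)
		d.mode += 1;	/* CPU B: mode -> 7, one level too many */

	printf("final domain->mode = %d\n", d.mode);
	return 0;
}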

smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
address=0xff702a00 flags=0x0010]
UBSAN: Undefined behaviour in drivers/iommu/amd_iommu.c:1464:29
shift exponent 66 is too large for 64-bit type 'long unsigned int'
CPU: 39 PID: 821 Comm: kswapd4 Tainted: G           O
5.3.0-rc7-next-20190903+ #4
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40
07/10/2019
Call Trace:
 dump_stack+0x62/0x9a
 ubsan_epilogue+0xd/0x3a
smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
address=0xff702c80 flags=0x0010]
 __ubsan_handle_shift_out_of_bounds.cold.12+0x21/0x68
 iommu_map_page.cold.39+0x7f/0x84
 map_sg+0x1ce/0x2f0
 scsi_dma_map+0xc6/0x160
 pqi_raid_submit_scsi_cmd_with_io_request+0x1c3/0x470 [smartpqi]
smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000
address=0xff702900 flags=0x0010]
 pqi_scsi_queue_command+0x7d6/0xeb0 [smartpqi]
 scsi_queue_rq+0x7ee/0x12b0
 __blk_mq_try_issue_directly+0x295/0x3f0
 blk_mq_try_issue_directly+0xad/0x130
 blk_mq_make_request+0xb12/0xee0
 generic_make_request+0x179/0x4a0
 submit_bio+0xaa/0x270
 __swap_writepage+0x8f5/0xba0
 swap_writepage+0x65/0xb0
 pageout.isra.3+0x3e5/0xa00
 shrink_page_list+0x15a0/0x2660
 shrink_inactive_list+0x373/0x770
 shrink_node_memcg+0x4ff/0x1550
 shrink_node+0x123/0x800
 balance_pgdat+0x493/0x9b0
 kswapd+0x39b/0x7f0
 kthread+0x1df/0x200
 ret_from_fork+0x22/0x40
========================================================================
smartpqi 0000:23:00.0: resetting scsi 0:1:0:0
INFO: task khugepaged:797 blocked for more than 122 seconds.
      Tainted: G           O      5.3.0-rc7-next-20190903+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
khugepaged      D22656   797      2 0x80004000
Call Trace:
 __schedule+0x4bb/0xc20
 schedule+0x6c/0x140
 io_schedule+0x21/0x50
 rq_qos_wait+0x18e/0x2b0
 wbt_wait+0x12b/0x1b0
 __rq_qos_throttle+0x3b/0x60
 blk_mq_make_request+0x217/0xee0
 generic_make_request+0x179/0x4a0
 submit_bio+0xaa/0x270
 __swap_writepage+0x8f5/0xba0
 swap_writepage+0x65/0xb0
 pageout.isra.3+0x3e5/0xa00
 shrink_page_list+0x15a0/0x2660
 shrink_inactive_list+0x373/0x770
 shrink_node_memcg+0x4ff/0x1550
 shrink_node+0x123/0x800
 do_try_to_free_pages+0x22f/0x820
 try_to_free_pages+0x242/0x4d0
 __alloc_pages_nodemask+0x9f6/0x1bb0
 khugepaged_alloc_page+0x6f/0xf0
 collapse_huge_page+0x103/0x1060
 khugepaged_scan_pmd+0x840/0xa70
 khugepaged+0x1571/0x18d0
 kthread+0x1df/0x200
 ret_from_fork+0x22/0x40
INFO: task systemd-journal:3112 blocked for more than 123 seconds.
      Tainted: G           O      5.3.0-rc7-next-20190903+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
systemd-journal D20744  3112      1 0x00004120
Call Trace:
 __do_page_cache_readahead+0x149/0x3a0
 filemap_fault+0x9da/0x1050
 __xfs_filemap_fault+0x167/0x450 [xfs]
 xfs_filemap_fault+0x68/0x70 [xfs]
 __do_fault+0x83/0x220
 __handle_mm_fault+0x1034/0x1a50
 handle_mm_fault+0x17f/0x37e
 __do_page_fault+0x369/0x630
 do_page_fault+0x50/0x2d3
 page_fault+0x2f/0x40
RIP: 0033:0x56088e81faa0
Code: Bad RIP value.
RSP: 002b:00007ffdc2158ba8 EFLAGS: 000102_iter+0x135/0x850
 shrink_node+0x123/0x800
 do_try_to_free_pages+0x22f/0x820
 try_to_free_pages+0x242/0x4d0
 __alloc_pages_nodemask+0x9f6/0x1bb0
[11967.13967  __swap_writepage+0x8f5/0xba0
 swap_writepage+0x65/0xb0
 pageout.isra.3+0x3e5/0xa00
 shrink_page_list+0x15a0/0x2660
 io_schedule+0x21/0x50
 rq_qos_wait+0x18e/0x2b0
 wbt_wait+0x12b/0x1b0
 __rq_qos_throttle+0x3b/0x60
 blk_mq_make_request+0x217/0xee0
 schedule+0x6c/0x140
 io_schedule+0x21/0x50
 rq_qos_wait+0x18e/0x2b0
 wbt_wait+0x12b/0x1b0
 __rq_qos_throttle+0x3b/0x60
 blk_mq_make_request+0x217/0xee0
[11968.69198.0+0x60/0x60
 swap_writepage+0x65/0xb0
 pageout.isra.3+0x3e5/0xa00
 shrink_page_list+0x15a0/0x2660
 generic_make_request+0x179/0x4a0
 submit_bio+0xaa/0x270
 __swap_writepage+0x8f5/0xba0
 swap_writepage+0x65/0xb0
 pageout.isra.3+0x3e5/0xa00
 __swap_writepage+0x8f5/0xba0
 swap_writepage+0x65/0xb0
 pageout.isra.3+0x3e5/0xa00
 shrink_page_list+0x15a0/0x2660
 wbt_wait+0x12b/0x1b0
 __rq_qos_throttle+0x3b/0x60
 blk_mq_make_request+0x217/0xee0
 generic_make_request+0x179/0x4a0
 submit_bio+0xaa/0x270
 __swap_writepage+0x8f5/0xba0
1 lock held by khungtaskd/791:
 #0: ffffffffa1b8b360 (rcu_read_lock){....}, at:
debug_show_all_locks+0x33/0x16a
1 lock held by khugepaged/797:
 #0: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by kswapd0/820:
 #0: ffffffffa1cc55c0 (fs_reclaim){....}, at:
__fs_reclaim_acquire+0x5/0x30
 #1: ffff88846fa43538 (&md->io_barrier){....}, at:
dm_get_live_table+0x5/0x70 [dm_mod]
2 locks held by kswapd4/821:
 #0: ffffffffa1cc55c0 (fs_reclaim){....}, at:
__fs_reclaim_acquire+0x5/0x30
 #1: ffff88846fa43538 (&md->io_barrier){....}, at:
dm_get_live_table+0x5/0x70 [dm_mod]
1 lock held by scsi_eh_0/1577:
 #0: ffff8884cb52cca0 (&ctrl_info->lun_reset_mutex){....}, at:
pqi_eh_device_reset_handler+0x313/0x970 [smartpqi]
[11acquire.part.15+0x5/0x30
2 locks held by oom01/121253:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121254:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121255:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121256:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
[11971.6650x5/0x30
2 locks held by oom01/121278:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121279:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121280:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121281:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acqfffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121299:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121300:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121301:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121302:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, 2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121320:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121321:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121322:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121323:
 #0: ffff888338118840 (&om01/121340:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121341:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121342:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121343:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121366:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121367:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121368:
 #0: ffff888338118840 (&mm->mmap_sem#2){....}, at:
__do_page_fault+0x18e/0x630
 #1: ffffffffa1cc55c0 (fs_reclaim){....}, at:
fs_reclaim_acquire.part.15+0x5/0x30
2 locks held by oom01/121369:
smartpqi 0000:23:00.0: no heartbeat detected - last heartbeat count:
11794
smartpqi 0000:23:00.0: controller offline
smartpqi 0000:23:00.0: reset of scsi 0:1:0:0: FAILED
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
========================================================================
sd 0:1:0:0: Device offlined - not ready after error recovery
UBSAN: Undefined behaviour in drivers/iommu/amd_iommu.c:1526:28
shift exponent 66 is too large for 64-bit type 'long unsigned int'
CPU: 8 PID: 49038 Comm: kworker/8:0 Tainted: G           O
5.3.0-rc7-next-20190903+ #4
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40
07/10/2019
sd 0:1:0:0: Device offlined - not ready after error recovery
Workqueue: events pqi_ctrl_offline_worker [smartpqi]
sd 0:1:0:0: Device offlined - not ready after error recovery
Call Trace:
 dump_stack+0x62/0x9a
 ubsan_epilogue+0xd/0x3a
sd 0:1:0:0: Device offlined - not ready after error recovery
 __ubsan_handle_shift_out_of_bounds.cold.12+0x21/0x68
 fetch_pte.cold.23+0x9c/0xcf
 iommu_unmap_page+0xcd/0x180
sd 0:1:0:0: Device offlined - not ready after error recovery
 __unmap_single.isra.18+0x6a/0x120
 unmap_sg+0x74/0x90
 scsi_dma_unmap+0xdc/0x140
sd 0:1:0:0: Device offlined - not ready after error recovery
 pqi_raid_io_complete+0x2d/0x60 [smartpqi]
 pqi_ctrl_offline_worker+0x128/0x270 [smartpqi]
sd 0:1:0:0: Device offlined - not ready after error recovery
 process_one_work+0x53b/0xa70
 worker_thread+0x63/0x5b0
 kthread+0x1df/0x200
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
 ret_from_fork+0x22/0x40
sd 0:1:0:0: Device offlined - not ready after error recovery
========================================================================
sd 0:1:0:0: Device offlined - not ready after error recovery
========================================================================
sd 0:1:0:0: Device offlined - not ready after error recovery
UBSAN: Undefined behaviour in drivers/iommu/amd_iommu.c:1527:16
shift exponent 66 is too large for 64-bit type 'long long unsigned int'
CPU: 8 PID: 49038 Comm: kworker/8:0 Tainted: G           O
5.3.0-rc7-next-20190903+ #4
sd 0:1:0:0: Device offlined - not ready after error recovery
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40
07/10/2019
Workqueue: events pqi_ctrl_offline_worker [smartpqi]
Call Trace:
 dump_stack+0x62/0x9a
sd 0:1:0:0: Device offlined - not ready after error recovery
 ubsan_epilogue+0xd/0x3a
sd 0:1:0:0: Device offlined - not ready after error recovery
 __ubsan_handle_shift_out_of_bounds.cold.12+0x21/0x68
sd 0:1:0:0: Device offlined - not ready after error recovery
 fetch_pte.cold.23+0xca/0xcf
 iommu_unmap_page+0xcd/0x180
 __unmap_single.isra.18+0x6a/0x120
sd 0:1:0:0: Device offlined - not ready after error recovery
 unmap_sg+0x74/0x90
sd 0:1:0:0: Device offlined - not ready after error recovery
 scsi_dma_unmap+0xdc/0x140
 pqi_raid_io_complete+0x2d/0x60 [smartpqi]
 pqi_ctrl_offline_worker+0x128/0x270 [smartpqi]
sd 0:1:0:0: Device offlined - not ready after error recovery
 process_one_work+0x53b/0xa70
 worker_thread+0x63/0x5b0
 kthread+0x1df/0x200
sd 0:1:0:0: Device offlined - not ready after error recovery
sd 0:1:0:0: Device offlined - not ready after error recovery
 ret_from_fork+0x22/0x40
========================================================================
sd 0:1:0:0: rejecting I/O to offline device
sd 0:1:0:0: Device offlined - not ready after error recovery
blk_update_request: I/O error, dev sda, 43208)
Write-error on swap-device (254:1:29843216)
Write-error on swap-device (254:1:29843224)
Write-error on swap-device (254:1:29843232)
Write-error on swap-device (254:1:29843240)
---
 drivers/iommu/amd_iommu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index b607a92791d3..3874a0f5e197 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2660,8 +2660,10 @@ static void *alloc_coherent(struct device *dev, size_t size,
 	if (!dma_mask)
 		dma_mask = *dev->dma_mask;
 
+	mutex_lock(&domain->api_lock);
 	*dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
 				 size, DMA_BIDIRECTIONAL, dma_mask);
+	mutex_unlock(&domain->api_lock);
 
 	if (*dma_addr == DMA_MAPPING_ERROR)
 		goto out_free;
@@ -2696,7 +2698,9 @@ static void free_coherent(struct device *dev, size_t size,
 
 	dma_dom = to_dma_ops_domain(domain);
 
+	mutex_lock(&domain->api_lock);
 	__unmap_single(dma_dom, dma_addr, size, DMA_BIDIRECTIONAL);
+	mutex_unlock(&domain->api_lock);
 
 free_mem:
 	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
-- 
1.8.3.1



* Re: [RFC PATCH] iommu/amd: fix a race in increase_address_space()
  2019-09-04 21:24 [RFC PATCH] iommu/amd: fix a race in increase_address_space() Qian Cai
@ 2019-09-05 11:43 ` Joerg Roedel
  2019-09-05 20:57   ` Qian Cai
  0 siblings, 1 reply; 4+ messages in thread
From: Joerg Roedel @ 2019-09-05 11:43 UTC (permalink / raw)
  To: Qian Cai; +Cc: hch, iommu, don.brace, esc.storagedev, linux-kernel

Hi Qian,

On Wed, Sep 04, 2019 at 05:24:22PM -0400, Qian Cai wrote:
> 	if (domain->mode == PAGE_MODE_6_LEVEL)
> 		/* address space already 64 bit large */
> 		return false;
> 
> This gives a clue that there must be a race between multiple concurrent
> threads in increase_address_space().

Thanks for tracking this down, there is a race indeed.

> +	mutex_lock(&domain->api_lock);
>  	*dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
>  				 size, DMA_BIDIRECTIONAL, dma_mask);
> +	mutex_unlock(&domain->api_lock);
>  
>  	if (*dma_addr == DMA_MAPPING_ERROR)
>  		goto out_free;
> @@ -2696,7 +2698,9 @@ static void free_coherent(struct device *dev, size_t size,
>  
>  	dma_dom = to_dma_ops_domain(domain);
>  
> +	mutex_lock(&domain->api_lock);
>  	__unmap_single(dma_dom, dma_addr, size, DMA_BIDIRECTIONAL);
> +	mutex_unlock(&domain->api_lock);

But I think the right fix is to lock the operation in
increase_address_space() directly, and not the calls around it, like in
the diff below. It is untested, so can you please try it and report back
if it fixes your issue?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index b607a92791d3..1ff705f16239 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1424,18 +1424,21 @@ static void free_pagetable(struct protection_domain *domain)
  * another level increases the size of the address space by 9 bits to a size up
  * to 64 bits.
  */
-static bool increase_address_space(struct protection_domain *domain,
+static void increase_address_space(struct protection_domain *domain,
 				   gfp_t gfp)
 {
+	unsigned long flags;
 	u64 *pte;
 
-	if (domain->mode == PAGE_MODE_6_LEVEL)
+	spin_lock_irqsave(&domain->lock, flags);
+
+	if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
 		/* address space already 64 bit large */
-		return false;
+		goto out;
 
 	pte = (void *)get_zeroed_page(gfp);
 	if (!pte)
-		return false;
+		goto out;
 
 	*pte             = PM_LEVEL_PDE(domain->mode,
 					iommu_virt_to_phys(domain->pt_root));
@@ -1443,7 +1446,10 @@ static bool increase_address_space(struct protection_domain *domain,
 	domain->mode    += 1;
 	domain->updated  = true;
 
-	return true;
+out:
+	spin_unlock_irqrestore(&domain->lock, flags);
+
+	return;
 }
 
 static u64 *alloc_pte(struct protection_domain *domain,


* Re: [RFC PATCH] iommu/amd: fix a race in increase_address_space()
  2019-09-05 11:43 ` Joerg Roedel
@ 2019-09-05 20:57   ` Qian Cai
  2019-09-06  8:56     ` [PATCH] iommu/amd: Fix " Joerg Roedel
  0 siblings, 1 reply; 4+ messages in thread
From: Qian Cai @ 2019-09-05 20:57 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: hch, iommu, don.brace, esc.storagedev, linux-kernel

On Thu, 2019-09-05 at 13:43 +0200, Joerg Roedel wrote:
> Hi Qian,
> 
> On Wed, Sep 04, 2019 at 05:24:22PM -0400, Qian Cai wrote:
> > 	if (domain->mode == PAGE_MODE_6_LEVEL)
> > 		/* address space already 64 bit large */
> > 		return false;
> > 
> > This gives a clue that there must be a race between multiple concurrent
> > threads in increase_address_space().
> 
> Thanks for tracking this down, there is a race indeed.
> 
> > +	mutex_lock(&domain->api_lock);
> >  	*dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
> >  				 size, DMA_BIDIRECTIONAL, dma_mask);
> > +	mutex_unlock(&domain->api_lock);
> >  
> >  	if (*dma_addr == DMA_MAPPING_ERROR)
> >  		goto out_free;
> > @@ -2696,7 +2698,9 @@ static void free_coherent(struct device *dev, size_t size,
> >  
> >  	dma_dom = to_dma_ops_domain(domain);
> >  
> > +	mutex_lock(&domain->api_lock);
> >  	__unmap_single(dma_dom, dma_addr, size, DMA_BIDIRECTIONAL);
> > +	mutex_unlock(&domain->api_lock);
> 
> But I think the right fix is to lock the operation in
> increase_address_space() directly, and not the calls around it, like in
> the diff below. It is untested, so can you please try it and report back
> if it fixes your issue?

Yes, it works great so far.

> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index b607a92791d3..1ff705f16239 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -1424,18 +1424,21 @@ static void free_pagetable(struct protection_domain *domain)
>   * another level increases the size of the address space by 9 bits to a size up
>   * to 64 bits.
>   */
> -static bool increase_address_space(struct protection_domain *domain,
> +static void increase_address_space(struct protection_domain *domain,
>  				   gfp_t gfp)
>  {
> +	unsigned long flags;
>  	u64 *pte;
>  
> -	if (domain->mode == PAGE_MODE_6_LEVEL)
> +	spin_lock_irqsave(&domain->lock, flags);
> +
> +	if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
>  		/* address space already 64 bit large */
> -		return false;
> +		goto out;
>  
>  	pte = (void *)get_zeroed_page(gfp);
>  	if (!pte)
> -		return false;
> +		goto out;
>  
>  	*pte             = PM_LEVEL_PDE(domain->mode,
>  					iommu_virt_to_phys(domain->pt_root));
> @@ -1443,7 +1446,10 @@ static bool increase_address_space(struct protection_domain *domain,
>  	domain->mode    += 1;
>  	domain->updated  = true;
>  
> -	return true;
> +out:
> +	spin_unlock_irqrestore(&domain->lock, flags);
> +
> +	return;
>  }
>  
>  static u64 *alloc_pte(struct protection_domain *domain,


* [PATCH] iommu/amd: Fix race in increase_address_space()
  2019-09-05 20:57   ` Qian Cai
@ 2019-09-06  8:56     ` Joerg Roedel
  0 siblings, 0 replies; 4+ messages in thread
From: Joerg Roedel @ 2019-09-06  8:56 UTC (permalink / raw)
  To: Qian Cai; +Cc: hch, iommu, don.brace, esc.storagedev, linux-kernel

On Thu, Sep 05, 2019 at 04:57:21PM -0400, Qian Cai wrote:
> On Thu, 2019-09-05 at 13:43 +0200, Joerg Roedel wrote:
> > But I think the right fix is to lock the operation in
> > increase_address_space() directly, and not the calls around it, like in
> > the diff below. It is untested, so can you please try it and report back
> > if it fixes your issue?
> 
> Yes, it works great so far.

Thanks for testing! I queued the patch below to the IOMMU tree and will
send it upstream for v5.3 today.

Thanks,

	Joerg

From 754265bcab78a9014f0f99cd35e0d610fcd7dfa7 Mon Sep 17 00:00:00 2001
From: Joerg Roedel <jroedel@suse.de>
Date: Fri, 6 Sep 2019 10:39:54 +0200
Subject: [PATCH] iommu/amd: Fix race in increase_address_space()

After the conversion to lock-less dma-api calls, the
increase_address_space() function can be called without any
locking. Multiple CPUs could potentially race to increase
the address space, leading to invalid domain->mode settings
and invalid page-tables. This has been happening in the wild
under high IO load and memory pressure.

Fix the race by locking this operation. The function is
called infrequently, so this does not introduce
a performance regression in the dma-api path again.

Reported-by: Qian Cai <cai@lca.pw>
Fixes: 256e4621c21a ('iommu/amd: Make use of the generic IOVA allocator')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index f853b96ee547..61de81965c44 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1435,18 +1435,21 @@ static void free_pagetable(struct protection_domain *domain)
  * another level increases the size of the address space by 9 bits to a size up
  * to 64 bits.
  */
-static bool increase_address_space(struct protection_domain *domain,
+static void increase_address_space(struct protection_domain *domain,
 				   gfp_t gfp)
 {
+	unsigned long flags;
 	u64 *pte;
 
-	if (domain->mode == PAGE_MODE_6_LEVEL)
+	spin_lock_irqsave(&domain->lock, flags);
+
+	if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
 		/* address space already 64 bit large */
-		return false;
+		goto out;
 
 	pte = (void *)get_zeroed_page(gfp);
 	if (!pte)
-		return false;
+		goto out;
 
 	*pte             = PM_LEVEL_PDE(domain->mode,
 					iommu_virt_to_phys(domain->pt_root));
@@ -1454,7 +1457,10 @@ static bool increase_address_space(struct protection_domain *domain,
 	domain->mode    += 1;
 	domain->updated  = true;
 
-	return true;
+out:
+	spin_unlock_irqrestore(&domain->lock, flags);
+
+	return;
 }
 
 static u64 *alloc_pte(struct protection_domain *domain,
-- 
2.16.4



