From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: John Garry <john.garry@huawei.com>, Will Deacon <will@kernel.org>,
	"Robin Murphy" <robin.murphy@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Marc Zyngier <maz@kernel.org>, Linuxarm <linuxarm@huawei.com>,
	Ming Lei <ming.lei@redhat.com>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	"alexandru.elisei@arm.com" <alexandru.elisei@arm.com>
Subject: RE: arm-smmu-v3 high cpu usage for NVMe
Date: Mon, 25 May 2020 05:57:03 +0000	[thread overview]
Message-ID: <B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com> (raw)
In-Reply-To: <4aacbd65-f1fd-55e0-c8bb-468badc9f600@huawei.com>

> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
> 
> On 20/03/2020 10:41, John Garry wrote:
> 
> + Barry, Alexandru
> 
> >>>>>     PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> >>>>> --------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>>       27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
> >>>>>       11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
> >>>>>        6.35%  [kernel]          [k] _raw_spin_unlock_irq
> >>>>>        2.65%  [kernel]          [k] get_user_pages_fast
> >>>>>        2.03%  [kernel]          [k] __slab_free
> >>>>>        1.55%  [kernel]          [k] tick_nohz_idle_exit
> >>>>>        1.47%  [kernel]          [k] arm_lpae_map
> >>>>>        1.39%  [kernel]          [k] __fget
> >>>>>        1.14%  [kernel]          [k] __lock_text_start
> >>>>>        1.09%  [kernel]          [k] _raw_spin_lock
> >>>>>        1.08%  [kernel]          [k] bio_release_pages.part.42
> >>>>>        1.03%  [kernel]          [k] __sbitmap_get_word
> >>>>>        0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
> >>>>>        0.91%  [kernel]          [k] fput_many
> >>>>>        0.88%  [kernel]          [k] __arm_lpae_map
> >>>>>
> 
> Hi Will, Robin,
> 
> I'm just getting around to look at this topic again. Here's the current
> picture for my NVMe test:
> 
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
> 
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
> 
> Colleague Barry has similar findings for some other scenarios.

I wrote a test module that uses the parameter "ways" to control how busy the SMMU is, and compared the latency under different degrees of contention:
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/jiffies.h>
#include <linux/dma-mapping.h>

/* number of concurrent mapping threads, i.e. degree of SMMU contention */
static int ways = 16;
module_param(ways, int, S_IRUGO);

/* how long each thread keeps hammering dma_map/unmap, in seconds */
static int seconds = 120;
module_param(seconds, int, S_IRUGO);

/* helper provided elsewhere that returns the ZIP accelerator's device */
extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	/* map + unmap a 4KB buffer in a tight loop for "seconds" seconds */
	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	/* spawn "ways" threads that all contend on the same SMMU command queue */
	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
	/* the test threads exit by themselves after "seconds" */
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
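
For reference, loading it looks something like this (the .ko name is just what the module happens to be built as; adjust for your tree):

	insmod map_test.ko ways=64 seconds=120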

While ways=1, the SMMU is quite free since it has only one user, yet arm_smmu_cmdq_issue_cmdlist() spends more than 60% of its time in arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports the completion of CMD_SYNC quite slowly.
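
Conceptually, that wait boils down to spinning until the SMMU's consumer pointer has passed the slot holding our CMD_SYNC. A rough sketch only, not the driver code (struct test_cmdq and queue_cons_consumed() are made-up names; the real driver also supports MSI-based completion and wfe-based waiting):

/* Illustrative only: wait for the consumer index to pass our CMD_SYNC. */
static int wait_for_cmd_sync(struct test_cmdq *q, u32 sync_idx)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(100);	/* arbitrary bound */

	while (!queue_cons_consumed(q, sync_idx)) {	/* hypothetical helper */
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;
		cpu_relax();
	}
	return 0;
}

So even uncontended, each dma_unmap has to push a TLB invalidation plus a CMD_SYNC into the queue and then eat this round trip to the hardware.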

When I increased "ways", contention grew rapidly. With ways=16, more than 40% of the time is spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

With ways=64, more than 60% of the time is spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)
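
For context, that cmpxchg is the head of the loop with which each CPU reserves queue space for its command batch. Very roughly (a simplified sketch based on the expression above, not a copy of the driver; queue_inc_prod_n() stands in for the real space accounting):

/*
 * Each CPU re-reads the shared 64-bit prod/cons word, computes the new
 * head for its batch, and retries the cmpxchg until it wins.  Under
 * contention the cacheline holding cmdq->q.llq bounces between CPUs and
 * the failure rate climbs quickly.
 */
do {
	llq.val = READ_ONCE(cmdq->q.llq.val);
	/* ... back off here if the queue looks full ... */
	head = llq;
	queue_inc_prod_n(&head, n + sync);	/* reserve n commands (+ optional CMD_SYNC) */
} while (cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val) != llq.val);

Every failed attempt is another shared-cacheline access, which matches how quickly the numbers below grow with "ways".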

Here is a table of whole dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC-wait latencies (in ns) for different values of "ways":
          whole dma_unmap (ns)   arm_smmu_cmdq_issue_cmdlist() (ns)   wait for CMD_SYNC (ns)
Ways=1                    1956                                  1328                      883
Ways=16                   8891                                  7474                     4000
Ways=32                  22043                                 19519                     6879
Ways=64                  60842                                 55895                    16746
Ways=96                 101880                                 93649                    24429

As you can see, even with ways=1 an unmap still takes about 2us; arm_smmu_cmdq_issue_cmdlist() accounts for roughly two thirds of the dma_unmap time, and waiting for CMD_SYNC in turn accounts for roughly two thirds of the arm_smmu_cmdq_issue_cmdlist() time.

When the SMMU is very busy, the dma_unmap latency becomes very large due to this contention - more than 100us at ways=96.
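
For what it's worth, the per-call numbers above can be reproduced with a simple ktime wrapper around dma_unmap; a minimal sketch (illustrative only, names and the averaging are my own here):

static u64 measure_unmap_ns(struct device *dev, void *buf, int size, int iters)
{
	ktime_t start;
	u64 total = 0;
	int i;

	/* average the cost of dma_unmap_single() over many iterations */
	for (i = 0; i < iters; i++) {
		dma_addr_t dma = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

		start = ktime_get();
		dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
		total += ktime_to_ns(ktime_sub(ktime_get(), start));
	}
	return total / iters;
}

Breaking the time down further inside arm_smmu_cmdq_issue_cmdlist() needs tracing rather than a wrapper like this.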


Thanks
Barry

> 
> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
> when running some NVMe traffic:
> 
> - initial cmpxchg to get a place in the queue
> 	- when more CPUs get involved, we start failing at an exponential rate
> 0.00 :        ffff8000107a3500:       cas     x4, x2, [x27]
> 26.52 :        ffff8000107a3504:       mov     x0, x4 :
> arm_smmu_cmdq_issue_cmdlist():
> 
> - the queue locking
> - polling cmd_sync
> 
> Some ideas to optimise:
> 
> a. initial cmpxchg
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks, making a CPU spin on own copy of arm_smmu_cmdq.q after
> initial cmpxchg fails, released by its leader, and releasing subsequent
> followers
> 
> b. Drop the queue_full checking in certain circumstances
> If we cannot theoretically fill the queue, then stop the checking for
> queue full or similar. This should also help current problem of a., as
> the less time between cmpxchg, the less chance of failing (as we check
> queue available space between cmpxchg attempts).
> 
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
> always issue a cmd_sync for a batch (regardless of whether requested),
> then we should never fill (I think).
> 
> c. Don't do queue locking in certain circumstances
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
> 
> d. Some more minor ideas: move forward as soon as the "owner" stops
> gathering, to reduce the time spent advancing prod and hopefully the
> cmd_sync polling time; and use a smaller word size for the valid-bitmap
> operations, since 32b atomic operations may be more efficient overall
> than 64b - from my observation the valid range checked is mostly < 16 bits.
> 
> Let me know your thoughts or any other ideas.
> 
> Thanks,
> John


