From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: John Garry <john.garry@huawei.com>, Will Deacon <will@kernel.org>,
	"Robin Murphy" <robin.murphy@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Marc Zyngier <maz@kernel.org>, Linuxarm <linuxarm@huawei.com>,
	Ming Lei <ming.lei@redhat.com>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	"alexandru.elisei@arm.com" <alexandru.elisei@arm.com>
Subject: RE: arm-smmu-v3 high cpu usage for NVMe
Date: Mon, 25 May 2020 05:57:03 +0000	[thread overview]
Message-ID: <B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com> (raw)
In-Reply-To: <4aacbd65-f1fd-55e0-c8bb-468badc9f600@huawei.com>

> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
> 
> On 20/03/2020 10:41, John Garry wrote:
> 
> + Barry, Alexandru
> 
> >>>>>     PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> >>>>> --------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>>       27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
> >>>>>       11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
> >>>>>        6.35%  [kernel]          [k] _raw_spin_unlock_irq
> >>>>>        2.65%  [kernel]          [k] get_user_pages_fast
> >>>>>        2.03%  [kernel]          [k] __slab_free
> >>>>>        1.55%  [kernel]          [k] tick_nohz_idle_exit
> >>>>>        1.47%  [kernel]          [k] arm_lpae_map
> >>>>>        1.39%  [kernel]          [k] __fget
> >>>>>        1.14%  [kernel]          [k] __lock_text_start
> >>>>>        1.09%  [kernel]          [k] _raw_spin_lock
> >>>>>        1.08%  [kernel]          [k] bio_release_pages.part.42
> >>>>>        1.03%  [kernel]          [k] __sbitmap_get_word
> >>>>>        0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
> >>>>>        0.91%  [kernel]          [k] fput_many
> >>>>>        0.88%  [kernel]          [k] __arm_lpae_map
> >>>>>
> 
> Hi Will, Robin,
> 
> I'm just getting around to look at this topic again. Here's the current
> picture for my NVMe test:
> 
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
> 
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
> 
> Colleague Barry has similar findings for some other scenarios.

I wrote a test module that uses the parameter "ways" to control how busy the SMMU is, and compared the latency under different degrees of contention:
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/jiffies.h>
#include <linux/dma-mapping.h>

/* number of concurrent mapping threads, i.e. degree of SMMU contention */
static int ways = 16;
module_param(ways, int, S_IRUGO);

/* how long each thread keeps hammering dma_map/unmap, in seconds */
static int seconds = 120;
module_param(seconds, int, S_IRUGO);

/* helper provided elsewhere that returns the ZIP accelerator's device */
extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	/* map + unmap a 4KB buffer in a tight loop for "seconds" seconds */
	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	/* spawn "ways" threads that all contend on the same SMMU command queue */
	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
	/* the test threads exit by themselves after "seconds" */
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
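
For reference, loading it looks something like this (the .ko name is just what the module happens to be built as; adjust for your tree):

	insmod map_test.ko ways=64 seconds=120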

While ways=1, the SMMU is quite free since it has only one user, yet arm_smmu_cmdq_issue_cmdlist() spends more than 60% of its time in arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports the completion of CMD_SYNC quite slowly.
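
Conceptually, that wait boils down to spinning until the SMMU's consumer pointer has passed the slot holding our CMD_SYNC. A rough sketch only, not the driver code (struct test_cmdq and queue_cons_consumed() are made-up names; the real driver also supports MSI-based completion and wfe-based waiting):

/* Illustrative only: wait for the consumer index to pass our CMD_SYNC. */
static int wait_for_cmd_sync(struct test_cmdq *q, u32 sync_idx)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(100);	/* arbitrary bound */

	while (!queue_cons_consumed(q, sync_idx)) {	/* hypothetical helper */
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;
		cpu_relax();
	}
	return 0;
}

So even uncontended, each dma_unmap has to push a TLB invalidation plus a CMD_SYNC into the queue and then eat this round trip to the hardware.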

When I increased "ways", contention grew rapidly. With ways=16, more than 40% of the time is spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

With ways=64, more than 60% of the time is spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)
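
For context, that cmpxchg is the head of the loop with which each CPU reserves queue space for its command batch. Very roughly (a simplified sketch based on the expression above, not a copy of the driver; queue_inc_prod_n() stands in for the real space accounting):

/*
 * Each CPU re-reads the shared 64-bit prod/cons word, computes the new
 * head for its batch, and retries the cmpxchg until it wins.  Under
 * contention the cacheline holding cmdq->q.llq bounces between CPUs and
 * the failure rate climbs quickly.
 */
do {
	llq.val = READ_ONCE(cmdq->q.llq.val);
	/* ... back off here if the queue looks full ... */
	head = llq;
	queue_inc_prod_n(&head, n + sync);	/* reserve n commands (+ optional CMD_SYNC) */
} while (cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val) != llq.val);

Every failed attempt is another shared-cacheline access, which matches how quickly the numbers below grow with "ways".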

Here is a table of whole dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC-wait latencies (in ns) for different values of "ways":
          whole dma_unmap (ns)   arm_smmu_cmdq_issue_cmdlist() (ns)   wait for CMD_SYNC (ns)
Ways=1                    1956                                  1328                      883
Ways=16                   8891                                  7474                     4000
Ways=32                  22043                                 19519                     6879
Ways=64                  60842                                 55895                    16746
Ways=96                 101880                                 93649                    24429

As you can see, even with ways=1 an unmap still takes about 2us; arm_smmu_cmdq_issue_cmdlist() accounts for roughly two thirds of the dma_unmap time, and waiting for CMD_SYNC in turn accounts for roughly two thirds of the arm_smmu_cmdq_issue_cmdlist() time.

When the SMMU is very busy, the dma_unmap latency becomes very large due to this contention - more than 100us at ways=96.
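
For what it's worth, the per-call numbers above can be reproduced with a simple ktime wrapper around dma_unmap; a minimal sketch (illustrative only, names and the averaging are my own here):

static u64 measure_unmap_ns(struct device *dev, void *buf, int size, int iters)
{
	ktime_t start;
	u64 total = 0;
	int i;

	/* average the cost of dma_unmap_single() over many iterations */
	for (i = 0; i < iters; i++) {
		dma_addr_t dma = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

		start = ktime_get();
		dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
		total += ktime_to_ns(ktime_sub(ktime_get(), start));
	}
	return total / iters;
}

Breaking the time down further inside arm_smmu_cmdq_issue_cmdlist() needs tracing rather than a wrapper like this.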


Thanks
Barry

> 
> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
> when running some NVMe traffic:
> 
> - initial cmpxchg to get a place in the queue
> 	- when more CPUs get involved, we start failing at an exponential rate
> 0.00 :        ffff8000107a3500:       cas     x4, x2, [x27]
> 26.52 :        ffff8000107a3504:       mov     x0, x4 :
> arm_smmu_cmdq_issue_cmdlist():
> 
> - the queue locking
> - polling cmd_sync
> 
> Some ideas to optimise:
> 
> a. initial cmpxchg
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks, making a CPU spin on own copy of arm_smmu_cmdq.q after
> initial cmpxchg fails, released by its leader, and releasing subsequent
> followers
> 
> b. Drop the queue_full checking in certain circumstances
> If we cannot theoretically fill the queue, then stop the checking for
> queue full or similar. This should also help current problem of a., as
> the less time between cmpxchg, the less chance of failing (as we check
> queue available space between cmpxchg attempts).
> 
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
> always issue a cmd_sync for a batch (regardless of whether requested),
> then we should never fill (I think).
> 
> c. Don't do queue locking in certain circumstances
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
> 
> d. Some more minor ideas: move forward as soon as the "owner" stops
> gathering, to reduce the time spent advancing prod and hopefully the
> cmd_sync polling time; and use a smaller word size for the valid-bitmap
> operations, since 32b atomic operations may be more efficient overall
> than 64b - from my observation the valid range checked is mostly < 16 bits.
> 
> Let me know your thoughts or any other ideas.
> 
> Thanks,
> John


