From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: John Garry <john.garry@huawei.com>, Will Deacon <will@kernel.org>,
"Robin Murphy" <robin.murphy@arm.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
Marc Zyngier <maz@kernel.org>, Linuxarm <linuxarm@huawei.com>,
Ming Lei <ming.lei@redhat.com>,
"iommu@lists.linux-foundation.org"
<iommu@lists.linux-foundation.org>,
"alexandru.elisei@arm.com" <alexandru.elisei@arm.com>
Subject: RE: arm-smmu-v3 high cpu usage for NVMe
Date: Mon, 25 May 2020 05:57:03 +0000 [thread overview]
Message-ID: <B926444035E5E2439431908E3842AFD24B86DB@DGGEMI525-MBS.china.huawei.com> (raw)
In-Reply-To: <4aacbd65-f1fd-55e0-c8bb-468badc9f600@huawei.com>
> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
>
> On 20/03/2020 10:41, John Garry wrote:
>
> + Barry, Alexandru
>
> >>>>>    PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> >>>>> --------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>>   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> >>>>>   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >>>>>    6.35%  [kernel]  [k] _raw_spin_unlock_irq
> >>>>>    2.65%  [kernel]  [k] get_user_pages_fast
> >>>>>    2.03%  [kernel]  [k] __slab_free
> >>>>>    1.55%  [kernel]  [k] tick_nohz_idle_exit
> >>>>>    1.47%  [kernel]  [k] arm_lpae_map
> >>>>>    1.39%  [kernel]  [k] __fget
> >>>>>    1.14%  [kernel]  [k] __lock_text_start
> >>>>>    1.09%  [kernel]  [k] _raw_spin_lock
> >>>>>    1.08%  [kernel]  [k] bio_release_pages.part.42
> >>>>>    1.03%  [kernel]  [k] __sbitmap_get_word
> >>>>>    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> >>>>>    0.91%  [kernel]  [k] fput_many
> >>>>>    0.88%  [kernel]  [k] __arm_lpae_map
> >>>>>
>
>
> Hi Will, Robin,
>
> I'm just getting around to looking at this topic again. Here's the
> current picture for my NVMe test:
>
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
>
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
>
> Colleague Barry has similar findings for some other scenarios.
I wrote a test module that uses the parameter "ways" to simulate how busy the
SMMU is, and compared latency under different degrees of contention.
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/dma-mapping.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
With ways=1, the SMMU is quite idle with only one user, and
arm_smmu_cmdq_issue_cmdlist() spends more than 60% of its time in
arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports the completion of
CMD_SYNC quite slowly.
As I increased "ways", contention rose rapidly. With ways=16, more than 40%
of the time is spent in:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)
With ways=64, more than 60% of the time is spent on the same cmpxchg.
Here is a table of whole dma_unmap latency, time spent in
arm_smmu_cmdq_issue_cmdlist(), and time spent waiting for CMD_SYNC, for
different values of "ways":

          whole unmap (ns)   arm_smmu_cmdq_issue_cmdlist() (ns)   wait CMD_SYNC (ns)
ways=1          1956                      1328                            883
ways=16         8891                      7474                           4000
ways=32        22043                     19519                           6879
ways=64        60842                     55895                          16746
ways=96       101880                     93649                          24429
As you can see, even with ways=1 we still need about 2us to unmap;
arm_smmu_cmdq_issue_cmdlist() takes more than 60% of the dma_unmap time, and
CMD_SYNC takes more than 60% of the arm_smmu_cmdq_issue_cmdlist() time.
When the SMMU is very busy, dma_unmap latency can exceed 100us due to
contention.
Thanks
Barry
>
> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots when running some NVMe traffic (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt):
>
> - initial cmpxchg to get a place in the queue
> - when more CPUs get involved, we start failing at an exponential rate
>
>   arm_smmu_cmdq_issue_cmdlist():
>      0.00 : ffff8000107a3500: cas  x4, x2, [x27]
>     26.52 : ffff8000107a3504: mov  x0, x4
>
> - the queue locking
> - polling cmd_sync
>
> Some ideas to optimise:
>
> a. initial cmpxchg
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks: after the initial cmpxchg fails, each CPU spins on its own copy
> of arm_smmu_cmdq.q, is released by its leader, and in turn releases
> subsequent followers.
>
> b. Drop the queue_full checking in certain circumstances
> If we cannot theoretically fill the queue, then stop the checking for
> queue full or similar. This should also help current problem of a., as
> the less time between cmpxchg, the less chance of failing (as we check
> queue available space between cmpxchg attempts).
>
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we
> always issue a cmd_sync for a batch (regardless of whether requested),
> then we should never fill (I think).
>
> c. Don't do queue locking in certain circumstances
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
>
> d. More minor ideas: move forward as soon as the "owner" stops gathering,
> to reduce the time spent advancing prod and hopefully the cmd_sync polling
> time; and use a smaller word size for the valid bitmap operations - 32b
> atomic operations may be more efficient overall than 64b, since the valid
> range to check is mostly < 16 bits from my observation.
>
> Let me know your thoughts or any other ideas.
>
> Thanks,
> John
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu