From: Kashyap Desai <kashyap.desai@broadcom.com>
To: John Garry <john.garry@huawei.com>,
axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com,
ming.lei@redhat.com, bvanassche@acm.org, hare@suse.de,
don.brace@microsemi.com, Sumit Saxena <sumit.saxena@broadcom.com>,
hch@infradead.org,
Shivasharan Srikanteshwara
<shivasharan.srikanteshwara@broadcom.com>
Cc: chenxiang66@hisilicon.com, linux-block@vger.kernel.org,
linux-scsi@vger.kernel.org, esc.storagedev@microsemi.com,
Hannes Reinecke <hare@suse.com>
Subject: RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
Date: Fri, 24 Apr 2020 22:01:32 +0530 [thread overview]
Message-ID: <d2ae343770a83466b870a33ffae5fa23@mail.gmail.com> (raw)
In-Reply-To: <e0c5a076-9fe5-4401-fd41-97f457888ad3@huawei.com>
> >
> >>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with
> >>> "mq-deadline" is comparable with "none" @ ~2M IOPS. But after a while
> >>> performance drops a lot, to maybe 700K IOPS. Do you have a similar
> >>> experience?
> >>
> >> I am using mq-deadline only for HDDs. I have not tried it on SSDs since
> >> it is not a useful scheduler for SSDs.
> >>
> >
> > I ask as I only have SAS SSDs to test.
> >
> >> I noticed that when I used mq-deadline, the performance drop starts as
> >> I add more drives.
> >> I am running an <fio> script with 64 drives and 64 threads, all threads
> >> bound to the local NUMA node, which has 36 logical cores.
> >> I noticed that the lock contention is in "dd_dispatch_request". I am
> >> not sure why there is no penalty from the same lock in nr_hw_queue = 1
> >> mode.
> >
> > So this could be just a pre-existing issue of exposing multiple queues
> > for SCSI HBAs combined with the mq-deadline iosched. I mean, that's
> > really the only significant change in this series, apart from the shared
> > sbitmap, and, at this point, I don't think that is the issue.
>
> As an experiment, I modified the hisi_sas mainline driver to expose hw
> queues and manage tags itself, and I see the same issue I mentioned:
>
> Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0 iops] [eta
> Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0 iops] [eta
> Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0 iops] [eta
> Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0 iops] [eta
> Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0 iops] [eta
> Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0 iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0 iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0 iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0 iops] [eta 0
>
> The odd time I see "sched: RT throttling activated" around the time the
> throughput falls. I think the issue is the per-queue threaded irq
> handlers consuming too many cycles. With the "none" io scheduler, IOPS is
> flat at around 2M.
>
> >
> >>
> >> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> >> {
> >> 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> >> 	struct request *rq;
> >>
> >> 	spin_lock(&dd->lock);
> >
> > So if multiple hctx's are accessing this lock, then much contention is
> > possible.
> >
> >> 	rq = __dd_dispatch_request(dd);
> >> 	spin_unlock(&dd->lock);
> >>
> >> 	return rq;
> >> }
> >>
> >> Here is perf report -
> >>
> >> - 1.04% 0.99% kworker/18:1H+k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >> 0.99% ret_from_fork
> >> - kthread
> >> - worker_thread
> >> - 0.98% process_one_work
> >> - 0.98% __blk_mq_run_hw_queue
> >> - blk_mq_sched_dispatch_requests
> >> - 0.98% blk_mq_do_dispatch_sched
> >> - 0.97% dd_dispatch_request
> >> + 0.97% queued_spin_lock_slowpath
> >> + 1.04% 0.00% kworker/18:1H+k [kernel.vmlinux] [k] queued_spin_lock_slowpath
> >> + 1.03% 0.95% kworker/19:1H-k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >> + 1.03% 0.00% kworker/19:1H-k [kernel.vmlinux] [k] queued_spin_lock_slowpath
> >> + 1.02% 0.97% kworker/20:1H+k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >> + 1.02% 0.00% kworker/20:1H+k [kernel.vmlinux] [k] queued_spin_lock_slowpath
> >> + 1.01% 0.96% kworker/21:1H+k [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >>
> >
> > I'll try to capture a perf report and compare to mine.
>
> Mine is spending a huge amount of time (circa 33% on a cpu servicing
> completion irqs) in mod_delayed_work_on():
>
> --79.89%--sas_scsi_task_done |
> |--76.72%--scsi_mq_done
> | |
> | --76.53%--blk_mq_complete_request
> | |
> | |--74.81%--scsi_softirq_done
> | | |
> | | --73.91%--scsi_finish_command
> | | |
> | | |--72.11%--scsi_io_completion
> | | | |
> | | | --71.89%--scsi_end_request
> | | | |
> | | | |--40.82%--blk_mq_run_hw_queues
> | | | | |
> | | | | |--35.86%--blk_mq_run_hw_queue
> | | | | | |
> | | | | | --33.59%--__blk_mq_delay_run_hw_queue
> | | | | | |
> | | | | | --33.38%--kblockd_mod_delayed_work_on
> | | | | | |
> | | | | | --33.31%--mod_delayed_work_on
>
> hmmmm...
I did some more experiments. It looks like the issue is present with both the
<none> and <mq-deadline> schedulers. Let me simplify what happens with
ioscheduler = <none>.
Old driver, which has nr_hw_queue = 1, with IOs issued from <fio> at queue
depth = 128: we get 3.1M IOPS in this config. This eventually exhausts the
host can_queue.
Note - very low contention in sbitmap_get():
- 23.58% 0.25% fio [kernel.vmlinux] [k] blk_mq_make_request
- 23.33% blk_mq_make_request
- 21.68% blk_mq_get_request
- 20.19% blk_mq_get_tag
+ 10.08% prepare_to_wait_exclusive
+ 4.51% io_schedule
- 3.59% __sbitmap_queue_get
- 2.82% sbitmap_get
0.86% __sbitmap_get_word
0.75% _raw_spin_lock_irqsave
0.55% _raw_spin_unlock_irqrestore
Driver with the RFC, which has nr_hw_queue = N, with IOs issued from <fio> at
queue depth = 128: we get 2.3M IOPS in this config. This eventually exhausts
the host can_queue.
Note - very high contention in sbitmap_get():
- 42.39% 0.12% fio [kernel.vmlinux] [k] generic_make_request
- 42.27% generic_make_request
- 41.00% blk_mq_make_request
- 38.28% blk_mq_get_request
- 33.76% blk_mq_get_tag
- 30.25% __sbitmap_queue_get
- 29.90% sbitmap_get
+ 9.06% _raw_spin_lock_irqsave
+ 7.94% _raw_spin_unlock_irqrestore
+ 3.86% __sbitmap_get_word
+ 1.78% call_function_single_interrupt
+ 0.67% ret_from_intr
+ 1.69% io_schedule
0.59% prepare_to_wait_exclusive
0.55% __blk_mq_get_tag
In this particular case, I observed alloc_hint = zero entries, which means
sbitmap_get() is not able to find free tags from the hint. That may lead to
contention. This condition does not happen with the nr_hw_queue = 1 (without
RFC) driver.
alloc_hint=
{663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212, 836,
1892, 1669, 2420,
3415, 1904, 512, 3027, 4810, 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915,
3897, 1349, 547, 4, 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097,
418, 4501, 3717,
2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
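To show why a zero hint hurts, below is roughly what the scan in
sbitmap_get() looks like - a simplified sketch based on lib/sbitmap.c of this
kernel (the name sbitmap_get_sketch is mine, and this is not the verbatim
code). The starting word index comes straight from alloc_hint, so once many
CPUs end up with a zero hint they all begin scanning the same shared bitmap
from word 0 and converge on the same words:

static int sbitmap_get_sketch(struct sbitmap *sb, unsigned int alloc_hint,
			      bool round_robin)
{
	unsigned int index = SB_NR_TO_INDEX(sb, alloc_hint);
	unsigned int bit = round_robin ? SB_NR_TO_BIT(sb, alloc_hint) : 0;
	unsigned int i;
	int nr = -1;

	for (i = 0; i < sb->map_nr; i++) {
		/* Scan one word for a free bit, starting at 'bit'. */
		nr = __sbitmap_get_word(&sb->map[index].word,
					sb->map[index].depth, bit,
					!round_robin);
		if (nr != -1) {
			nr += index << sb->shift;	/* absolute tag number */
			break;
		}
		/* Word full: move to the next word and retry from bit 0. */
		bit = 0;
		if (++index >= sb->map_nr)
			index = 0;
	}

	return nr;
}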
Driver with the RFC, which has nr_hw_queue = N, with IOs issued from <fio> at
queue depth = 32: we get 3.1M IOPS in this config. This workload does *not*
exhaust the host can_queue.
- 5.07% 0.14% fio [kernel.vmlinux] [k] generic_make_request
- 4.93% generic_make_request
- 3.61% blk_mq_make_request
- 2.04% blk_mq_get_request
- 1.08% blk_mq_get_tag
- 0.70% __sbitmap_queue_get
0.67% sbitmap_get
In summary, the RFC has a performance bottleneck in sbitmap_get() when the
outstanding IOs per shost are about to exhaust can_queue. Without this RFC
the driver runs with nr_hw_queue = 1 over the same tag space, but that case
is managed very well.
I am not sure why this happens only with the shared host tags. Theoretically
all the hctxs share the same bitmap tags, which is the same situation as
nr_hw_queue = 1, so why is the contention only visible in the shared host-tag
case?
If you want to reproduce this issue, you may have to reduce can_queue in the
hisi_sas driver.
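Something like the sketch below is what I mean - a generic illustration
only, not the actual hisi_sas code (whether the limit lives in the
scsi_host_template or is written to shost->can_queue at probe time is driver
specific), with an arbitrarily low value so that a high queue-depth fio run
exhausts the host-wide tags:

#include <scsi/scsi_host.h>

/* Generic sketch: cap the host-wide tag space for the experiment. */
static struct scsi_host_template example_sht = {
	.name		= "example",
	.can_queue	= 256,	/* arbitrary low value, just to hit tag starvation */
};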
Kashyap
>
> Thanks,
> John