* [RFC] bypass scheduler for no sched is set

From: Kashyap Desai @ 2018-07-04 7:59 UTC
To: linux-block, Jens Axboe, Ming Lei, Omar Sandoval, Christoph Hellwig, Hannes Reinecke, linux-scsi

Hi,

Ming Lei posted the patch series below, and performance improved for the
megaraid_sas driver. I used the same kernel base and figured out some more
possible performance improvements in the block layer. This RFC improves
performance as well as CPU utilization. If this patch fits the design
aspects of blk-mq and scsi-mq, I can convert it into a PATCH and submit
the same/modified version.

https://marc.info/?l=linux-block&m=153062994403732&w=2

Description of change -

Do not insert the request into the software queue if BLK_MQ_F_NO_SCHED is
set. Submit the request from blk_mq_make_request to the low level driver
directly, as depicted through the function call chain below.

blk_mq_try_issue_directly
    __blk_mq_try_issue_directly
        scsi_queue_rq

A low level driver attached to scsi-mq can set BLK_MQ_F_NO_SCHED if it
does not want to benefit from an IO scheduler (e.g. in case of SSDs
connected to an IT/MR controller). In case of HDD drives connected to an
HBA, the driver can avoid setting BLK_MQ_F_NO_SCHED so that the default
elevator is set to mq-deadline.

Setup and performance details are listed below -

I created one R0 VD consisting of 8 SSDs on a MegaRAID adapter.

Without the RFC, IOPS reach 840K and CPU utilization goes up to 11%.
Below is the perf top output:

    5.17%  [kernel]  [k] _raw_spin_lock
    4.62%  [kernel]  [k] try_to_grab_pending
    2.29%  [kernel]  [k] syscall_return_via_sysret
    1.37%  [kernel]  [k] blk_mq_flush_busy_ctxs
    1.29%  [kernel]  [k] kobject_get
    1.27%  fio       [.] axmap_isset
    1.25%  [kernel]  [k] flush_busy_ctx
    1.20%  [kernel]  [k] scsi_dispatch_cmd
    1.18%  [kernel]  [k] blk_mq_get_request
    1.16%  [kernel]  [k] blk_mq_hctx_mark_pending.isra.45
    1.09%  [kernel]  [k] irq_entries_start
    0.94%  [kernel]  [k] del_timer
    0.91%  [kernel]  [k] scsi_softirq_done
    0.90%  [kernel]  [k] sbitmap_any_bit_set
    0.83%  [kernel]  [k] blk_mq_free_request
    0.82%  [kernel]  [k] kobject_put
    0.81%  [sd_mod]  [k] sd_setup_read_write_cmnd
    0.80%  [kernel]  [k] scsi_mq_get_budget
    0.79%  [kernel]  [k] blk_mq_get_tag
    0.70%  [kernel]  [k] blk_mq_dispatch_rq_list
    0.61%  [kernel]  [k] bt_iter
    0.60%  fio       [.] __fio_gettime
    0.59%  [kernel]  [k] blk_mq_complete_request
    0.59%  [kernel]  [k] gup_pgd_range
    0.57%  [kernel]  [k] scsi_queue_rq

After applying the RFC, IOPS reach 1066K and CPU utilization goes up to 6%:

    2.56%  [kernel]  [k] syscall_return_via_sysret
    2.46%  [kernel]  [k] irq_entries_start
    2.43%  [kernel]  [k] kobject_get
    2.40%  [kernel]  [k] bt_iter
    2.16%  fio       [.] axmap_isset
    2.06%  [kernel]  [k] _raw_spin_lock
    1.76%  [kernel]  [k] __audit_syscall_exit
    1.51%  [kernel]  [k] scsi_dispatch_cmd
    1.49%  [kernel]  [k] blk_mq_free_request
    1.49%  [sd_mod]  [k] sd_setup_read_write_cmnd
    1.45%  [kernel]  [k] scsi_softirq_done
    1.32%  [kernel]  [k] switch_mm_irqs_off
    1.28%  [kernel]  [k] scsi_mq_get_budget
    1.22%  [kernel]  [k] blk_mq_check_inflight
    1.13%  [kernel]  [k] kobject_put
    1.11%  fio       [.] __fio_gettime
    0.95%  [kernel]  [k] gup_pgd_range
    0.90%  [kernel]  [k] blk_mq_get_tag
    0.88%  [kernel]  [k] read_tsc
    0.85%  [kernel]  [k] scsi_end_request
    0.85%  fio       [.] get_io_u
    0.84%  [kernel]  [k] lookup_ioctx
    0.81%  [kernel]  [k] blk_mq_complete_request
    0.80%  [kernel]  [k] blk_mq_get_request

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4d1c048..ab27788 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,32 +1811,35 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_insert_flush(rq);
 		blk_mq_run_hw_queue(data.hctx, true);
 	} else if (plug && q->nr_hw_queues == 1) {
-		struct request *last = NULL;
-
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
+		/* bypass scheduler for no sched flag set */
+		if (q->tag_set->flags & BLK_MQ_F_NO_SCHED)
+			blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+		else {
+			struct request *last = NULL;
+			/*
+			 * @request_count may become stale because of schedule
+			 * out, so check the list again.
+			 */
+			if (list_empty(&plug->mq_list))
+				request_count = 0;
+			else if (blk_queue_nomerges(q))
+				request_count = blk_plug_queued_count(q);
+
+			if (!request_count)
+				trace_block_plug(q);
+			else
+				last = list_entry_rq(plug->mq_list.prev);
+
+			if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+			    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+				blk_flush_plug_list(plug, false);
+				trace_block_plug(q);
+			}
-
-		/*
-		 * @request_count may become stale because of schedule
-		 * out, so check the list again.
-		 */
-		if (list_empty(&plug->mq_list))
-			request_count = 0;
-		else if (blk_queue_nomerges(q))
-			request_count = blk_plug_queued_count(q);
-
-		if (!request_count)
-			trace_block_plug(q);
-		else
-			last = list_entry_rq(plug->mq_list.prev);
-
-		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
-		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
-			blk_flush_plug_list(plug, false);
-			trace_block_plug(q);
+			list_add_tail(&rq->queuelist, &plug->mq_list);
 		}
-
-		list_add_tail(&rq->queuelist, &plug->mq_list);
 	} else if (plug && !blk_queue_nomerges(q)) {
 		blk_mq_bio_to_request(rq, bio);
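The decision the hunk above adds to the plug branch can be sketched as a small userspace model. The flag bit and the plug thresholds below are assumptions mirroring kernel values of that era, and `plug_branch` is a hypothetical helper, not code from the patch (the model also folds the `last &&` guard by passing 0 when there is no last request):

```c
#include <assert.h>

/* Userspace model of the patched plug branch in blk_mq_make_request().
 * Constant values are assumptions mirroring the kernel of that era. */
#define BLK_MAX_REQUEST_COUNT 16
#define BLK_PLUG_FLUSH_SIZE   (128 * 1024)
#define BLK_MQ_F_NO_SCHED     (1 << 6)   /* bit position is illustrative */

enum submit_path { ISSUE_DIRECTLY, ADD_TO_PLUG, FLUSH_PLUG_THEN_ADD };

/* Decide what the patched branch does for a request on a single-hw-queue
 * device: NO_SCHED bypasses both the scheduler and the plug list. */
enum submit_path plug_branch(unsigned int tag_set_flags,
                             int request_count,
                             unsigned int last_rq_bytes)
{
	if (tag_set_flags & BLK_MQ_F_NO_SCHED)
		return ISSUE_DIRECTLY;      /* blk_mq_try_issue_directly() */
	if (request_count >= BLK_MAX_REQUEST_COUNT ||
	    last_rq_bytes >= BLK_PLUG_FLUSH_SIZE)
		return FLUSH_PLUG_THEN_ADD; /* blk_flush_plug_list() first */
	return ADD_TO_PLUG;                 /* list_add_tail() onto plug */
}
```

The point of the model is that with BLK_MQ_F_NO_SCHED set, neither the plug list nor the software queue is ever consulted.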
* Re: [RFC] bypass scheduler for no sched is set

From: Ming Lei @ 2018-07-04 9:08 UTC
To: Kashyap Desai
Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig, Hannes Reinecke, linux-scsi, Laurence Oberman

On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> Hi,
>
> Ming Lei posted below patch series and performance improved for
> megaraid_sas driver. I used the same kernel base and figure out some more
> possible performance improvement in block layer. This RFC improves
> performance as well as CPU utilization. If this patch fits the design
> aspect of the blk-mq and scsi-mq, I can convert it into PATCH and submit
> the same/modified version.
>
> https://marc.info/?l=linux-block&m=153062994403732&w=2
>
> Description of change -
>
> Do not insert request into software queue if BLK_MQ_F_NO_SCHED is set.
> Submit request from blk_mq_make_request to low level driver directly as
> depicted through below function call.
>
> blk_mq_try_issue_directly
>     __blk_mq_try_issue_directly
>         scsi_queue_rq

Hi Kashyap,

When I sent you the patches [1] which include 'global tags' support,
MegaRAID was converted to per-node queues and becomes real MQ
(q->nr_hw_queues > 1), so IO should have been issued directly; please see
the branches '(plug && !blk_queue_nomerges(q))' and
'(q->nr_hw_queues > 1 && is_sync)' in blk_mq_make_request().

But your test result does not show any improvement, even though a huge
improvement can be observed on null_blk/scsi_debug. Maybe something is
wrong somewhere, and days ago I talked with Laurence about doing this kind
of test again.

I will double check the 'global tags' patches; meantime, could you or
Laurence help to check whether global tags [2] works in the expected way,
if you'd like to?
[1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
[2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8

> Low level driver attached to scsi.mq can set BLK_MQ_F_NO_SCHED, If they do
> not want benefit from io scheduler (e.a in case of SSDs connected to IT/MR
> controller). In case of HDD drives connected to HBA, driver can avoid
> setting BLK_MQ_F_NO_SCHED so that default elevator is set to mq-deadline.

That might be one way of improving megaraid_sas.

However, even for a modern high performance device such as NVMe, it turns
out that an IO scheduler (such as kyber) may play a big role in improving
latency or throughput.

Thanks,
Ming
* RE: [RFC] bypass scheduler for no sched is set

From: Kashyap Desai @ 2018-07-04 10:37 UTC
To: Ming Lei
Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig, Hannes Reinecke, linux-scsi, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Wednesday, July 4, 2018 2:38 PM
> To: Kashyap Desai
> Cc: linux-block@vger.kernel.org; Jens Axboe; Omar Sandoval; Christoph
> Hellwig; Hannes Reinecke; linux-scsi; Laurence Oberman
> Subject: Re: [RFC] bypass scheduler for no sched is set
>
> On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> > Hi,
> >
> > Ming Lei posted below patch series and performance improved for
> > megaraid_sas driver. I used the same kernel base and figure out some more
> > possible performance improvement in block layer. This RFC improves
> > performance as well as CPU utilization. If this patch fits the design
> > aspect of the blk-mq and scsi-mq, I can convert it into PATCH and submit
> > the same/modified version.
> >
> > https://marc.info/?l=linux-block&m=153062994403732&w=2
> >
> > Description of change -
> >
> > Do not insert request into software queue if BLK_MQ_F_NO_SCHED is set.
> > Submit request from blk_mq_make_request to low level driver directly as
> > depicted through below function call.
> >
> > blk_mq_try_issue_directly
> >     __blk_mq_try_issue_directly
> >         scsi_queue_rq
>
> Hi Kashyap,
>
> When I sent you the patches[1] which include 'global tags' support, MegaRAID
> is converted to per-node queues and becomes real MQ(q->nr_hw_queues > 1), IO
> should have been issued direclty, please see the branch of '(plug &&
> !blk_queue_nomerges(q))' and '(q->nr_hw_queues > 1 && is_sync)' in
> blk_mq_make_request().
I observed this while working on the performance issue with you and from
your earlier 'global tags' patch.

>
> But looks not see any improvement from your test result, even though huge
> improvement can be observed on null_blk/scsi_debug. Maybe somewhere is
> wrong, and days ago I talked with Laurence about doing this kind of test
> again.
>

Ming,

Your patch was the trigger for me to review the block layer changes, as I
did not expect a performance boost from having multiple submission queues
for IT/MR HBAs due to pseudo parallelism via more hctx.

The performance bottleneck is obvious if we have *one* single scsi_device
which can go up to 1M IOPS. If we have more drives in the topology, which
requires more outstanding IOs to hit max performance, we will see global
tags [2] become a bottleneck. In case of global tags [2], hctx to cpu
mapping was just round robin, since we can use the blk-mq-pci APIs.

There is a benefit to keeping nr_hw_queue = 1, as explained below.

More than one nr_hw_queue will reduce tags per hardware context (the more
physical sockets, the more trouble we will have distributing the HBA
can_queue), and it will also not allow any IO scheduler to be attached. We
will end up seeing a performance issue for HDD based setups w.r.t. the
sequential profile. I already worked with upstream, and the block layer
fix was part of the 4.11 kernel. See the link below for more detail.
https://lkml.org/lkml/2017/1/30/381 - To have this fix, we need the
mq-deadline scheduler. This scheduler is not available if we declare
ourselves as multi-hardware-queue.

I reconfirm once again that the above mentioned issue (IO sorting issue)
is only resolved if I use the <mq-deadline> scheduler. It means using
nr_hw_queue > 1 will reintroduce the IO sorting issue.

Ideally, we need nr_hw_queue = 1 to make use of an IO scheduler. The MR
and IT controllers of Broadcom do not want to bypass the IO scheduler all
the time.
If we mark nr_hw_queue > 1 for IT/MR controllers, we will not find any IO
scheduler due to the code below in elevator_init_mq, and we need an IO
scheduler for HDD based storage.

int elevator_init_mq(struct request_queue *q)
{
	struct elevator_type *e;
	int err = 0;

	if (q->nr_hw_queues != 1)
		return 0;

Using the request_queue->tag_set->flags method, we can cherry-pick the IO
scheduler. The block layer will not attach any IO scheduler due to the
code below in blk_mq_init_allocated_queue(). Eventually, it looks better
not to go through the IO scheduler in the submission path based on the
same flag settings.

	if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
		int ret;

		ret = blk_mq_sched_init(q);
		if (ret)
			return ERR_PTR(ret);
	}

> I will double check the 'global tags' patches, meantime could you or
> Laurence help to check if global tags[2] works in expected way if
> you'd like to?
>
> [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8

Yesterday I manually did this, merging your v4.16-rc-host-tags-v5 into a
4.18 branch. For one particular test run, the impact of global tags [2]
and the RFC was the same. The RFC and global tags [2] both use the new
path via blk_mq_try_issue_directly. The performance drop of global tags
[2] will be visible if we have more physical sockets and a single NUMA
node exhausts all nr_tags. Most likely we will see negative performance if
we have a large HDD based setup using global tags [2].

The performance drop due to reduced nr_tags can be completely avoided if
we use the RFC.

Kashyap

> >
> > Low level driver attached to scsi.mq can set BLK_MQ_F_NO_SCHED, If they do
> > not want benefit from io scheduler (e.a in case of SSDs connected to IT/MR
> > controller). In case of HDD drives connected to HBA, driver can avoid
> > setting BLK_MQ_F_NO_SCHED so that default elevator is set to mq-deadline.
>
> That might be one way for improving megaraid_sas.
> However, even for modern high performance device, such as NVMe, turns out
> IO scheduler(such as kyber) may play a big role for improving latency or
> throughput.
>
> Thanks,
> Ming
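The policy discussed in this message — set BLK_MQ_F_NO_SCHED only for SSD-backed volumes so mq-deadline stays available for HDDs — could be sketched driver-side as follows. The helper name and the flag bit values are hypothetical, for illustration only; this is not actual megaraid_sas code:

```c
#include <assert.h>
#include <stdbool.h>

#define BLK_MQ_F_SHOULD_MERGE (1 << 0)  /* illustrative bit positions */
#define BLK_MQ_F_NO_SCHED     (1 << 6)

/* Hypothetical probe-time helper: pick tag_set flags based on whether
 * all attached volumes are non-rotational.  SSD-only topologies skip the
 * elevator entirely; HDDs keep it so mq-deadline can sort IO. */
unsigned int pick_tag_set_flags(bool all_volumes_nonrotational)
{
	unsigned int flags = BLK_MQ_F_SHOULD_MERGE;

	if (all_volumes_nonrotational)
		flags |= BLK_MQ_F_NO_SCHED; /* bypass scheduler + plug */
	return flags;
}
```

A mixed topology would keep the scheduler, since the flag applies to the whole tag set, not per device.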
* Re: [RFC] bypass scheduler for no sched is set

From: Ming Lei @ 2018-07-05 2:28 UTC
To: Kashyap Desai
Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig, Hannes Reinecke, linux-scsi, Laurence Oberman

Hi Kashyap,

On Wed, Jul 04, 2018 at 04:07:45PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, July 4, 2018 2:38 PM
> > To: Kashyap Desai
> > Cc: linux-block@vger.kernel.org; Jens Axboe; Omar Sandoval; Christoph
> > Hellwig; Hannes Reinecke; linux-scsi; Laurence Oberman
> > Subject: Re: [RFC] bypass scheduler for no sched is set
> >
> > On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> > > Hi,
> > >
> > > Ming Lei posted below patch series and performance improved for
> > > megaraid_sas driver. I used the same kernel base and figure out some
> > > more possible performance improvement in block layer. This RFC improves
> > > performance as well as CPU utilization. If this patch fits the design
> > > aspect of the blk-mq and scsi-mq, I can convert it into PATCH and
> > > submit the same/modified version.
> > >
> > > https://marc.info/?l=linux-block&m=153062994403732&w=2
> > >
> > > Description of change -
> > >
> > > Do not insert request into software queue if BLK_MQ_F_NO_SCHED is set.
> > > Submit request from blk_mq_make_request to low level driver directly as
> > > depicted through below function call.
> > >
> > > blk_mq_try_issue_directly
> > >     __blk_mq_try_issue_directly
> > >         scsi_queue_rq
> >
> > Hi Kashyap,
> >
> > When I sent you the patches[1] which include 'global tags' support,
> > MegaRAID is converted to per-node queues and becomes real
> > MQ(q->nr_hw_queues > 1), IO should have been issued direclty, please see
> > the branch of '(plug && !blk_queue_nomerges(q))' and
> > '(q->nr_hw_queues > 1 && is_sync)' in blk_mq_make_request().
>
> I observed this while working on performance issue with you and from your
> earlier patch 'global tags'.
>
> >
> > But looks not see any improvement from your test result, even though huge
> > improvement can be observed on null_blk/scsi_debug. Maybe somewhere is
> > wrong, and days ago I talked with Laurence about doing this kind of test
> > again.
> >
>
> Ming,
>
> Your patch was trigger for me to review block layer changes as I did not
> expected performance boost having multiple submission queue for IT/MR HBA
> due to pseudo parallelism via more hctx.

OK, I guess the driver may not support submitting requests concurrently,
is that right?

>
> Performance bottleneck is obvious, if we have *one* single scsi_device
> which can goes upto 1M IOPS. If we have more number of drives in topology
> which requires more number of outstanding IOs to hit max Performance, we
> will see gloab tag[2] will be a bottleneck. In case of global tag [2],
> hctx to cpu mapping was just round robin since we can use blk-mq-pci APIs.

If I remember correctly, the whole tag count in this megaraid_sas is ~5K,
and in your test there are 8 SSD drives, so in case of a dual socket
system, you still get 2.5K tags for all 8 SSDs. In theory, that is quite
enough to reach each SSD's top performance if the driver's .queuecommand()
doesn't take too much time.
There are at least two benefits with global tags:

1) hctx has NUMA locality, and ctx is accessed with NUMA locality too

2) direct issue in case of 'none'

>
> There are a benefit of keeping nr_hw_queue = 1 as explained below.
>
> More than one nr_hw_queue will reduce tags per hardware context (higher
> the physical socket we will have more trouble of distributing HBA
> can_queue) and also it will not allow any IO scheduler to be attached. We

Right, if there are too many NUMA nodes, don't expect this HBA to work
efficiently, since it only has a single tag space among all nodes & CPUs.

And 2 or 4 nodes should be more common; you still get >1K tags for one
single hw queue in case of 4 nodes, which does not look too low.

> will end up seeing performance issue for HDD based setup w.r.t sequential
> profile. I already worked with upstream and block layer fix was part of
> 4.11 kernel. See below link for more detail.
> https://lkml.org/lkml/2017/1/30/381 - To have this fix, we need
> mq-deadline scheduler. This scheduler is not available if we call our self
> as multi-hardware queue.
>
> I reconfirm once again that above mentioned issue (IO sorting issue) is
> only resolved if I use <mq-deadline> scheduler. It means using nr_hw_queue
> > 1 will reintroduce IO sorting issue.
>

But all your current tests are with the 'none' IO scheduler instead of
mq-deadline, right?

> Ideally, we need nr_hw_queue = 1 to get use of io scheduler. MR and IT
> controller of Broadcom do not want to by-pass IO scheduler all the time.

You may set an IO scheduler in case of 'nr_hw_queue > 1'; please see
__blk_mq_try_issue_directly(), in which the request will be inserted into
the scheduler queue if 'q->elevator' isn't NULL.

>
> If we mark nr_hw_queue > 1 for IT/MR controller, we will not find any IO
> scheduler due to below code @ elevator_init_mq and we need io scheduler
> for HDD based storages.
>
> int elevator_init_mq(struct request_queue *q)
> {
>         struct elevator_type *e;
>         int err = 0;
>
>         if (q->nr_hw_queues != 1)
>                 return 0;

You may switch the IO scheduler via /sys/block/sdN/queue/scheduler in the
real MQ case.

>
> Using request_queue->tag_set->flags method, we can cherry pick IO
> scheduler. Block layer will not attach any IO scheduler due to below code
> @ blk_mq_init_allocated_queue().
> Eventually, it looks better not to go through IO scheduler in submission
> path based on same flag settings.
>
>         if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
>                 int ret;
>
>                 ret = blk_mq_sched_init(q);
>                 if (ret)
>                         return ERR_PTR(ret);
>         }

Usually BLK_MQ_F_NO_SCHED is set for admin queues, and if you take this
approach, no IO scheduler can be applied on this queue any more.

>
> >
> > I will double check the 'global tags' patches, meantime could you or
> > Laurence help to check if global tags[2] works in expected way if
> > you'd like to?
> >
> > [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> > [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
>
> Yesterday I manually did this merging your v4.16-rc-host-tags-v5 to 4.18
> branch. For one particular test run, impact of global tags [2] and RFC
> was same. RFC and global tags [2] uses new path via
> blk_mq_try_issue_directly. Performance drop of global tags [2] will be
> visible if we have more physical sockets and single numa node exhaust all
> nr_tags. Most likely negative performance if we have large HDD based
> setup using global tag[2].

Global tags should be fine for HDD, since a small number of tags is enough
for HDD; for example, SATA often has 32 tags. The number of tags is more
important for SSDs, which need to exploit parallelism in the internal NAND
chips.

>
> Performance drop due to reduced nr_tags can be completely avoided if we
> use RFC.
If each drive gets more than 256 tags on average and you still don't see
good performance, I suggest investigating the driver's IO path; maybe
something there takes too long. From the SSD's view, 256 tags should be
enough to reach its top performance.

Thanks,
Ming
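The tag accounting being debated here can be sketched with simple arithmetic. An even split of the HBA's can_queue across per-node hardware queues is an assumption about how the host-tags series behaves, and `tags_per_hctx` is a hypothetical helper, not a kernel function:

```c
#include <assert.h>

/* Model of per-node hw queues splitting a shared HBA queue depth:
 * each hctx gets can_queue / nr_hw_queues tags.  The even split is an
 * assumption about the host-tags series, not verified kernel code. */
unsigned int tags_per_hctx(unsigned int can_queue, unsigned int nr_hw_queues)
{
	return can_queue / nr_hw_queues;
}
```

This reproduces the figures quoted in the thread: ~5K tags over 2 nodes leaves ~2.5K per node, a 988-deep iMR over 2 nodes leaves 494, and a 1K-deep controller over a 4-socket box leaves ~256 per hctx.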
* RE: [RFC] bypass scheduler for no sched is set

From: Kashyap Desai @ 2018-07-05 9:55 UTC
To: Ming Lei
Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig, Hannes Reinecke, linux-scsi, Laurence Oberman

> >
> > Ming,
> >
> > Your patch was trigger for me to review block layer changes as I did not
> > expected performance boost having multiple submission queue for IT/MR HBA
> > due to pseudo parallelism via more hctx.
>
> OK, I guess the driver may not support to submit requests concurrently, is
> it right?

The driver supports concurrent processing, but it eventually submits to
one h/w queue only. IT and MR HBAs have a single h/w submission queue.

>
> >
> > Performance bottleneck is obvious, if we have *one* single scsi_device
> > which can goes upto 1M IOPS. If we have more number of drives in topology
> > which requires more number of outstanding IOs to hit max Performance, we
> > will see gloab tag[2] will be a bottleneck. In case of global tag [2],
> > hctx to cpu mapping was just round robin since we can use blk-mq-pci APIs.
>
> If I remember correctly, the whole tags in this megaraid_sas is ~5K, and in
> your test there are 8 SSD drives, so in case of dual socket system, you still
> get 2.5K tags for all 8 SSDs. In theory, it is quite enough to reach each SSD's
> top performance if the driver .queuecommand() doesn't take too much time.

We have iMR and MR versions of the controllers. iMR supports a 1.6K queue
depth on Ventura family controllers. The same iMR controller from the
Invader family supports a 1K queue depth.

>
> There are at least two benefits with global tags:
>
> 1) hctx is NUMA locality, and ctx is accessed in NUMA locality too

As of now, hctx is not NUMA-local. It is doing round robin CPU assignment.
Am I missing anything? See the output below.
# cat /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/host14/target14:2:63/14:2:63:0/block/sdd/mq/0/cpu_list
0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70

# cat /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/host14/target14:2:63/14:2:63:0/block/sdd/mq/1/cpu_list
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

>
> 2) issue directly in case of none

Agreed. This is what we want for SSD based VDs as the default scheduler.
Currently I am doing this through manual settings.

>
> >
> > There are a benefit of keeping nr_hw_queue = 1 as explained below.
> >
> > More than one nr_hw_queue will reduce tags per hardware context (higher
> > the physical socket we will have more trouble of distributing HBA
> > can_queue) and also it will not allow any IO scheduler to be attached. We
>
> Right, if there is too many NUMA nodes, don't expect this HBA works
> efficiently, since it only has single tags among all nodes & CPUs.
>
> And 2 or 4 nodes should be more popular, you still get >1K tags for
> one single hw queue in case of 4 nodes, which looks not too low.

My current testing is on higher HBA queue depth controllers, but as I
commented above, we have a controller (iMR) which works with a lower HBA
QD. On a 4 socket server + iMR controller, the driver will assign 256
nr_tags per hctx context. One more thing: in case of MR, we create N-drive
VDs, and for that we need the accumulated per-device queue depth. An 8
drive R0 VD needs at least a 256 queue depth.

> > will end up seeing performance issue for HDD based setup w.r.t sequential
> > profile.
> > I already worked with upstream and block layer fix was part of
> > 4.11 kernel. See below link for more detail.
> > https://lkml.org/lkml/2017/1/30/381 - To have this fix, we need
> > mq-deadline scheduler. This scheduler is not available if we call our self
> > as multi-hardware queue.
> >
> > I reconfirm once again that above mentioned issue (IO sorting issue) is
> > only resolved if I use <mq-deadline> scheduler. It means using nr_hw_queue
> > > 1 will reintroduce IO sorting issue.
>
> But all your current test is on none IO scheduler instead of mq-deadline,
> right?

Correct. I am currently checking the SSD based test case, but soon I will
be doing some HDD based tests as well.

>
> > Ideally, we need nr_hw_queue = 1 to get use of io scheduler. MR and IT
> > controller of Broadcom do not want to by-pass IO scheduler all the time.
>
> You may set io scheduler in case of 'nr_hw_queue > 1', please see
> __blk_mq_try_issue_directly(), in which request will be inserted to
> scheduler queue if 'q->elevator' isn't NULL.
>
> >
> > If we mark nr_hw_queue > 1 for IT/MR controller, we will not find any IO
> > scheduler due to below code @ elevator_init_mq and we need io scheduler
> > for HDD based storages.
> >
> > int elevator_init_mq(struct request_queue *q)
> > {
> >         struct elevator_type *e;
> >         int err = 0;
> >
> >         if (q->nr_hw_queues != 1)
> >                 return 0;
>
> You may switch io scheduler via /sys/block/sdN/queue/scheduler in real
> MQ case.

Got it. The kernel will not call blk_mq_init_sched if nr_hw_queue > 1, but
we can still switch through sysfs, since elevator_switch() does not check
nr_hw_queue.

>
> >
> > Using request_queue->tag_set->flags method, we can cherry pick IO
> > scheduler. Block layer will not attach any IO scheduler due to below code
> > @ blk_mq_init_allocated_queue().
> > Eventually, it looks better not to go through IO scheduler in submission
> > path based on same flag settings.
> >
> >         if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
> >                 int ret;
> >
> >                 ret = blk_mq_sched_init(q);
> >                 if (ret)
> >                         return ERR_PTR(ret);
> >         }
>
> Usually BLK_MQ_F_NO_SCHED is set for admin queues, and if you take this
> approach, no any IO schedulers can be applied on this queue any more.
>
> >
> > >
> > > I will double check the 'global tags' patches, meantime could you or
> > > Laurence help to check if global tags[2] works in expected way if
> > > you'd like to?
> > >
> > > [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> > > [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
> >
> > Yesterday I manually did this merging your v4.16-rc-host-tags-v5 to 4.18
> > branch. For one particular test run, impact of global tags [2] and RFC
> > was same. RFC and global tags [2] uses new path via
> > blk_mq_try_issue_directly. Performance drop of global tags [2] will be
> > visible if we have more physical sockets and single numa node exhaust
> > all nr_tags. Most likely negative performance if we have large HDD based
> > setup using global tag[2].
>
> Global tags should be fine for HDD since small tags is enough for HDD, for
> example, SATA often has 32 tags. Number of tags should be important
> for SSD which need to apply parallelism on the internal NAND chip.
>
> >
> > Performance drop due to reduced nr_tags can be completely avoided if we
> > use RFC.
>
> If each drive's average tags is more than 256, and you still may not get
> good performance, I suggest to investigate driver's IO path, maybe
> somewhere takes too long. Because from SSD's view, 256 should be enough
> to reach its top performance.

One of the cases I am trying now is using an iMR controller (HBA QD =
1000) with an OLTP workload. (I will send you the fio script.)
Using the global tags [2] patch -

12 SSDs in single drive R0 mode: IOPS read 890K / write 440K (host_busy = ~490 because nr_tags = 494)
24 SSDs in single drive R0 mode: IOPS read 1312K / write 649K (host_busy = ~490 because nr_tags = 494)

Using the RFC -

12 SSDs in single drive R0 mode: IOPS read 1050K / write 510K (host_busy = ~750 because nr_tags = 988)
24 SSDs in single drive R0 mode: IOPS read 1650K / write 855K (host_busy = ~988 because nr_tags = 988)

A ~25% performance drop is caused just by not having enough tags. Most
likely a similar drop will be visible whenever we have a large topology
and max IOPS saturate at the HBA level (host_busy reaching can_queue).

I agree that mq-deadline selection is still possible with the global tags
[2] patch, but the major concern is dividing can_queue into nr_tags per
hctx. I don't think we will need any scheduler operation for SSDs (having
said that, the "none" scheduler for the SSD case should not be an issue).
Using BLK_MQ_F_NO_SCHED *only* for non-rotational media is still a good
choice.

In summary, we need an interface to use blk_mq_try_issue_directly for
devices connected to the scsi stack with nr_hw_queue = 1. We can achieve
that using your global tags [2] patch, but it divides can_queue, and we
may see a big performance issue whenever reaching peak performance really
requires the max HBA queue depth outstanding. The RFC patch keeps things
simple and serves the same purpose of calling blk_mq_try_issue_directly,
if the low level driver wants it. It will continue working in the same
hctx context without dividing can_queue. I see that not dividing can_queue
is much needed.

Kashyap

> Thanks,
> Ming
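The ~25% figure in this message can be checked against the IOPS quoted above. This is only back-of-the-envelope arithmetic over the posted numbers; `drop_percent` is an illustrative helper, not part of any patch:

```c
#include <assert.h>

/* Relative IOPS drop when global tags halves the usable depth
 * (494 vs 988 tags in the measurements above). */
double drop_percent(double with_rfc, double with_global_tags)
{
	return (with_rfc - with_global_tags) * 100.0 / with_rfc;
}
```

For the 24-SSD run this gives about a 24% drop on writes (855K vs 649K) and about 20% on reads (1650K vs 1312K), consistent with the ~25% claim.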
* RE: [RFC] bypass scheduler for no sched is set

From: Kashyap Desai @ 2018-07-10 7:11 UTC
To: Ming Lei
Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig, Hannes Reinecke, linux-scsi, Laurence Oberman

Ming,

Thanks for checking this request and sending the appropriate changes.
These RFC changes have settled down into proper block layer changes (patch
below) sent by Ming.

https://www.spinics.net/lists/linux-block/msg28279.html

Kashyap

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Wednesday, July 4, 2018 2:38 PM
> To: Kashyap Desai
> Cc: linux-block@vger.kernel.org; Jens Axboe; Omar Sandoval; Christoph
> Hellwig; Hannes Reinecke; linux-scsi; Laurence Oberman
> Subject: Re: [RFC] bypass scheduler for no sched is set
>
> On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> > Hi,
> >
> > Ming Lei posted below patch series and performance improved for
> > megaraid_sas driver. I used the same kernel base and figure out some more
> > possible performance improvement in block layer. This RFC improves
> > performance as well as CPU utilization. If this patch fits the design
> > aspect of the blk-mq and scsi-mq, I can convert it into PATCH and submit
> > the same/modified version.
> >
> > https://marc.info/?l=linux-block&m=153062994403732&w=2
> >
> > Description of change -
> >
> > Do not insert request into software queue if BLK_MQ_F_NO_SCHED is set.
> > Submit request from blk_mq_make_request to low level driver directly as
> > depicted through below function call.
> >
> > blk_mq_try_issue_directly
> >     __blk_mq_try_issue_directly
> >         scsi_queue_rq
>
> Hi Kashyap,
>
> When I sent you the patches[1] which include 'global tags' support, MegaRAID
> is converted to per-node queues and becomes real MQ(q->nr_hw_queues > 1), IO
> should have been issued direclty, please see the branch of '(plug &&
> !blk_queue_nomerges(q))' and '(q->nr_hw_queues > 1 && is_sync)' in
> blk_mq_make_request().
>
> But looks not see any improvement from your test result, even though huge
> improvement can be observed on null_blk/scsi_debug. Maybe somewhere is
> wrong, and days ago I talked with Laurence about doing this kind of test
> again.
>
> I will double check the 'global tags' patches, meantime could you or
> Laurence help to check if global tags[2] works in expected way if
> you'd like to?
>
> [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
>
> > Low level driver attached to scsi.mq can set BLK_MQ_F_NO_SCHED, If they do
> > not want benefit from io scheduler (e.a in case of SSDs connected to IT/MR
> > controller). In case of HDD drives connected to HBA, driver can avoid
> > setting BLK_MQ_F_NO_SCHED so that default elevator is set to mq-deadline.
>
> That might be one way for improving megaraid_sas.
>
> However, even for modern high performance device, such as NVMe, turns out
> IO scheduler(such as kyber) may play a big role for improving latency or
> throughput.
>
> Thanks,
> Ming