* [RFC] bypass scheduler for no sched is set
@ 2018-07-04  7:59 Kashyap Desai
  2018-07-04  9:08 ` Ming Lei
From: Kashyap Desai @ 2018-07-04  7:59 UTC
  To: linux-block, Jens Axboe, Ming Lei, Omar Sandoval,
	Christoph Hellwig, Hannes Reinecke, linux-scsi


Hi,

Ming Lei posted the patch series below, and performance improved for the
megaraid_sas driver. I used the same kernel base and found some more
possible performance improvements in the block layer. This RFC improves
performance as well as CPU utilization. If this patch fits the design of
blk-mq and scsi-mq, I can convert it into a PATCH and submit the same or
a modified version.

https://marc.info/?l=linux-block&m=153062994403732&w=2

Description of change -

Do not insert the request into the software queue if BLK_MQ_F_NO_SCHED is
set. Submit the request from blk_mq_make_request() directly to the
low-level driver, as depicted by the call chain below.

blk_mq_try_issue_directly
  __blk_mq_try_issue_directly
    scsi_queue_rq

A low-level driver attached to scsi-mq can set BLK_MQ_F_NO_SCHED if it
does not want to benefit from an IO scheduler (e.g. in the case of SSDs
connected to an IT/MR controller). For HDDs connected to an HBA, the
driver can avoid setting BLK_MQ_F_NO_SCHED so that the default elevator
is set to mq-deadline.
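
As an illustration only, a minimal sketch of how a low-level driver could
opt in at tag-set setup time; the helper name and the all-SSD policy check
are hypothetical and not part of this RFC or of megaraid_sas, while
BLK_MQ_F_NO_SCHED itself is the real flag this RFC tests:

#include <linux/blk-mq.h>

/* Hypothetical LLD helper: skip the elevator for all-SSD topologies. */
static void example_init_tag_set_flags(struct blk_mq_tag_set *set,
				       bool all_luns_are_ssd)
{
	set->flags = BLK_MQ_F_SHOULD_MERGE;
	if (all_luns_are_ssd)
		/* requests are issued directly, no sw queue / elevator */
		set->flags |= BLK_MQ_F_NO_SCHED;
}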

The setup and performance numbers are detailed below.

I created one R0 VD consisting of 8 SSDs on a MegaRAID adapter.

Without the RFC, IOPS reach 840K and CPU utilization goes up to 11%. Below
is the perf top output:

   5.17%  [kernel]                 [k] _raw_spin_lock
   4.62%  [kernel]                 [k] try_to_grab_pending
   2.29%  [kernel]                 [k] syscall_return_via_sysret
   1.37%  [kernel]                 [k] blk_mq_flush_busy_ctxs
   1.29%  [kernel]                 [k] kobject_get
   1.27%  fio                      [.] axmap_isset
   1.25%  [kernel]                 [k] flush_busy_ctx
   1.20%  [kernel]                 [k] scsi_dispatch_cmd
   1.18%  [kernel]                 [k] blk_mq_get_request
   1.16%  [kernel]                 [k] blk_mq_hctx_mark_pending.isra.45
   1.09%  [kernel]                 [k] irq_entries_start
   0.94%  [kernel]                 [k] del_timer
   0.91%  [kernel]                 [k] scsi_softirq_done
   0.90%  [kernel]                 [k] sbitmap_any_bit_set
   0.83%  [kernel]                 [k] blk_mq_free_request
   0.82%  [kernel]                 [k] kobject_put
   0.81%  [sd_mod]                 [k] sd_setup_read_write_cmnd
   0.80%  [kernel]                 [k] scsi_mq_get_budget
   0.79%  [kernel]                 [k] blk_mq_get_tag
   0.70%  [kernel]                 [k] blk_mq_dispatch_rq_list
   0.61%  [kernel]                 [k] bt_iter
   0.60%  fio                      [.] __fio_gettime
   0.59%  [kernel]                 [k] blk_mq_complete_request
   0.59%  [kernel]                 [k] gup_pgd_range
   0.57%  [kernel]                 [k] scsi_queue_rq


After applying the RFC, IOPS reach 1066K and CPU utilization goes up to 6%:

   2.56%  [kernel]             [k] syscall_return_via_sysret
   2.46%  [kernel]             [k] irq_entries_start
   2.43%  [kernel]             [k] kobject_get
   2.40%  [kernel]             [k] bt_iter
   2.16%  fio                  [.] axmap_isset
   2.06%  [kernel]             [k] _raw_spin_lock
   1.76%  [kernel]             [k] __audit_syscall_exit
   1.51%  [kernel]             [k] scsi_dispatch_cmd
   1.49%  [kernel]             [k] blk_mq_free_request
   1.49%  [sd_mod]             [k] sd_setup_read_write_cmnd
   1.45%  [kernel]             [k] scsi_softirq_done
   1.32%  [kernel]             [k] switch_mm_irqs_off
   1.28%  [kernel]             [k] scsi_mq_get_budget
   1.22%  [kernel]             [k] blk_mq_check_inflight
   1.13%  [kernel]             [k] kobject_put
   1.11%  fio                  [.] __fio_gettime
   0.95%  [kernel]             [k] gup_pgd_range
   0.90%  [kernel]             [k] blk_mq_get_tag
   0.88%  [kernel]             [k] read_tsc
   0.85%  [kernel]             [k] scsi_end_request
   0.85%  fio                  [.] get_io_u
   0.84%  [kernel]             [k] lookup_ioctx
   0.81%  [kernel]             [k] blk_mq_complete_request
   0.80%  [kernel]             [k] blk_mq_get_request

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4d1c048..ab27788 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,32 +1811,35 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
         blk_insert_flush(rq);
         blk_mq_run_hw_queue(data.hctx, true);
     } else if (plug && q->nr_hw_queues == 1) {
-        struct request *last = NULL;
-
         blk_mq_put_ctx(data.ctx);
         blk_mq_bio_to_request(rq, bio);
+        /* bypass scheduler for no sched flag set */
+        if (q->tag_set->flags & BLK_MQ_F_NO_SCHED)
+            blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+        else {
+            struct request *last = NULL;
+            /*
+             * @request_count may become stale because of schedule
+             * out, so check the list again.
+             */
+            if (list_empty(&plug->mq_list))
+                request_count = 0;
+            else if (blk_queue_nomerges(q))
+                request_count = blk_plug_queued_count(q);
+
+            if (!request_count)
+                trace_block_plug(q);
+            else
+                last = list_entry_rq(plug->mq_list.prev);
+
+            if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+                blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+                blk_flush_plug_list(plug, false);
+                trace_block_plug(q);
+            }

-        /*
-         * @request_count may become stale because of schedule
-         * out, so check the list again.
-         */
-        if (list_empty(&plug->mq_list))
-            request_count = 0;
-        else if (blk_queue_nomerges(q))
-            request_count = blk_plug_queued_count(q);
-
-        if (!request_count)
-            trace_block_plug(q);
-        else
-            last = list_entry_rq(plug->mq_list.prev);
-
-        if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
-            blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
-            blk_flush_plug_list(plug, false);
-            trace_block_plug(q);
+            list_add_tail(&rq->queuelist, &plug->mq_list);
         }
-
-        list_add_tail(&rq->queuelist, &plug->mq_list);
     } else if (plug && !blk_queue_nomerges(q)) {
         blk_mq_bio_to_request(rq, bio);



* Re: [RFC] bypass scheduler for no sched is set
  2018-07-04  7:59 [RFC] bypass scheduler for no sched is set Kashyap Desai
@ 2018-07-04  9:08 ` Ming Lei
  2018-07-04 10:37   ` Kashyap Desai
  2018-07-10  7:11   ` Kashyap Desai
From: Ming Lei @ 2018-07-04  9:08 UTC
  To: Kashyap Desai
  Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, Laurence Oberman

On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> Hi,
> 
> Ming Lei posted the patch series below, and performance improved for the
> megaraid_sas driver. I used the same kernel base and found some more
> possible performance improvements in the block layer. This RFC improves
> performance as well as CPU utilization. If this patch fits the design of
> blk-mq and scsi-mq, I can convert it into a PATCH and submit the same or
> a modified version.
> 
> https://marc.info/?l=linux-block&m=153062994403732&w=2
> 
> Description of change -
> 
> Do not insert the request into the software queue if BLK_MQ_F_NO_SCHED is
> set. Submit the request from blk_mq_make_request() directly to the
> low-level driver, as depicted by the call chain below.
> 
> blk_mq_try_issue_directly
>   __blk_mq_try_issue_directly
>     scsi_queue_rq

Hi Kashyap,

With the patches[1] I sent you, which include 'global tags' support,
MegaRAID is converted to per-node queues and becomes real MQ
(q->nr_hw_queues > 1), so IO should already be issued directly; please see
the '(plug && !blk_queue_nomerges(q))' and '(q->nr_hw_queues > 1 &&
is_sync)' branches in blk_mq_make_request().
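
For reference, the direct-issue branch in question looks roughly like this
(paraphrased from memory of a v4.18-era blk_mq_make_request(); check the
actual tree, since the details vary between releases):

	} else if (q->nr_hw_queues > 1 && is_sync) {
		/* real MQ: bypass the sw queue and issue to the driver */
		blk_mq_put_ctx(data.ctx);
		blk_mq_bio_to_request(rq, bio);
		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
	}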

But your test result does not seem to show any improvement, even though a
huge improvement can be observed on null_blk/scsi_debug. Maybe something
is wrong somewhere; a few days ago I talked with Laurence about doing this
kind of test again.

I will double-check the 'global tags' patches; meanwhile, could you or
Laurence help check whether global tags[2] works as expected, if you'd
like to?

[1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
[2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
> 
> A low-level driver attached to scsi-mq can set BLK_MQ_F_NO_SCHED if it
> does not want to benefit from an IO scheduler (e.g. in the case of SSDs
> connected to an IT/MR controller). For HDDs connected to an HBA, the
> driver can avoid setting BLK_MQ_F_NO_SCHED so that the default elevator
> is set to mq-deadline.

That might be one way to improve megaraid_sas.

However, it turns out that even for modern high-performance devices such
as NVMe, an IO scheduler (such as kyber) may play a big role in improving
latency or throughput.


Thanks,
Ming


* RE: [RFC] bypass scheduler for no sched is set
  2018-07-04  9:08 ` Ming Lei
@ 2018-07-04 10:37   ` Kashyap Desai
  2018-07-05  2:28     ` Ming Lei
  2018-07-10  7:11   ` Kashyap Desai
From: Kashyap Desai @ 2018-07-04 10:37 UTC
  To: Ming Lei
  Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Wednesday, July 4, 2018 2:38 PM
> To: Kashyap Desai
> Cc: linux-block@vger.kernel.org; Jens Axboe; Omar Sandoval; Christoph
> Hellwig; Hannes Reinecke; linux-scsi; Laurence Oberman
> Subject: Re: [RFC] bypass scheduler for no sched is set
>
> On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> > Hi,
> >
> > Ming Lei posted the patch series below, and performance improved for
> > the megaraid_sas driver. I used the same kernel base and found some
> > more possible performance improvements in the block layer. This RFC
> > improves performance as well as CPU utilization. If this patch fits
> > the design of blk-mq and scsi-mq, I can convert it into a PATCH and
> > submit the same or a modified version.
> >
> > https://marc.info/?l=linux-block&m=153062994403732&w=2
> >
> > Description of change -
> >
> > Do not insert the request into the software queue if BLK_MQ_F_NO_SCHED
> > is set. Submit the request from blk_mq_make_request() directly to the
> > low-level driver, as depicted by the call chain below.
> >
> > blk_mq_try_issue_directly
> >   __blk_mq_try_issue_directly
> >     scsi_queue_rq
>
> Hi Kashyap,
>
> With the patches[1] I sent you, which include 'global tags' support,
> MegaRAID is converted to per-node queues and becomes real MQ
> (q->nr_hw_queues > 1), so IO should already be issued directly; please
> see the '(plug && !blk_queue_nomerges(q))' and '(q->nr_hw_queues > 1 &&
> is_sync)' branches in blk_mq_make_request().

I observed this while working on the performance issue with you, and from
your earlier 'global tags' patch.

>
> But your test result does not seem to show any improvement, even though
> a huge improvement can be observed on null_blk/scsi_debug. Maybe
> something is wrong somewhere; a few days ago I talked with Laurence
> about doing this kind of test again.
>

Ming,

Your patch was the trigger for me to review the block layer changes, as I
did not expect a performance boost from having multiple submission queues
for an IT/MR HBA due to the pseudo-parallelism of more hctxs.

The performance bottleneck is obvious if we have *one* single scsi_device
which can go up to 1M IOPS. If we have more drives in the topology, which
requires more outstanding IOs to hit max performance, we will see that
global tags[2] becomes a bottleneck. In the case of global tags[2], the
hctx-to-CPU mapping was just round robin, since we can use the blk-mq-pci
APIs.

There is a benefit to keeping nr_hw_queue = 1, as explained below.

More than one nr_hw_queue will reduce the tags per hardware context (the
more physical sockets, the more trouble we have distributing the HBA's
can_queue), and it will also not allow any IO scheduler to be attached. We
will end up seeing performance issues for HDD-based setups w.r.t.
sequential profiles. I already worked with upstream on this, and a block
layer fix was part of the 4.11 kernel; see the link below for more detail.
https://lkml.org/lkml/2017/1/30/381 - To have this fix, we need the
mq-deadline scheduler, which is not available if we declare ourselves
multi-hardware-queue.

I reconfirm once again that the above-mentioned issue (the IO sorting
issue) is only resolved if I use the <mq-deadline> scheduler. That means
using nr_hw_queue > 1 will reintroduce the IO sorting issue.

Ideally, we need nr_hw_queue = 1 to make use of an IO scheduler.
Broadcom's MR and IT controllers do not want to bypass the IO scheduler
all the time.

If we mark nr_hw_queue > 1 for an IT/MR controller, no IO scheduler will
be set up due to the code below in elevator_init_mq(), and we need an IO
scheduler for HDD-based storage.

int elevator_init_mq(struct request_queue *q)
{
        struct elevator_type *e;
        int err = 0;

        if (q->nr_hw_queues != 1)
                return 0;

Using the request_queue->tag_set->flags method, we can cherry-pick the IO
scheduler. The block layer will not attach any IO scheduler due to the
code below in blk_mq_init_allocated_queue(). So it looks better not to go
through the IO scheduler in the submission path either, based on the same
flag setting.

if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
		int ret;

		ret = blk_mq_sched_init(q);
		if (ret)
			return ERR_PTR(ret);
}



> I will double-check the 'global tags' patches; meanwhile, could you or
> Laurence help check whether global tags[2] works as expected, if you'd
> like to?
>
> [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8

Yesterday I manually did this, merging your v4.16-rc-host-tags-v5 onto a
4.18 branch. For one particular test run, the impact of global tags[2] and
of the RFC was the same. Both the RFC and global tags[2] use the new path
via blk_mq_try_issue_directly. A performance drop with global tags[2] will
become visible if we have more physical sockets and a single NUMA node
exhausts all nr_tags. Most likely we will see negative performance on a
large HDD-based setup using global tags[2].

The performance drop due to reduced nr_tags can be completely avoided if
we use the RFC.

Kashyap

> >
> > A low-level driver attached to scsi-mq can set BLK_MQ_F_NO_SCHED if it
> > does not want to benefit from an IO scheduler (e.g. in the case of SSDs
> > connected to an IT/MR controller). For HDDs connected to an HBA, the
> > driver can avoid setting BLK_MQ_F_NO_SCHED so that the default elevator
> > is set to mq-deadline.
>
> That might be one way to improve megaraid_sas.
>
> However, it turns out that even for modern high-performance devices such
> as NVMe, an IO scheduler (such as kyber) may play a big role in improving
> latency or throughput.
>
>
> Thanks,
> Ming


* Re: [RFC] bypass scheduler for no sched is set
  2018-07-04 10:37   ` Kashyap Desai
@ 2018-07-05  2:28     ` Ming Lei
  2018-07-05  9:55       ` Kashyap Desai
From: Ming Lei @ 2018-07-05  2:28 UTC
  To: Kashyap Desai
  Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, Laurence Oberman

Hi Kashyap,

On Wed, Jul 04, 2018 at 04:07:45PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, July 4, 2018 2:38 PM
> > To: Kashyap Desai
> > Cc: linux-block@vger.kernel.org; Jens Axboe; Omar Sandoval; Christoph
> > Hellwig; Hannes Reinecke; linux-scsi; Laurence Oberman
> > Subject: Re: [RFC] bypass scheduler for no sched is set
> >
> > On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> > > Hi,
> > >
> > > Ming Lei posted the patch series below, and performance improved for
> > > the megaraid_sas driver. I used the same kernel base and found some
> > > more possible performance improvements in the block layer. This RFC
> > > improves performance as well as CPU utilization. If this patch fits
> > > the design of blk-mq and scsi-mq, I can convert it into a PATCH and
> > > submit the same or a modified version.
> > >
> > > https://marc.info/?l=linux-block&m=153062994403732&w=2
> > >
> > > Description of change -
> > >
> > > Do not insert the request into the software queue if
> > > BLK_MQ_F_NO_SCHED is set. Submit the request from
> > > blk_mq_make_request() directly to the low-level driver, as depicted
> > > by the call chain below.
> > >
> > > blk_mq_try_issue_directly
> > >   __blk_mq_try_issue_directly
> > >     scsi_queue_rq
> >
> > Hi Kashyap,
> >
> > With the patches[1] I sent you, which include 'global tags' support,
> > MegaRAID is converted to per-node queues and becomes real MQ
> > (q->nr_hw_queues > 1), so IO should already be issued directly; please
> > see the '(plug && !blk_queue_nomerges(q))' and '(q->nr_hw_queues > 1 &&
> > is_sync)' branches in blk_mq_make_request().
> 
> I observed this while working on the performance issue with you, and
> from your earlier 'global tags' patch.
> 
> >
> > But your test result does not seem to show any improvement, even
> > though a huge improvement can be observed on null_blk/scsi_debug.
> > Maybe something is wrong somewhere; a few days ago I talked with
> > Laurence about doing this kind of test again.
> >
> 
> Ming,
> 
> Your patch was the trigger for me to review the block layer changes, as
> I did not expect a performance boost from having multiple submission
> queues for an IT/MR HBA due to the pseudo-parallelism of more hctxs.

OK, I guess the driver may not support submitting requests concurrently,
is that right?

> 
> The performance bottleneck is obvious if we have *one* single
> scsi_device which can go up to 1M IOPS. If we have more drives in the
> topology, which requires more outstanding IOs to hit max performance, we
> will see that global tags[2] becomes a bottleneck. In the case of global
> tags[2], the hctx-to-CPU mapping was just round robin, since we can use
> the blk-mq-pci APIs.

If I remember correctly, the total tags on this megaraid_sas are ~5K, and
in your test there are 8 SSD drives, so on a dual-socket system you still
get 2.5K tags for all 8 SSDs. In theory, that is quite enough to reach
each SSD's top performance if the driver's .queuecommand() doesn't take
too much time.

There are at least two benefits of global tags:

1) each hctx is NUMA-local, and its ctxs are accessed with NUMA locality too

2) direct issue in the case of the 'none' scheduler

> 
> There is a benefit to keeping nr_hw_queue = 1, as explained below.
> 
> More than one nr_hw_queue will reduce the tags per hardware context (the
> more physical sockets, the more trouble we have distributing the HBA's
> can_queue), and it will also not allow any IO scheduler to be attached. We

Right, if there are too many NUMA nodes, don't expect this HBA to work
efficiently, since it only has a single tag space among all nodes & CPUs.

And 2 or 4 nodes should be the more common case; you still get >1K tags
for a single hw queue in the case of 4 nodes, which doesn't look too low.

> will end up seeing performance issues for HDD-based setups w.r.t.
> sequential profiles. I already worked with upstream on this, and a block
> layer fix was part of the 4.11 kernel; see the link below for more detail.
> https://lkml.org/lkml/2017/1/30/381 - To have this fix, we need the
> mq-deadline scheduler, which is not available if we declare ourselves
> multi-hardware-queue.
> 
> I reconfirm once again that the above-mentioned issue (the IO sorting
> issue) is only resolved if I use the <mq-deadline> scheduler. That means
> using nr_hw_queue > 1 will reintroduce the IO sorting issue.
> 

But all your current tests are with the 'none' IO scheduler instead of
mq-deadline, right?

> Ideally, we need nr_hw_queue = 1 to make use of an IO scheduler.
> Broadcom's MR and IT controllers do not want to bypass the IO scheduler
> all the time.

You may still set an IO scheduler in the 'nr_hw_queue > 1' case; please
see __blk_mq_try_issue_directly(), in which the request will be inserted
into the scheduler queue if 'q->elevator' isn't NULL.
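
For reference, the relevant checks look roughly like this (trimmed and
paraphrased from a v4.18-era __blk_mq_try_issue_directly(); see the actual
source for the full logic):

	/* inside __blk_mq_try_issue_directly() */
	if (q->elevator && !bypass_insert)
		goto insert;	/* hand the request to the scheduler queue */

	if (!blk_mq_get_dispatch_budget(hctx))
		goto insert;	/* no budget: fall back to a normal insert */

	return __blk_mq_issue_directly(hctx, rq, cookie);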

> 
> If we mark nr_hw_queue > 1 for an IT/MR controller, no IO scheduler will
> be set up due to the code below in elevator_init_mq(), and we need an IO
> scheduler for HDD-based storage.
> 
> int elevator_init_mq(struct request_queue *q)
> {
>         struct elevator_type *e;
>         int err = 0;
> 
>         if (q->nr_hw_queues != 1)
>                 return 0;

You may switch the IO scheduler via /sys/block/sdN/queue/scheduler in the
real MQ case.
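
For example (sdN is a placeholder, and the list of schedulers you see
depends on the kernel config):

# cat /sys/block/sdN/queue/scheduler
[none] mq-deadline kyber
# echo mq-deadline > /sys/block/sdN/queue/scheduler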

> 
> Using the request_queue->tag_set->flags method, we can cherry-pick the IO
> scheduler. The block layer will not attach any IO scheduler due to the
> code below in blk_mq_init_allocated_queue(). So it looks better not to go
> through the IO scheduler in the submission path either, based on the same
> flag setting.
> 
> if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
> 		int ret;
> 
> 		ret = blk_mq_sched_init(q);
> 		if (ret)
> 			return ERR_PTR(ret);
> }

Usually BLK_MQ_F_NO_SCHED is set for admin queues, and if you take this
approach, no IO scheduler can be applied to this queue any more.

> 
> 
> 
> > I will double-check the 'global tags' patches; meanwhile, could you or
> > Laurence help check whether global tags[2] works as expected, if you'd
> > like to?
> >
> > [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> > [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
> 
> Yesterday I manually did this, merging your v4.16-rc-host-tags-v5 onto a
> 4.18 branch. For one particular test run, the impact of global tags[2]
> and of the RFC was the same. Both the RFC and global tags[2] use the new
> path via blk_mq_try_issue_directly. A performance drop with global
> tags[2] will become visible if we have more physical sockets and a single
> NUMA node exhausts all nr_tags. Most likely we will see negative
> performance on a large HDD-based setup using global tags[2].

Global tags should be fine for HDD, since a small number of tags is enough
for HDD; for example, SATA often has 32 tags. The number of tags matters
more for SSDs, which need parallelism across the internal NAND chips.

> 
> The performance drop due to reduced nr_tags can be completely avoided if
> we use the RFC.

If each drive's average tag count is more than 256 and you still don't get
good performance, I suggest investigating the driver's IO path; maybe
something takes too long somewhere. From an SSD's point of view, 256 tags
should be enough to reach its top performance.

Thanks,
Ming


* RE: [RFC] bypass scheduler for no sched is set
  2018-07-05  2:28     ` Ming Lei
@ 2018-07-05  9:55       ` Kashyap Desai
From: Kashyap Desai @ 2018-07-05  9:55 UTC
  To: Ming Lei
  Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, Laurence Oberman

> >
> > Ming,
> >
> > Your patch was the trigger for me to review the block layer changes,
> > as I did not expect a performance boost from having multiple submission
> > queues for an IT/MR HBA due to the pseudo-parallelism of more hctxs.
>
> OK, I guess the driver may not support submitting requests concurrently,
> is that right?

The driver supports concurrent processing, but it eventually submits to
only one h/w queue. IT and MR HBAs have a single h/w submission queue.

>
> >
> > The performance bottleneck is obvious if we have *one* single
> > scsi_device which can go up to 1M IOPS. If we have more drives in the
> > topology, which requires more outstanding IOs to hit max performance,
> > we will see that global tags[2] becomes a bottleneck. In the case of
> > global tags[2], the hctx-to-CPU mapping was just round robin, since we
> > can use the blk-mq-pci APIs.
>
> If I remember correctly, the total tags on this megaraid_sas are ~5K, and
> in your test there are 8 SSD drives, so on a dual-socket system you still
> get 2.5K tags for all 8 SSDs. In theory, that is quite enough to reach
> each SSD's top performance if the driver's .queuecommand() doesn't take
> too much time.

We have iMR and MR versions of the controllers. iMR supports a 1.6K queue
depth on Ventura-family controllers.
The same iMR on an Invader-family controller supports a 1K queue depth.

>
> There are at least two benefits of global tags:
>
> 1) each hctx is NUMA-local, and its ctxs are accessed with NUMA locality
> too

As of now, hctx is not NUMA-local. It is doing round-robin CPU assignment.
Am I missing anything? See the output below.

# cat /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/host14/target14:2:63/14:2:63:0/block/sdd/mq/0/cpu_list
0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38,
40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70

# cat /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/host14/target14:2:63/14:2:63:0/block/sdd/mq/1/cpu_list
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39,
41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41
42 43 44 45 46 47 48 49 50 51 52 53
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71


>
> 2) direct issue in the case of the 'none' scheduler

Agreed. This is what we want for SSD-based VDs as the default. Currently I
am doing this through manual settings.

>
> >
> > There is a benefit to keeping nr_hw_queue = 1, as explained below.
> >
> > More than one nr_hw_queue will reduce the tags per hardware context
> > (the more physical sockets, the more trouble we have distributing the
> > HBA's can_queue), and it will also not allow any IO scheduler to be
> > attached. We
>
> Right, if there are too many NUMA nodes, don't expect this HBA to work
> efficiently, since it only has a single tag space among all nodes & CPUs.
>
> And 2 or 4 nodes should be the more common case; you still get >1K tags
> for a single hw queue in the case of 4 nodes, which doesn't look too low.

My current testing is on controllers with a higher HBA queue depth, but as
I commented above, we have a controller (iMR) which works with a lower HBA
QD. In the 4-socket server + iMR controller case, the driver will assign
256 nr_tags per hctx (roughly the ~1K can_queue divided over 4 nodes).

One more thing: in the case of MR we create an N-drive VD, and for that we
need the accumulated per-device queue depth. An 8-drive R0 VD needs at
least a 256 queue depth.

>
> > will end up seeing performance issues for HDD-based setups w.r.t.
> > sequential profiles. I already worked with upstream on this, and a
> > block layer fix was part of the 4.11 kernel; see the link below for
> > more detail.
> > https://lkml.org/lkml/2017/1/30/381 - To have this fix, we need the
> > mq-deadline scheduler, which is not available if we declare ourselves
> > multi-hardware-queue.
> >
> > I reconfirm once again that the above-mentioned issue (the IO sorting
> > issue) is only resolved if I use the <mq-deadline> scheduler. That
> > means using nr_hw_queue > 1 will reintroduce the IO sorting issue.
> >
>
> But all your current tests are with the 'none' IO scheduler instead of
> mq-deadline, right?

Correct. I am currently checking the SSD-based test case, but soon I will
be doing some HDD-based tests as well.

>
> > Ideally, we need nr_hw_queue = 1 to make use of an IO scheduler.
> > Broadcom's MR and IT controllers do not want to bypass the IO
> > scheduler all the time.
>
> You may still set an IO scheduler in the 'nr_hw_queue > 1' case; please
> see __blk_mq_try_issue_directly(), in which the request will be inserted
> into the scheduler queue if 'q->elevator' isn't NULL.
>
> >
> > If we mark nr_hw_queue > 1 for an IT/MR controller, no IO scheduler
> > will be set up due to the code below in elevator_init_mq(), and we
> > need an IO scheduler for HDD-based storage.
> >
> > int elevator_init_mq(struct request_queue *q)
> > {
> >         struct elevator_type *e;
> >         int err = 0;
> >
> >         if (q->nr_hw_queues != 1)
> >                 return 0;
>
> You may switch the IO scheduler via /sys/block/sdN/queue/scheduler in
> the real MQ case.

Got it. The kernel will not call blk_mq_init_sched() if nr_hw_queue > 1,
but we can still switch through sysfs, since elevator_switch() does not
check nr_hw_queue.
>
> >
> > Using the request_queue->tag_set->flags method, we can cherry-pick the
> > IO scheduler. The block layer will not attach any IO scheduler due to
> > the code below in blk_mq_init_allocated_queue(). So it looks better
> > not to go through the IO scheduler in the submission path either,
> > based on the same flag setting.
> >
> > if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
> > 		int ret;
> >
> > 		ret = blk_mq_sched_init(q);
> > 		if (ret)
> > 			return ERR_PTR(ret);
> > }
>
> Usually BLK_MQ_F_NO_SCHED is set for admin queues, and if you take this
> approach, no IO scheduler can be applied to this queue any more.
>
> >
> >
> >
> > > I will double-check the 'global tags' patches; meanwhile, could you
> > > or Laurence help check whether global tags[2] works as expected, if
> > > you'd like to?
> > >
> > > [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> > > [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
> >
> > Yesterday I manually did this, merging your v4.16-rc-host-tags-v5 onto
> > a 4.18 branch. For one particular test run, the impact of global
> > tags[2] and of the RFC was the same. Both the RFC and global tags[2]
> > use the new path via blk_mq_try_issue_directly. A performance drop
> > with global tags[2] will become visible if we have more physical
> > sockets and a single NUMA node exhausts all nr_tags. Most likely we
> > will see negative performance on a large HDD-based setup using global
> > tags[2].
>
> Global tags should be fine for HDD, since a small number of tags is
> enough for HDD; for example, SATA often has 32 tags. The number of tags
> matters more for SSDs, which need parallelism across the internal NAND
> chips.
>
> >
> > The performance drop due to reduced nr_tags can be completely avoided
> > if we use the RFC.
>
> If each drive's average tag count is more than 256 and you still don't
> get good performance, I suggest investigating the driver's IO path;
> maybe something takes too long somewhere. From an SSD's point of view,
> 256 tags should be enough to reach its top performance.

One of the cases I am trying now uses an iMR controller (HBA QD = 1000).

OLTP workload. (I will send you the fio script.)

Using the global tags[2] patch -
12 SSDs in single-drive R0 mode:
IOPS read 890K / write 440K (host_busy = ~490 because nr_tags = 494)
24 SSDs in single-drive R0 mode:
IOPS read 1312K / write 649K (host_busy = ~490 because nr_tags = 494)

Using the RFC -
12 SSDs in single-drive R0 mode:
IOPS read 1050K / write 510K (host_busy = ~750 because nr_tags = 988)
24 SSDs in single-drive R0 mode:
IOPS read 1650K / write 855K (host_busy = ~988 because nr_tags = 988)

A ~25% performance drop is contributed just by not having enough tags.
Most likely a similar drop will easily be visible whenever we have a large
topology and max IOPS saturate at the HBA level (host_busy reaching
can_queue).

I agree that mq-deadline selection is still possible with the global
tags[2] patch, but the major concern is dividing can_queue into nr_tags
per hctx. I don't think we need any scheduler operation for SSDs (that
said, the 'none' scheduler for the SSD case should not be an issue).
Using BLK_MQ_F_NO_SCHED *only* for non-rotational media is still a good
choice.

In summary, we need an interface to use blk_mq_try_issue_directly for
devices connected to the SCSI stack with nr_hw_queue = 1. We can achieve
that using your global tags[2] patch, but that divides can_queue, and we
may see a big performance issue when the workload really needs the max HBA
queue depth outstanding. The RFC patch keeps things simple and serves the
same purpose of calling blk_mq_try_issue_directly, if the low-level driver
wants it. It continues working in the same hctx context without dividing
can_queue. I see that not dividing can_queue is much needed.

Kashyap

>
> Thanks,
> Ming


* RE: [RFC] bypass scheduler for no sched is set
  2018-07-04  9:08 ` Ming Lei
  2018-07-04 10:37   ` Kashyap Desai
@ 2018-07-10  7:11   ` Kashyap Desai
From: Kashyap Desai @ 2018-07-10  7:11 UTC
  To: Ming Lei
  Cc: linux-block, Jens Axboe, Omar Sandoval, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, Laurence Oberman

Ming, thanks for checking this request and sending the appropriate
changes. This RFC has been settled by the proper block layer changes
(patch below) sent by Ming:

https://www.spinics.net/lists/linux-block/msg28279.html

Kashyap

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Wednesday, July 4, 2018 2:38 PM
> To: Kashyap Desai
> Cc: linux-block@vger.kernel.org; Jens Axboe; Omar Sandoval; Christoph
> Hellwig; Hannes Reinecke; linux-scsi; Laurence Oberman
> Subject: Re: [RFC] bypass scheduler for no sched is set
>
> On Wed, Jul 04, 2018 at 01:29:50PM +0530, Kashyap Desai wrote:
> > Hi,
> >
> > Ming Lei posted the patch series below, and performance improved for
> > the megaraid_sas driver. I used the same kernel base and found some
> > more possible performance improvements in the block layer. This RFC
> > improves performance as well as CPU utilization. If this patch fits
> > the design of blk-mq and scsi-mq, I can convert it into a PATCH and
> > submit the same or a modified version.
> >
> > https://marc.info/?l=linux-block&m=153062994403732&w=2
> >
> > Description of change -
> >
> > Do not insert the request into the software queue if BLK_MQ_F_NO_SCHED
> > is set. Submit the request from blk_mq_make_request() directly to the
> > low-level driver, as depicted by the call chain below.
> >
> > blk_mq_try_issue_directly
> >   __blk_mq_try_issue_directly
> >     scsi_queue_rq
>
> Hi Kashyap,
>
> With the patches[1] I sent you, which include 'global tags' support,
> MegaRAID is converted to per-node queues and becomes real MQ
> (q->nr_hw_queues > 1), so IO should already be issued directly; please
> see the '(plug && !blk_queue_nomerges(q))' and '(q->nr_hw_queues > 1 &&
> is_sync)' branches in blk_mq_make_request().
>
> But your test result does not seem to show any improvement, even though
> a huge improvement can be observed on null_blk/scsi_debug. Maybe
> something is wrong somewhere; a few days ago I talked with Laurence
> about doing this kind of test again.
>
> I will double-check the 'global tags' patches; meanwhile, could you or
> Laurence help check whether global tags[2] works as expected, if you'd
> like to?
>
> [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v5
> [2] https://github.com/ming1/linux/commits/v4.18-rc-host-tags-v8
> >
> > A low-level driver attached to scsi-mq can set BLK_MQ_F_NO_SCHED if it
> > does not want to benefit from an IO scheduler (e.g. in the case of SSDs
> > connected to an IT/MR controller). For HDDs connected to an HBA, the
> > driver can avoid setting BLK_MQ_F_NO_SCHED so that the default elevator
> > is set to mq-deadline.
>
> That might be one way to improve megaraid_sas.
>
> However, it turns out that even for modern high-performance devices such
> as NVMe, an IO scheduler (such as kyber) may play a big role in improving
> latency or throughput.
>
>
> Thanks,
> Ming

