* RE: Device or HBA level QD throttling creates randomness in sequetial workload
@ 2016-10-24 18:54 Kashyap Desai
  2016-10-26 20:56 ` Omar Sandoval
  2016-10-31 17:24 ` Jens Axboe
  0 siblings, 2 replies; 17+ messages in thread
From: Kashyap Desai @ 2016-10-24 18:54 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-scsi, linux-kernel, linux-block, axboe, Christoph Hellwig,
	paolo.valente

> -----Original Message-----
> From: Omar Sandoval [mailto:osandov@osandov.com]
> Sent: Monday, October 24, 2016 9:11 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> block@vger.kernel.org; axboe@kernel.dk; Christoph Hellwig;
> paolo.valente@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
sequetial
> workload
>
> On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > >
> > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > Hi -
> > > >
> > > > I found below conversation and it is on the same line as I wanted
> > > > some input from mailing list.
> > > >
> > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > >
> > > > I can do testing on any WIP item as Omar mentioned in above
> > discussion.
> > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> >
> > I tried build kernel using this repo, but looks like it is not allowed
> > to reboot due to some changes in <block> layer.
>
> Did you build the most up-to-date version of that branch? I've been
force
> pushing to it, so the commit id that you built would be useful.
> What boot failure are you seeing?

Below is the latest commit on the repo.
commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
Author: Omar Sandoval <osandov@fb.com>
Date:   Tue Sep 20 11:20:03 2016 -0700

    [WIP] blk-mq: limit bio queue depth

I have the latest 4.9/scsi-next repo maintained by Martin, which boots
fine. The only delta is that CONFIG_SBITMAP is enabled in the WIP
blk-mq-iosched branch. I could not capture any meaningful data on the boot
hang, so I am going to try one more time tomorrow.


>
> > >
> > > Are you using blk-mq for this disk? If not, then the work there
> > > won't
> > affect you.
> >
> > YES. I am using blk-mq for my test. I also confirm if use_blk_mq is
> > disable, Sequential work load issue is not seen and <cfq> scheduling
> > works well.
>
> Ah, okay, perfect. Can you send the fio job file you're using? Hard to
tell exactly
> what's going on without the details. A sequential workload with just one
> submitter is about as easy as it gets, so this _should_ be behaving
nicely.

<FIO script>

; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
I/O scheduler detail (the device is in blk-mq mode):
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the I/O
scheduler picked by the SCSI midlayer (SML) is <cfq>:
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]

With blk-mq, performance is very low for the sequential write workload,
and I confirm that blk-mq converts the sequential workload into a random
stream because of the I/O scheduler difference between blk-mq and the
legacy block layer.

>
> > >
> > > > Is there any workaround/alternative in latest upstream kernel, if
> > > > user wants to see limited penalty  for Sequential Work load on HDD
?
> > > >
> > > > ` Kashyap
> > > >
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail
seems to
> mark just about everything I get from Broadcom as spam due to failed
DMARC.
>
> --
> Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-24 18:54 Device or HBA level QD throttling creates randomness in sequetial workload Kashyap Desai
@ 2016-10-26 20:56 ` Omar Sandoval
  2016-10-31 17:24 ` Jens Axboe
  1 sibling, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-10-26 20:56 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: linux-scsi, linux-kernel, linux-block, axboe, Christoph Hellwig,
	paolo.valente

On Tue, Oct 25, 2016 at 12:24:24AM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Omar Sandoval [mailto:osandov@osandov.com]
> > Sent: Monday, October 24, 2016 9:11 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> > block@vger.kernel.org; axboe@kernel.dk; Christoph Hellwig;
> > paolo.valente@linaro.org
> > Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial
> > workload
> >
> > On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > > >
> > > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > > Hi -
> > > > >
> > > > > I found below conversation and it is on the same line as I wanted
> > > > > some input from mailing list.
> > > > >
> > > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > > >
> > > > > I can do testing on any WIP item as Omar mentioned in above
> > > discussion.
> > > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> > >
> > > I tried build kernel using this repo, but looks like it is not allowed
> > > to reboot due to some changes in <block> layer.
> >
> > Did you build the most up-to-date version of that branch? I've been
> force
> > pushing to it, so the commit id that you built would be useful.
> > What boot failure are you seeing?
> 
> Below  is latest commit on repo.
> commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
> Author: Omar Sandoval <osandov@fb.com>
> Date:   Tue Sep 20 11:20:03 2016 -0700
> 
>     [WIP] blk-mq: limit bio queue depth
> 
> I have latest repo from 4.9/scsi-next maintained by Martin which boots
> fine.  Only Delta is  " CONFIG_SBITMAP" is enabled in WIP blk-mq-iosched
> branch. I could not see any meaningful data on boot hang, so going to try
> one more time tomorrow.

The blk-mq-bio-queueing branch has the latest work there separated out.
Not sure that it'll help in this case.

> >
> > > >
> > > > Are you using blk-mq for this disk? If not, then the work there
> > > > won't
> > > affect you.
> > >
> > > YES. I am using blk-mq for my test. I also confirm if use_blk_mq is
> > > disable, Sequential work load issue is not seen and <cfq> scheduling
> > > works well.
> >
> > Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> tell exactly
> > what's going on without the details. A sequential workload with just one
> > submitter is about as easy as it gets, so this _should_ be behaving
> nicely.
> 
> <FIO script>
> 
> ; setup numa policy for each thread
> ; 'numactl --show' to determine the maximum numa nodes
> [global]
> ioengine=libaio
> buffered=0
> rw=write
> bssplit=4K/100
> iodepth=256
> numjobs=1
> direct=1
> runtime=60s
> allow_mounted_write=0
> 
> [job1]
> filename=/dev/sdd
> ..
> [job24]
> filename=/dev/sdaa

Okay, so you have one high-iodepth job per disk, got it.

> When I tune /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is a
> ioscheduler detail. (It is in blk-mq mode. )
> /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:
> 2:13:0/block/sdq/queue/scheduler:none
> 
> When I have set /sys/module/scsi_mod/parameters/use_blk_mq = 0,
> ioscheduler picked by SML is <cfq>.
> /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:
> 2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]
> 
> I see in blk-mq performance is very low for Sequential Write work load and
> I confirm that blk-mq convert Sequential work load into random stream due
> to  io-scheduler change in blk-mq vs legacy block layer.

Since this happens when the fio iodepth exceeds the per-device QD, my
best guess is that requests are getting requeued and scrambled when that
happens. Do you have the blktrace lying around?

> > > > > Is there any workaround/alternative in latest upstream kernel, if
> > > > > user wants to see limited penalty  for Sequential Work load on HDD
> ?
> > > > >
> > > > > ` Kashyap
> > > > >
> >
> > P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> seems to
> > mark just about everything I get from Broadcom as spam due to failed
> DMARC.
> >
> > --
> > Omar

-- 
Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-24 18:54 Device or HBA level QD throttling creates randomness in sequetial workload Kashyap Desai
  2016-10-26 20:56 ` Omar Sandoval
@ 2016-10-31 17:24 ` Jens Axboe
  2016-11-01  5:40   ` Kashyap Desai
  2017-01-30 13:52   ` Kashyap Desai
  1 sibling, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2016-10-31 17:24 UTC (permalink / raw)
  To: Kashyap Desai, Omar Sandoval
  Cc: linux-scsi, linux-kernel, linux-block, Christoph Hellwig, paolo.valente

Hi,

One guess would be that this isn't around a requeue condition, but
rather the fact that we don't really guarantee any sort of hard FIFO
behavior between the software queues. Can you try this test patch to see
if it changes the behavior for you? Warning: untested...

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3d27a6dee09..5404ca9c71b2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
  }

+static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct request *rqa = container_of(a, struct request, queuelist);
+	struct request *rqb = container_of(b, struct request, queuelist);
+
+	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
+}
+
  /*
   * Run this hardware queue, pulling any software queues mapped to it in.
   * Note that this function currently has various problems around ordering
@@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  	}

  	/*
+	 * If the device is rotational, sort the list sanely to avoid
+	 * unecessary seeks. The software queues are roughly FIFO, but
+	 * only roughly, there are no hard guarantees.
+	 */
+	if (!blk_queue_nonrot(q))
+		list_sort(NULL, &rq_list, rq_pos_cmp);
+
+	/*
  	 * Start off with dptr being NULL, so we start the first request
  	 * immediately, even if we have more pending.
  	 */
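
To illustrate the idea outside the kernel, here is a minimal userspace
sketch of the same sort-by-start-sector step. The struct and comparator
below are simplified stand-ins for struct request and blk_rq_pos(), not
the kernel code:

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct request; only the start sector matters here. */
struct fake_rq {
	unsigned long long sector;	/* analogous to blk_rq_pos() */
};

/* Ascending by start sector, so a rotational disk sees an LBA-ordered batch. */
static int fake_rq_pos_cmp(const void *a, const void *b)
{
	const struct fake_rq *rqa = a, *rqb = b;

	if (rqa->sector < rqb->sector)
		return -1;
	return rqa->sector > rqb->sector;
}

int main(void)
{
	/* A "roughly FIFO" batch: mostly ordered, with two entries scrambled. */
	struct fake_rq batch[] = {
		{ 148800 }, { 148816 }, { 148808 }, { 148824 }, { 147432 },
	};
	size_t i, n = sizeof(batch) / sizeof(batch[0]);

	qsort(batch, n, sizeof(batch[0]), fake_rq_pos_cmp);

	for (i = 0; i < n; i++)
		printf("dispatch sector %llu\n", batch[i].sector);
	return 0;
}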

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-31 17:24 ` Jens Axboe
@ 2016-11-01  5:40   ` Kashyap Desai
  2017-01-30 13:52   ` Kashyap Desai
  1 sibling, 0 replies; 17+ messages in thread
From: Kashyap Desai @ 2016-11-01  5:40 UTC (permalink / raw)
  To: Jens Axboe, Omar Sandoval
  Cc: linux-scsi, linux-kernel, linux-block, Christoph Hellwig, paolo.valente

Jens- Replied inline.


Omar - I tested your WIP repo and figured out that the system hangs only
if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
fine, but I need scsi_mod.use_blk_mq=Y.

Also, below is a snippet of blktrace output. With a higher per-device QD,
I see requeue requests in blktrace.

65,128 10     6268     2.432404509 18594  P   N [fio]
 65,128 10     6269     2.432405013 18594  U   N [fio] 1
 65,128 10     6270     2.432405143 18594  I  WS 148800 + 8 [fio]
 65,128 10     6271     2.432405740 18594  R  WS 148800 + 8 [0]
 65,128 10     6272     2.432409794 18594  Q  WS 148808 + 8 [fio]
 65,128 10     6273     2.432410234 18594  G  WS 148808 + 8 [fio]
 65,128 10     6274     2.432410424 18594  S  WS 148808 + 8 [fio]
 65,128 23     3626     2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
 65,128 22     3279     2.432973482     0  C  WS 147432 + 8 [0]
 65,128  7     6126     2.433032637 18594  P   N [fio]
 65,128  7     6127     2.433033204 18594  U   N [fio] 1
 65,128  7     6128     2.433033346 18594  I  WS 148808 + 8 [fio]
 65,128  7     6129     2.433033871 18594  D  WS 148808 + 8 [fio]
 65,128  7     6130     2.433034559 18594  R  WS 148808 + 8 [0]
 65,128  7     6131     2.433039796 18594  Q  WS 148816 + 8 [fio]
 65,128  7     6132     2.433040206 18594  G  WS 148816 + 8 [fio]
 65,128  7     6133     2.433040351 18594  S  WS 148816 + 8 [fio]
 65,128  9     6392     2.433133729     0  C  WS 147240 + 8 [0]
 65,128  9     6393     2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
 65,128  7     6134     2.433167450 18594  P   N [fio]
 65,128  7     6135     2.433167911 18594  U   N [fio] 1
 65,128  7     6136     2.433168074 18594  I  WS 148816 + 8 [fio]
 65,128  7     6137     2.433168492 18594  D  WS 148816 + 8 [fio]
 65,128  7     6138     2.433174016 18594  Q  WS 148824 + 8 [fio]
 65,128  7     6139     2.433174282 18594  G  WS 148824 + 8 [fio]
 65,128  7     6140     2.433174613 18594  S  WS 148824 + 8 [fio]
CPU0 (sdy):
 Reads Queued:           0,        0KiB  Writes Queued:          79,      316KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:       67, 18,446,744,073PiB
 Reads Requeued:         0               Writes Requeued:        86
 Reads Completed:        0,        0KiB  Writes Completed:       98,      392KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             5
 IO unplugs:            79               Timer unplugs:           0



` Kashyap

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> block@vger.kernel.org; Christoph Hellwig; paolo.valente@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial
> workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but rather
> the
> fact that we don't really guarantee any sort of hard FIFO behavior between
> the
> software queues. Can you try this test patch to see if it changes the
> behavior for
> you? Warning: untested...

Jens - I tested the patch, but I still see a random IO pattern where a
sequential run is expected. I am intentionally running the requeue case
and I see the issue at the time of requeue.
If there is no requeue, I see no issue at the LLD.


>
> diff --git a/block/blk-mq.c b/block/blk-mq.c index
> f3d27a6dee09..5404ca9c71b2
> 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned
> int
> queued)
>   	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>   }
>
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head
> +*b) {
> +	struct request *rqa = container_of(a, struct request, queuelist);
> +	struct request *rqb = container_of(b, struct request, queuelist);
> +
> +	return blk_rq_pos(rqa) < blk_rq_pos(rqb); }
> +
>   /*
>    * Run this hardware queue, pulling any software queues mapped to it in.
>    * Note that this function currently has various problems around
> ordering @@ -
> 812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx
> *hctx)
>   	}
>
>   	/*
> +	 * If the device is rotational, sort the list sanely to avoid
> +	 * unecessary seeks. The software queues are roughly FIFO, but
> +	 * only roughly, there are no hard guarantees.
> +	 */
> +	if (!blk_queue_nonrot(q))
> +		list_sort(NULL, &rq_list, rq_pos_cmp);
> +
> +	/*
>   	 * Start off with dptr being NULL, so we start the first request
>   	 * immediately, even if we have more pending.
>   	 */
>
> --
> Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-31 17:24 ` Jens Axboe
  2016-11-01  5:40   ` Kashyap Desai
@ 2017-01-30 13:52   ` Kashyap Desai
  2017-01-30 16:30       ` Bart Van Assche
  1 sibling, 1 reply; 17+ messages in thread
From: Kashyap Desai @ 2017-01-30 13:52 UTC (permalink / raw)
  To: Jens Axboe, Omar Sandoval
  Cc: linux-scsi, linux-kernel, linux-block, Christoph Hellwig, paolo.valente

Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and confirm that the issue
reported in this thread is resolved.

Now I see that both MQ and SQ modes result in a sequential IO pattern
while IO is being re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it
reasonable to pause IO for a few usec in the LLD?
I want to avoid the driver asking the SML/block layer to re-queue the IO
(when it is sequential IO on rotational media).

Explaining w.r.t. the megaraid_sas driver: the driver exposes can_queue,
but it internally consumes commands for RAID 1 and the fast path.
In the worst case, can_queue/2 outstanding commands will consume all
firmware resources, and the driver will re-queue further IOs to the SML
as below:

   if (atomic_inc_return(&instance->fw_outstanding) >
           instance->host->can_queue) {
       atomic_dec(&instance->fw_outstanding);
       return SCSI_MLQUEUE_HOST_BUSY;
   }

I want to avoid returning SCSI_MLQUEUE_HOST_BUSY as shown above.
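
For context, that check is the usual pattern of reserving a slot with an
atomic counter and rolling the reservation back when the cap is exceeded.
A standalone C11 sketch of the pattern (the names and the cap are
hypothetical, not the megaraid_sas code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define HOST_CAN_QUEUE 4065		/* assumed cap, mirroring host->can_queue */

static atomic_int fw_outstanding;	/* analogous to instance->fw_outstanding */

/* Try to reserve one firmware slot; on failure the caller would requeue
 * the IO (i.e. return SCSI_MLQUEUE_HOST_BUSY in the real driver).
 */
static bool reserve_fw_slot(void)
{
	if (atomic_fetch_add(&fw_outstanding, 1) + 1 > HOST_CAN_QUEUE) {
		atomic_fetch_sub(&fw_outstanding, 1);	/* roll back the reservation */
		return false;
	}
	return true;
}

/* Release the slot when the command completes. */
static void release_fw_slot(void)
{
	atomic_fetch_sub(&fw_outstanding, 1);
}

int main(void)
{
	int granted = 0, denied = 0;

	for (int i = 0; i < HOST_CAN_QUEUE + 10; i++) {
		if (reserve_fw_slot())
			granted++;
		else
			denied++;
	}
	printf("granted=%d denied=%d\n", granted, denied);

	/* Drain the reservations again. */
	for (int i = 0; i < granted; i++)
		release_fw_slot();
	return 0;
}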

I need your suggestions on the changes below:

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_dbg.h>
 #include <linux/dmi.h>
+#include <linux/cpumask.h>

 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
    struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
    union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
    u32 index;
-   struct fusion_context *fusion;
+   bool    is_nonrot;
+   u32 safe_can_queue;
+   u32 num_cpus;
+   struct fusion_context *fusion;
+
+   fusion = instance->ctrl_context;
+
+   num_cpus = num_online_cpus();
+   safe_can_queue = instance->cur_can_queue - num_cpus;

    fusion = instance->ctrl_context;

@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
        return SCSI_MLQUEUE_DEVICE_BUSY;
    }

-   if (atomic_inc_return(&instance->fw_outstanding) >
-           instance->host->can_queue) {
-       atomic_dec(&instance->fw_outstanding);
-       return SCSI_MLQUEUE_HOST_BUSY;
-   }
+   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+       /* For rotational device wait for sometime to get fusion command from pool.
+        * This is just to reduce proactive re-queue at mid layer which is not
+        * sending sorted IO in SCSI.MQ mode.
+        */
+       if (!is_nonrot)
+           udelay(100);
+   }

    cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

` Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.desai@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-scsi@vger.kernel.org'; 'linux-kernel@vger.kernel.org'; 'linux-
> block@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.valente@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> Jens- Replied inline.
>
>
> Omar -  I tested your WIP repo and figure out System hangs only if I pass
> "
> scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine, but I
> am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is snippet of blktrace. In case of higher per device QD, I see
> Requeue request in blktrace.
>
> 65,128 10     6268     2.432404509 18594  P   N [fio]
>  65,128 10     6269     2.432405013 18594  U   N [fio] 1
>  65,128 10     6270     2.432405143 18594  I  WS 148800 + 8 [fio]
>  65,128 10     6271     2.432405740 18594  R  WS 148800 + 8 [0]
>  65,128 10     6272     2.432409794 18594  Q  WS 148808 + 8 [fio]
>  65,128 10     6273     2.432410234 18594  G  WS 148808 + 8 [fio]
>  65,128 10     6274     2.432410424 18594  S  WS 148808 + 8 [fio]
>  65,128 23     3626     2.432432595 16232  D  WS 148800 + 8
> [kworker/23:1H]
>  65,128 22     3279     2.432973482     0  C  WS 147432 + 8 [0]
>  65,128  7     6126     2.433032637 18594  P   N [fio]
>  65,128  7     6127     2.433033204 18594  U   N [fio] 1
>  65,128  7     6128     2.433033346 18594  I  WS 148808 + 8 [fio]
>  65,128  7     6129     2.433033871 18594  D  WS 148808 + 8 [fio]
>  65,128  7     6130     2.433034559 18594  R  WS 148808 + 8 [0]
>  65,128  7     6131     2.433039796 18594  Q  WS 148816 + 8 [fio]
>  65,128  7     6132     2.433040206 18594  G  WS 148816 + 8 [fio]
>  65,128  7     6133     2.433040351 18594  S  WS 148816 + 8 [fio]
>  65,128  9     6392     2.433133729     0  C  WS 147240 + 8 [0]
>  65,128  9     6393     2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
>  65,128  7     6134     2.433167450 18594  P   N [fio]
>  65,128  7     6135     2.433167911 18594  U   N [fio] 1
>  65,128  7     6136     2.433168074 18594  I  WS 148816 + 8 [fio]
>  65,128  7     6137     2.433168492 18594  D  WS 148816 + 8 [fio]
>  65,128  7     6138     2.433174016 18594  Q  WS 148824 + 8 [fio]
>  65,128  7     6139     2.433174282 18594  G  WS 148824 + 8 [fio]
>  65,128  7     6140     2.433174613 18594  S  WS 148824 + 8 [fio]
> CPU0 (sdy):
>  Reads Queued:           0,        0KiB  Writes Queued:          79,
> 316KiB
>  Read Dispatches:        0,        0KiB  Write Dispatches:       67,
> 18,446,744,073PiB
>  Reads Requeued:         0               Writes Requeued:        86
>  Reads Completed:        0,        0KiB  Writes Completed:       98,
> 392KiB
>  Read Merges:            0,        0KiB  Write Merges:            0,
> 0KiB
>  Read depth:             0               Write depth:             5
>  IO unplugs:            79               Timer unplugs:           0
>
>
>
> ` Kashyap
>
> > -----Original Message-----
> > From: Jens Axboe [mailto:axboe@kernel.dk]
> > Sent: Monday, October 31, 2016 10:54 PM
> > To: Kashyap Desai; Omar Sandoval
> > Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> > block@vger.kernel.org; Christoph Hellwig; paolo.valente@linaro.org
> > Subject: Re: Device or HBA level QD throttling creates randomness in
> > sequetial workload
> >
> > Hi,
> >
> > One guess would be that this isn't around a requeue condition, but
> > rather the fact that we don't really guarantee any sort of hard FIFO
> > behavior between the software queues. Can you try this test patch to
> > see if it changes the behavior for you? Warning: untested...
>
> Jens - I tested the patch, but I still see random IO pattern for expected
> Sequential Run. I am intentionally running case of Re-queue  and seeing
> issue at the time of Re-queue.
> If there is no Requeue, I see no issue at LLD.
>
>
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > f3d27a6dee09..5404ca9c71b2
> > 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -772,6 +772,14 @@ static inline unsigned int
> > queued_to_index(unsigned int
> > queued)
> >   	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >   }
> >
> > +static int rq_pos_cmp(void *priv, struct list_head *a, struct
> > +list_head
> > +*b) {
> > +	struct request *rqa = container_of(a, struct request, queuelist);
> > +	struct request *rqb = container_of(b, struct request, queuelist);
> > +
> > +	return blk_rq_pos(rqa) < blk_rq_pos(rqb); }
> > +
> >   /*
> >    * Run this hardware queue, pulling any software queues mapped to it
> > in.
> >    * Note that this function currently has various problems around
> > ordering @@ -
> > 812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct
> > blk_mq_hw_ctx
> > *hctx)
> >   	}
> >
> >   	/*
> > +	 * If the device is rotational, sort the list sanely to avoid
> > +	 * unecessary seeks. The software queues are roughly FIFO, but
> > +	 * only roughly, there are no hard guarantees.
> > +	 */
> > +	if (!blk_queue_nonrot(q))
> > +		list_sort(NULL, &rq_list, rq_pos_cmp);
> > +
> > +	/*
> >   	 * Start off with dptr being NULL, so we start the first request
> >   	 * immediately, even if we have more pending.
> >   	 */
> >
> > --
> > Jens Axboe

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
@ 2017-01-30 16:30       ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2017-01-30 16:30 UTC (permalink / raw)
  To: osandov, kashyap.desai, axboe
  Cc: linux-scsi, linux-kernel, hch, linux-block, paolo.valente

On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> -   if (atomic_inc_return(&instance->fw_outstanding) >
> -           instance->host->can_queue) {
> -       atomic_dec(&instance->fw_outstanding);
> -       return SCSI_MLQUEUE_HOST_BUSY;
> -   }
> +   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> +       /* For rotational device wait for sometime to get fusion command
> from pool.
> +        * This is just to reduce proactive re-queue at mid layer which is
> not
> +        * sending sorted IO in SCSI.MQ mode.
> +        */
> +       if (!is_nonrot)
> +           udelay(100);
> +   }

The SCSI core does not allow sleeping inside the queuecommand() callback
function.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2017-01-30 16:30       ` Bart Van Assche
  (?)
@ 2017-01-30 16:32       ` Jens Axboe
  2017-01-30 18:28         ` Kashyap Desai
  -1 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2017-01-30 16:32 UTC (permalink / raw)
  To: Bart Van Assche, osandov, kashyap.desai
  Cc: linux-scsi, linux-kernel, hch, linux-block, paolo.valente

On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
>> -   if (atomic_inc_return(&instance->fw_outstanding) >
>> -           instance->host->can_queue) {
>> -       atomic_dec(&instance->fw_outstanding);
>> -       return SCSI_MLQUEUE_HOST_BUSY;
>> -   }
>> +   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
>> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
>> +       /* For rotational device wait for sometime to get fusion command
>> from pool.
>> +        * This is just to reduce proactive re-queue at mid layer which is
>> not
>> +        * sending sorted IO in SCSI.MQ mode.
>> +        */
>> +       if (!is_nonrot)
>> +           udelay(100);
>> +   }
> 
> The SCSI core does not allow to sleep inside the queuecommand() callback
> function.

udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a great idea. We want to fix the reordering due to requeues, not
introduce random busy delays to work around it.
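
A rough userspace analogue of that distinction, with clock_gettime() and
nanosleep() standing in for the kernel's udelay()/usleep_range() (the
kernel primitives differ in detail; this is only a sketch):

#include <stdio.h>
#include <time.h>

/* Busy-wait for roughly 'us' microseconds: never yields the CPU (udelay()-like). */
static void busy_delay_us(long us)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000L +
		 (now.tv_nsec - start.tv_nsec) / 1000L < us);
}

/* Sleeping delay: the thread blocks so the CPU can run something else. */
static void sleep_delay_us(long us)
{
	struct timespec ts = { .tv_sec = us / 1000000L,
			       .tv_nsec = (us % 1000000L) * 1000L };

	nanosleep(&ts, NULL);
}

int main(void)
{
	busy_delay_us(100);	/* burns CPU for ~100us, as udelay() would */
	sleep_delay_us(100);	/* gives the CPU up for ~100us */
	puts("done");
	return 0;
}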

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Device or HBA level QD throttling creates randomness in sequetial workload
  2017-01-30 16:32       ` Jens Axboe
@ 2017-01-30 18:28         ` Kashyap Desai
  2017-01-30 18:29           ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Kashyap Desai @ 2017-01-30 18:28 UTC (permalink / raw)
  To: Jens Axboe, Bart Van Assche, osandov
  Cc: linux-scsi, linux-kernel, hch, linux-block, paolo.valente

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osandov@osandov.com; kashyap.desai@broadcom.com
> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org;
> hch@infradead.org; linux-block@vger.kernel.org; paolo.valente@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -   if (atomic_inc_return(&instance->fw_outstanding) >
> >> -           instance->host->can_queue) {
> >> -       atomic_dec(&instance->fw_outstanding);
> >> -       return SCSI_MLQUEUE_HOST_BUSY;
> >> -   }
> >> +   if (atomic_inc_return(&instance->fw_outstanding) >
safe_can_queue) {
> >> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +       /* For rotational device wait for sometime to get fusion
> >> + command
> >> from pool.
> >> +        * This is just to reduce proactive re-queue at mid layer
> >> + which is
> >> not
> >> +        * sending sorted IO in SCSI.MQ mode.
> >> +        */
> >> +       if (!is_nonrot)
> >> +           udelay(100);
> >> +   }
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a
> great idea. We want to fix the reordering due to requeues, not introduce
> random busy delays to work around it.

Thanks for the feedback. I realize that udelay() is going to be very odd
in the queuecommand callback; I will keep that in mind. The preferred
solution is the blk-mq scheduler patches.
>
> --
> Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2017-01-30 18:28         ` Kashyap Desai
@ 2017-01-30 18:29           ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2017-01-30 18:29 UTC (permalink / raw)
  To: Kashyap Desai, Bart Van Assche, osandov
  Cc: linux-scsi, linux-kernel, hch, linux-block, paolo.valente

On 01/30/2017 11:28 AM, Kashyap Desai wrote:
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Monday, January 30, 2017 10:03 PM
>> To: Bart Van Assche; osandov@osandov.com; kashyap.desai@broadcom.com
>> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org;
>> hch@infradead.org; linux-block@vger.kernel.org; paolo.valente@linaro.org
>> Subject: Re: Device or HBA level QD throttling creates randomness in
>> sequetial workload
>>
>> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
>>> On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
>>>> -   if (atomic_inc_return(&instance->fw_outstanding) >
>>>> -           instance->host->can_queue) {
>>>> -       atomic_dec(&instance->fw_outstanding);
>>>> -       return SCSI_MLQUEUE_HOST_BUSY;
>>>> -   }
>>>> +   if (atomic_inc_return(&instance->fw_outstanding) >
> safe_can_queue) {
>>>> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
>>>> +       /* For rotational device wait for sometime to get fusion
>>>> + command
>>>> from pool.
>>>> +        * This is just to reduce proactive re-queue at mid layer
>>>> + which is
>>>> not
>>>> +        * sending sorted IO in SCSI.MQ mode.
>>>> +        */
>>>> +       if (!is_nonrot)
>>>> +           udelay(100);
>>>> +   }
>>>
>>> The SCSI core does not allow to sleep inside the queuecommand()
>>> callback function.
>>
>> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
> NOT a
>> great idea. We want to fix the reordering due to requeues, not introduce
>> random busy delays to work around it.
> 
> Thanks for feedback. I do realize that udelay() is going to be very odd
> in queue_command call back.   I will keep this note. Preferred solution is
> blk mq scheduler patches.

It's coming in 4.11, so you don't have to wait long.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-24 13:05   ` Kashyap Desai
@ 2016-10-24 15:41     ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-10-24 15:41 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: linux-scsi, linux-kernel, linux-block, axboe, Christoph Hellwig,
	paolo.valente

On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> >
> > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > Hi -
> > >
> > > I found below conversation and it is on the same line as I wanted some
> > > input from mailing list.
> > >
> > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > >
> > > I can do testing on any WIP item as Omar mentioned in above
> discussion.
> > > https://github.com/osandov/linux/tree/blk-mq-iosched
> 
> I tried build kernel using this repo, but looks like it is not allowed to
> reboot due to some changes in <block> layer.

Did you build the most up-to-date version of that branch? I've been
force pushing to it, so the commit id that you built would be useful.
What boot failure are you seeing?

> >
> > Are you using blk-mq for this disk? If not, then the work there won't
> affect you.
> 
> YES. I am using blk-mq for my test. I also confirm if use_blk_mq is
> disable, Sequential work load issue is not seen and <cfq> scheduling works
> well.

Ah, okay, perfect. Can you send the fio job file you're using? Hard to
tell exactly what's going on without the details. A sequential workload
with just one submitter is about as easy as it gets, so this _should_ be
behaving nicely.

> >
> > > Is there any workaround/alternative in latest upstream kernel, if user
> > > wants to see limited penalty  for Sequential Work load on HDD ?
> > >
> > > ` Kashyap
> > >

P.S., your emails are being marked as spam by Gmail. Actually, Gmail
seems to mark just about everything I get from Broadcom as spam due to
failed DMARC.

-- 
Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-21 21:31 ` Omar Sandoval
  2016-10-22 15:04   ` Kashyap Desai
@ 2016-10-24 13:05   ` Kashyap Desai
  2016-10-24 15:41     ` Omar Sandoval
  1 sibling, 1 reply; 17+ messages in thread
From: Kashyap Desai @ 2016-10-24 13:05 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-scsi, linux-kernel, linux-block, axboe, Christoph Hellwig,
	paolo.valente

>
> On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found below conversation and it is on the same line as I wanted some
> > input from mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in above
discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched

I tried building a kernel using this repo, but it looks like it fails to
boot due to some changes in the <block> layer.

>
> Are you using blk-mq for this disk? If not, then the work there won't
affect you.

YES, I am using blk-mq for my test. I also confirm that if use_blk_mq is
disabled, the sequential workload issue is not seen and <cfq> scheduling
works well.

>
> > Is there any workaround/alternative in latest upstream kernel, if user
> > wants to see limited penalty  for Sequential Work load on HDD ?
> >
> > ` Kashyap
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-21 21:31 ` Omar Sandoval
@ 2016-10-22 15:04   ` Kashyap Desai
  2016-10-24 13:05   ` Kashyap Desai
  1 sibling, 0 replies; 17+ messages in thread
From: Kashyap Desai @ 2016-10-22 15:04 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: paolo.valente, Christoph Hellwig, axboe, linux-kernel,
	linux-scsi, linux-block

On 22-Oct-2016 3:01 AM, "Omar Sandoval" <osandov@osandov.com> wrote:
>
> On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found below conversation and it is on the same line as I wanted some
> > input from mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in above discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched
>
> Are you using blk-mq for this disk? If not, then the work there won't
> affect you.
Yes I am using blk-mq.
>
> > Is there any workaround/alternative in latest upstream kernel, if user
> > wants to see limited penalty  for Sequential Work load on HDD ?
> >
> > ` Kashyap
> >
> > > -----Original Message-----
> > > From: Kashyap Desai [mailto:kashyap.desai@broadcom.com]
> > > Sent: Thursday, October 20, 2016 3:39 PM
> > > To: linux-scsi@vger.kernel.org
> > > Subject: Device or HBA level QD throttling creates randomness in
> > sequetial
> > > workload
> > >
> > > [ Apologize, if you find more than one instance of my email.
> > > Web based email client has some issue, so now trying git send mail.]
> > >
> > > Hi,
> > >
> > > I am doing some performance tuning in MR driver to understand how sdev
> > > queue depth and hba queue depth play role in IO submission from above
> > layer.
> > > I have 24 JBOD connected to MR 12GB controller and I can see
performance
> > for
> > > 4K Sequential work load as below.
> > >
> > > HBA QD for MR controller is 4065 and Per device QD is set to 32
> > >
> > > queue depth from <fio> 256 reports 300K IOPS queue depth from <fio>
128
> > > reports 330K IOPS queue depth from <fio> 64 reports 360K IOPS queue
> > depth
> > > from <fio> 32 reports 510K IOPS
> > >
> > > In MR driver I added debug print and confirm that more IO come to
driver
> > as
> > > random IO whenever I have <fio> queue depth more than 32.
> > >
> > > I have debug using scsi logging level and blktrace as well. Below is
> > snippet of
> > > logs using scsi logging level.  In summary, if SML do flow control of
IO
> > due to
> > > Device QD or HBA QD, IO coming to LLD is more random pattern.
> > >
> > > I see IO coming to driver is not sequential.
> > >
> > > [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03
c0
> > 3b 00
> > > 00 01 00 [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a
00
> > 00 03
> > > c0 3c 00 00 01 00 [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB:
> > Write(10) 2a
> > > 00 00 03 c0 5b 00 00 01 00
> > >
> > > <KD> After LBA "00 03 c0 3c" next command is with LBA "00 03 c0 5b".
> > > Two Sequence are overlapped due to sdev QD throttling.
> > >
> > > [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03
c0
> > 5c 00
> > > 00 01 00 [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a
00
> > 00 03
> > > c0 3d 00 00 01 00 [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB:
> > Write(10) 2a
> > > 00 00 03 c0 5d 00 00 01 00 [79546.912259] sd 18:2:21:0: [sdy] tag#857
> > CDB:
> > > Write(10) 2a 00 00 03 c0 3e 00 00 01 00 [79546.912268] sd 18:2:21:0:
> > [sdy]
> > > tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
> > >
> > >  If scsi_request_fn() breaks due to unavailability of device queue
(due
> > to below
> > > check), will there be any side defect as I observe ?
> > >                 if (!scsi_dev_queue_ready(q, sdev))
> > >                              break;
> > >
> > > If I reduce HBA QD and make sure IO from above layer is throttled due
to
> > HBA
> > > QD, there is a same impact.
> > > MR driver use host wide shared tag map.
> > >
> > > Can someone help me if this can be tunable in LLD providing additional
> > settings
> > > or it is expected behavior ? Problem I am facing is, I am not able to
> > figure out
> > > optimal device queue depth for different configuration and work load.
> > >
> > > Thanks, Kashyap
>
> --
> Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Device or HBA level QD throttling creates randomness in sequetial workload
  2016-10-21 12:13 Kashyap Desai
@ 2016-10-21 21:31 ` Omar Sandoval
  2016-10-22 15:04   ` Kashyap Desai
  2016-10-24 13:05   ` Kashyap Desai
  0 siblings, 2 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-10-21 21:31 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: linux-scsi, linux-kernel, linux-block, axboe, Christoph Hellwig,
	paolo.valente

On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> Hi -
> 
> I found below conversation and it is on the same line as I wanted some
> input from mailing list.
> 
> http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> 
> I can do testing on any WIP item as Omar mentioned in above discussion.
> https://github.com/osandov/linux/tree/blk-mq-iosched

Are you using blk-mq for this disk? If not, then the work there won't
affect you.

> Is there any workaround/alternative in latest upstream kernel, if user
> wants to see limited penalty  for Sequential Work load on HDD ?
> 
> ` Kashyap
> 
> > -----Original Message-----
> > From: Kashyap Desai [mailto:kashyap.desai@broadcom.com]
> > Sent: Thursday, October 20, 2016 3:39 PM
> > To: linux-scsi@vger.kernel.org
> > Subject: Device or HBA level QD throttling creates randomness in
> sequetial
> > workload
> >
> > [ Apologize, if you find more than one instance of my email.
> > Web based email client has some issue, so now trying git send mail.]
> >
> > Hi,
> >
> > I am doing some performance tuning in MR driver to understand how sdev
> > queue depth and hba queue depth play role in IO submission from above
> layer.
> > I have 24 JBOD connected to MR 12GB controller and I can see performance
> for
> > 4K Sequential work load as below.
> >
> > HBA QD for MR controller is 4065 and Per device QD is set to 32
> >
> > queue depth from <fio> 256 reports 300K IOPS queue depth from <fio> 128
> > reports 330K IOPS queue depth from <fio> 64 reports 360K IOPS queue
> depth
> > from <fio> 32 reports 510K IOPS
> >
> > In MR driver I added debug print and confirm that more IO come to driver
> as
> > random IO whenever I have <fio> queue depth more than 32.
> >
> > I have debug using scsi logging level and blktrace as well. Below is
> snippet of
> > logs using scsi logging level.  In summary, if SML do flow control of IO
> due to
> > Device QD or HBA QD, IO coming to LLD is more random pattern.
> >
> > I see IO coming to driver is not sequential.
> >
> > [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0
> 3b 00
> > 00 01 00 [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00
> 00 03
> > c0 3c 00 00 01 00 [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB:
> Write(10) 2a
> > 00 00 03 c0 5b 00 00 01 00
> >
> > <KD> After LBA "00 03 c0 3c" next command is with LBA "00 03 c0 5b".
> > Two Sequence are overlapped due to sdev QD throttling.
> >
> > [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0
> 5c 00
> > 00 01 00 [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00
> 00 03
> > c0 3d 00 00 01 00 [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB:
> Write(10) 2a
> > 00 00 03 c0 5d 00 00 01 00 [79546.912259] sd 18:2:21:0: [sdy] tag#857
> CDB:
> > Write(10) 2a 00 00 03 c0 3e 00 00 01 00 [79546.912268] sd 18:2:21:0:
> [sdy]
> > tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
> >
> >  If scsi_request_fn() breaks due to unavailability of device queue (due
> to below
> > check), will there be any side defect as I observe ?
> >                 if (!scsi_dev_queue_ready(q, sdev))
> >                              break;
> >
> > If I reduce HBA QD and make sure IO from above layer is throttled due to
> HBA
> > QD, there is a same impact.
> > MR driver use host wide shared tag map.
> >
> > Can someone help me if this can be tunable in LLD providing additional
> settings
> > or it is expected behavior ? Problem I am facing is, I am not able to
> figure out
> > optimal device queue depth for different configuration and work load.
> >
> > Thanks, Kashyap

-- 
Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Device or HBA level QD throttling creates randomness in sequetial workload
@ 2016-10-21 12:13 Kashyap Desai
  2016-10-21 21:31 ` Omar Sandoval
  0 siblings, 1 reply; 17+ messages in thread
From: Kashyap Desai @ 2016-10-21 12:13 UTC (permalink / raw)
  To: linux-scsi, linux-kernel, linux-block
  Cc: axboe, Christoph Hellwig, paolo.valente, osandov

Hi -

I found the conversation below, and it is along the same lines as the
input I wanted from the mailing list.

http://marc.info/?l=linux-kernel&m=147569860526197&w=2

I can do testing on any WIP item, as Omar mentioned in the above discussion.
https://github.com/osandov/linux/tree/blk-mq-iosched

Is there any workaround/alternative in the latest upstream kernel if a
user wants to see a limited penalty for a sequential workload on HDD?

` Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.desai@broadcom.com]
> Sent: Thursday, October 20, 2016 3:39 PM
> To: linux-scsi@vger.kernel.org
> Subject: Device or HBA level QD throttling creates randomness in
sequetial
> workload
>
> [ Apologize, if you find more than one instance of my email.
> Web based email client has some issue, so now trying git send mail.]
>
> Hi,
>
> I am doing some performance tuning in MR driver to understand how sdev
> queue depth and hba queue depth play role in IO submission from above
layer.
> I have 24 JBOD connected to MR 12GB controller and I can see performance
for
> 4K Sequential work load as below.
>
> HBA QD for MR controller is 4065 and Per device QD is set to 32
>
> queue depth from <fio> 256 reports 300K IOPS queue depth from <fio> 128
> reports 330K IOPS queue depth from <fio> 64 reports 360K IOPS queue
depth
> from <fio> 32 reports 510K IOPS
>
> In MR driver I added debug print and confirm that more IO come to driver
as
> random IO whenever I have <fio> queue depth more than 32.
>
> I have debug using scsi logging level and blktrace as well. Below is
snippet of
> logs using scsi logging level.  In summary, if SML do flow control of IO
due to
> Device QD or HBA QD, IO coming to LLD is more random pattern.
>
> I see IO coming to driver is not sequential.
>
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0
3b 00
> 00 01 00 [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00
00 03
> c0 3c 00 00 01 00 [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB:
Write(10) 2a
> 00 00 03 c0 5b 00 00 01 00
>
> <KD> After LBA "00 03 c0 3c" next command is with LBA "00 03 c0 5b".
> Two Sequence are overlapped due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0
5c 00
> 00 01 00 [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00
00 03
> c0 3d 00 00 01 00 [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB:
Write(10) 2a
> 00 00 03 c0 5d 00 00 01 00 [79546.912259] sd 18:2:21:0: [sdy] tag#857
CDB:
> Write(10) 2a 00 00 03 c0 3e 00 00 01 00 [79546.912268] sd 18:2:21:0:
[sdy]
> tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
>  If scsi_request_fn() breaks due to unavailability of device queue (due
to below
> check), will there be any side defect as I observe ?
>                 if (!scsi_dev_queue_ready(q, sdev))
>                              break;
>
> If I reduce HBA QD and make sure IO from above layer is throttled due to
HBA
> QD, there is a same impact.
> MR driver use host wide shared tag map.
>
> Can someone help me if this can be tunable in LLD providing additional
settings
> or it is expected behavior ? Problem I am facing is, I am not able to
figure out
> optimal device queue depth for different configuration and work load.
>
> Thanks, Kashyap

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Device or HBA level QD throttling creates randomness in sequetial workload
@ 2016-10-20 10:08 Kashyap Desai
  0 siblings, 0 replies; 17+ messages in thread
From: Kashyap Desai @ 2016-10-20 10:08 UTC (permalink / raw)
  To: linux-scsi

[ Apologies if you find more than one instance of my email.
The web-based email client has some issues, so I am now trying git send-email. ]

Hi,

I am doing some performance tuning in the MR (MegaRAID) driver to understand how the sdev queue depth and HBA queue depth play a role in IO submission from the layer above.
I have 24 JBODs connected to an MR 12Gb/s controller, and I see the following performance for a 4K sequential workload:

The HBA QD for the MR controller is 4065, and the per-device QD is set to 32.

queue depth from <fio> 256 reports 300K IOPS 
queue depth from <fio> 128 reports 330K IOPS
queue depth from <fio> 64 reports 360K IOPS 
queue depth from <fio> 32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO comes to the driver as random IO whenever the <fio> queue depth is more than 32.

I have debugged using the SCSI logging level and blktrace as well. Below is a snippet of logs using the SCSI logging level. In summary, if the SML does flow control of IO due to the device QD or HBA QD, the IO coming to the LLD has a more random pattern.

I see that the IO coming to the driver is not sequential.

[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00 

<KD> After LBA "00 03 c0 3c", the next command is at LBA "00 03 c0 5b".
Two sequences are overlapping due to sdev QD throttling.

[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
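
For reference, the Write(10) LBA is carried big-endian in CDB bytes 2-5,
which is how the jump from "00 03 c0 3c" to "00 03 c0 5b" above can be
spotted. A small decoding sketch (hypothetical helper code, not part of
the driver or of the SCSI logging output):

#include <stdio.h>

/* Write(10) CDB: byte 0 = 0x2a, bytes 2-5 = LBA (big-endian), bytes 7-8 = length. */
static unsigned int write10_lba(const unsigned char *cdb)
{
	return ((unsigned int)cdb[2] << 24) | (cdb[3] << 16) | (cdb[4] << 8) | cdb[5];
}

int main(void)
{
	/* CDBs copied from the log snippet above (tags 854, 855, 886). */
	const unsigned char cdbs[][10] = {
		{ 0x2a, 0x00, 0x00, 0x03, 0xc0, 0x3b, 0x00, 0x00, 0x01, 0x00 },
		{ 0x2a, 0x00, 0x00, 0x03, 0xc0, 0x3c, 0x00, 0x00, 0x01, 0x00 },
		{ 0x2a, 0x00, 0x00, 0x03, 0xc0, 0x5b, 0x00, 0x00, 0x01, 0x00 },
	};
	unsigned int expected = 0;

	for (int i = 0; i < 3; i++) {
		unsigned int lba = write10_lba(cdbs[i]);
		unsigned int len = (cdbs[i][7] << 8) | cdbs[i][8];

		if (i && lba != expected)
			printf("non-sequential jump: expected LBA %u, got %u\n",
			       expected, lba);
		expected = lba + len;
	}
	return 0;
}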

If scsi_request_fn() breaks out due to unavailability of the device queue (because of the check below), will there be any side effect such as the one I observe?
                if (!scsi_dev_queue_ready(q, sdev))
                             break;

If I reduce the HBA QD and make sure IO from the layer above is throttled due to the HBA QD, the impact is the same.
The MR driver uses a host-wide shared tag map.

Can someone tell me whether this can be made tunable in the LLD by providing additional settings, or whether it is expected behavior? The problem I am facing is that I am not able to figure out the optimal device queue depth for different configurations and workloads.

Thanks, Kashyap


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2016-10-24 18:54 Device or HBA level QD throttling creates randomness in sequetial workload Kashyap Desai
2016-10-26 20:56 ` Omar Sandoval
2016-10-31 17:24 ` Jens Axboe
2016-11-01  5:40   ` Kashyap Desai
2017-01-30 13:52   ` Kashyap Desai
2017-01-30 16:30     ` Bart Van Assche
2017-01-30 16:32       ` Jens Axboe
2017-01-30 18:28         ` Kashyap Desai
2017-01-30 18:29           ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2016-10-21 12:13 Kashyap Desai
2016-10-21 21:31 ` Omar Sandoval
2016-10-22 15:04   ` Kashyap Desai
2016-10-24 13:05   ` Kashyap Desai
2016-10-24 15:41     ` Omar Sandoval
2016-10-20 10:08 Kashyap Desai
