All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.garry@huawei.com>,
	axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com,
	don.brace@microsemi.com, Sumit Saxena <sumit.saxena@broadcom.com>,
	bvanassche@acm.org, hare@suse.com, hch@lst.de,
	Shivasharan Srikanteshwara 
	<shivasharan.srikanteshwara@broadcom.com>,
	linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	esc.storagedev@microsemi.com, chenxiang66@hisilicon.com,
	"PDL,MEGARAIDLINUX" <megaraidlinux.pdl@broadcom.com>
Subject: RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
Date: Mon, 10 Aug 2020 22:08:59 +0530	[thread overview]
Message-ID: <6f8790811e7a3238f2b0fa35fbb816bc@mail.gmail.com> (raw)
In-Reply-To: <20200809021633.GA2134904@T590>

> On Sun, Aug 09, 2020 at 12:35:21AM +0530, Kashyap Desai wrote:
> > > On Thu, Aug 06, 2020 at 08:07:38PM +0530, Kashyap Desai wrote:
> > > > > > Hi Ming -
> > > > > >
> > > > > > There is still some race which is not handled.  Take a case of
> > > > > > IO is not able to get budget and it has already marked
> > > > > > <restarts>
> > flag.
> > > > > > <restarts> flag will be seen non-zero in completion path and
> > > > > > completion path will attempt h/w queue run. (But this
> > > > > > particular IO is still not in s/w queue.).
> > > > > > Attempt of running h/w queue from completion path will not
> > > > > > flush any IO since there is no IO in s/w queue.
> > > > >
> > > > > Then where is the IO to be submitted in case of running out of
> > budget?
> > > >
> > > > Typical race in your latest patch is - (Lets consider command A,B
> > > > and
> > > > C) Command A did not receive budget. Command B completed  (which
> > > > was already
> > >
> > > Command A doesn't get budget, and A is still in sw/scheduler queue
> > because
> > > we try to acquire budget before dequeuing request from sw/scheduler
> > queue,
> > > see __blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx().
> > >
> > > Not consider direct issue, because the hw queue will be run
> > > explicitly
> > when
> > > not getting budget, see __blk_mq_try_issue_directly.
> > >
> > > Not consider command A being added to hctx->dispatch too, because
> > > blk-mq will re-run the queue, see blk_mq_dispatch_rq_list().
> >
> > Ming -
> >
> > After going through your comment (I noted your comment and thanks for
> > correcting my understanding.) and block layer code, I realize that it
> > is a different race condition. My previous explanation was not
accurate.
> > I debug further and figure out what is actually happening - Consider
> > below scenario/sequence -
> >
> > Thread -1 - Detected budget contention. Set restarts = 1.
> > Thread -2 - old restarts = 1. start hw queue.
> > Thread -3 - old restarts = 1. start hw queue.
> > Thread -2 - move restarts = 0.
> > In my testing, I noticed that both thread-2 and thread-3 started h/w
> > queue but there was no work for them to do. It is possible because
> > some other context of h/w queue run might have done that job.
>
> It should be true, because there is other run queue somewhere, such as
blk-
> mq's restart or delay run queue.
>
> > It means, IO of thread-1 is already posted.
>
> OK.
>
> > Thread -4 - Detected budget contention. Set restart = 1 (because
> > thread-2 has move restarts=0).
>
> OK.
>
> > Thread -3 - move restarts = 0 (because this thread see old value = 1
> > but that is actually updated one more time by thread-4 and theread-4
> > actually wanted to run h/w queues). IO of Thread-4 will not be
scheduled.
>
> Right.
>
> >
> > We have to make sure that completion IO path do atomic_cmpxchng of
> > restarts flag before running the h/w queue.  Below code change - (main
> > fix is sequence of atomic_cmpxchg and blk_mq_run_hw_queues) fix the
> issue.
> >
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -594,8 +594,27 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >         if (scsi_target(sdev)->single_lun ||
> >             !list_empty(&sdev->host->starved_list))
> >                 kblockd_schedule_work(&sdev->requeue_work);
> > -       else
> > -               blk_mq_run_hw_queues(q, true);
> > +       else {
> > +               /*
> > +                * smp_mb() implied in either rq->end_io or
> > blk_mq_free_request
> > +                * is for ordering writing .device_busy in
> > scsi_device_unbusy()
> > +                * and reading sdev->restarts.
> > +                */
> > +               int old = atomic_read(&sdev->restarts);
> > +
> > +               if (old) {
> > +                       /*
> > +                        * ->restarts has to be kept as non-zero if
> > + there
> > is
> > +                        *  new budget contention comes.
> > +                        */
> > +                       atomic_cmpxchg(&sdev->restarts, old, 0);
> > +
> > +                       /* run the queue after restarts flag is
updated
> > +                        * to avoid race condition with .get_budget
> > +                        */
> > +                       blk_mq_run_hw_queues(sdev->request_queue,
true);
> > +               }
> > +       }
> >
>
> I think the above change is right, and this patter is basically same
with
> SCHED_RESTART used in blk_mq_sched_restart().
>
> BTW, could you run your function & performance test against the
following
> new version?
> Then I can include your test result in commit log for moving on.

Ming  - I completed both functional and performance test.

System used for the test -
Manufacturer: Supermicro
Product Name: X11DPG-QT

lscpu <snippet>
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Controller used -
MegaRAID 9560-16i

Total 24 SAS driver of model "WDC      WUSTM3240ASS200"

Total 3 VD created each VD consist of 8 SAS Drives.

Performance testing -

Fio script -
[global]
ioengine=libaio
direct=1
sync=0
ramp_time=20
runtime=60
cpus_allowed=18,19
bs=4k
rw=randread
ioscheduler=none
iodepth=128

[seqprecon]
filename=/dev/sdc
[seqprecon]
filename=/dev/sdd
[seqprecon]
filename=/dev/sde

Without this patch - 602K IOPS. Perf top snippet -(Note - Usage of
blk_mq_run_hw_queues -> blk_mq_run_hw_queue is very high. It consume more
CPU which leads to less performance.)

     8.70%  [kernel]        [k] blk_mq_run_hw_queue
     5.24%  [megaraid_sas]  [k] complete_cmd_fusion
     4.65%  [kernel]        [k] sbitmap_any_bit_set
     3.93%  [kernel]        [k] irq_entries_start
     3.58%  perf            [.] __symbols__insert
     2.21%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     1.91%  [kernel]        [k] blk_mq_run_hw_queues

With this patch - 1110K IOPS. Perf top snippet -

    8.05%  [megaraid_sas]  [k] complete_cmd_fusion
     4.10%  [kernel]        [k] irq_entries_start
     3.71%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     2.85%  [kernel]        [k] read_tsc
     2.83%  [kernel]        [k] io_submit_one
     2.26%  [kernel]        [k] entry_SYSCALL_64
     2.08%  [megaraid_sas]  [k] megasas_queue_command


Functional Test -

I cover overnight IO testing using <fio> script which sends 4K rand read,
read, rand write and write IOs to the 24 SAS JBOD drives.
Some of the JBOD has ioscheduler=none and some of the JBOD has
ioscheduler=mq-deadline
I used additional script which change sdev->queue_depth of each device
from 2 to 16 range at the interval of 5 seconds.
I used additional script which toggle "rq_affinity=1" and "rq_affinity=2"
at the interval of 5 seconds.

I did not noticed any IO hang.

Thanks, Kashyap

>
>
> From 06993ddf5c5dbe0e772cc38342919eb61a57bc50 Mon Sep 17 00:00:00
> 2001
> From: Ming Lei <ming.lei@redhat.com>
> Date: Wed, 5 Aug 2020 16:35:53 +0800
> Subject: [PATCH] scsi: core: only re-run queue in scsi_end_request() if
device
> queue is busy
>
> Now the request queue is run in scsi_end_request() unconditionally if
both
> target queue and host queue is ready. We should have re-run request
queue
> only after this device queue becomes busy for restarting this LUN only.
>
> Recently Long Li reported that cost of run queue may be very heavy in
case of
> high queue depth. So improve this situation by only running the request
queue
> when this LUN is busy.
>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Long Li <longli@microsoft.com>
> Reported-by: Long Li <longli@microsoft.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V4:
> 	- not clear .restarts in get_budget(), instead clearing it
> 	after re-run queue is done; Kashyap figured out we have to
> 	update ->restarts before re-run queue in scsi_run_queue_async().
>
> V3:
> 	- add one smp_mb() in scsi_mq_get_budget() and comment
>
> V2:
> 	- commit log change, no any code change
> 	- add reported-by tag
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c    | 51 +++++++++++++++++++++++++++++++++++---
>  include/scsi/scsi_device.h |  1 +
>  2 files changed, 49 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> c866a4f33871..d083250f9518 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -552,8 +552,27 @@ static void scsi_run_queue_async(struct scsi_device
> *sdev)
>  	if (scsi_target(sdev)->single_lun ||
>  	    !list_empty(&sdev->host->starved_list))
>  		kblockd_schedule_work(&sdev->requeue_work);
> -	else
> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> +	else {
> +		/*
> +		 * smp_mb() implied in either rq->end_io or
> blk_mq_free_request
> +		 * is for ordering writing .device_busy in
scsi_device_unbusy()
> +		 * and reading sdev->restarts.
> +		 */
> +		int old = atomic_read(&sdev->restarts);
> +
> +		if (old) {
> +			/*
> +			 * ->restarts has to be kept as non-zero if there
is
> +			 *  new budget contention comes.
> +			 *
> +			 *  No need to run queue when either another
re-run
> +			 *  queue wins in updating ->restarts or one new
> budget
> +			 *  contention comes.
> +			 */
> +			if (atomic_cmpxchg(&sdev->restarts, old, 0) ==
old)
> +				blk_mq_run_hw_queues(sdev-
> >request_queue, true);
> +		}
> +	}
>  }
>
>  /* Returns false when no more bytes to process, true if there are more
*/
> @@ -1612,8 +1631,34 @@ static void scsi_mq_put_budget(struct
> request_queue *q)  static bool scsi_mq_get_budget(struct request_queue
*q)
> {
>  	struct scsi_device *sdev = q->queuedata;
> +	int ret = scsi_dev_queue_ready(q, sdev);
> +
> +	if (ret)
> +		return true;
> +
> +	atomic_inc(&sdev->restarts);
>
> -	return scsi_dev_queue_ready(q, sdev);
> +	/*
> +	 * Order writing .restarts and reading .device_busy, and make sure
> +	 * .restarts is visible to scsi_end_request(). Its pair is implied
by
> +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> +	 * writing .device_busy in scsi_device_unbusy() and reading
.restarts.
> +	 *
> +	 */
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * If all in-flight requests originated from this LUN are
completed
> +	 * before setting .restarts, sdev->device_busy will be observed as
> +	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this
> request
> +	 * soon. Otherwise, completion of one of these request will
observe
> +	 * the .restarts flag, and the request queue will be run for
handling
> +	 * this request, see scsi_end_request().
> +	 */
> +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> +				!scsi_device_blocked(sdev)))
> +		blk_mq_delay_run_hw_queues(sdev->request_queue,
> SCSI_QUEUE_DELAY);
> +	return false;
>  }
>
>  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, diff
--git
> a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index
> bc5909033d13..1a5c9a3df6d6 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -109,6 +109,7 @@ struct scsi_device {
>  	atomic_t device_busy;		/* commands actually active on
LLDD
> */
>  	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
>
> +	atomic_t restarts;
>  	spinlock_t list_lock;
>  	struct list_head starved_entry;
>  	unsigned short queue_depth;	/* How deep of a queue we want */
> --
> 2.25.2
>
>
>
> Thanks,
> Ming

  reply	other threads:[~2020-08-10 16:39 UTC|newest]

Thread overview: 123+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
2020-06-11  2:57   ` Ming Lei
2020-06-11  8:26     ` John Garry
2020-06-23 11:25       ` John Garry
2020-06-23 14:23         ` Hannes Reinecke
2020-06-24  8:13           ` Kashyap Desai
2020-06-29 16:18             ` John Garry
2020-08-10 16:51           ` Kashyap Desai
2020-08-11  8:01             ` John Garry
2020-08-11 16:34               ` Kashyap Desai
2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
2020-06-11  3:37   ` Ming Lei
2020-06-11 10:09     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
2020-06-11  4:04   ` Ming Lei
2020-06-11 10:22     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
2020-06-11 13:16   ` Hannes Reinecke
2020-06-11 14:22     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
2020-06-11 13:19   ` Hannes Reinecke
2020-06-11 14:33     ` John Garry
2020-06-12  6:06       ` Hannes Reinecke
2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
2020-06-30  6:33           ` Hannes Reinecke
2020-06-30  7:30             ` John Garry
2020-06-30 11:36               ` John Garry
2020-06-30 14:55           ` Bart Van Assche
2020-07-13  9:41         ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
2020-07-13 12:20           ` Hannes Reinecke
2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
2020-07-02 10:23   ` Kashyap Desai
2020-07-06  8:23     ` John Garry
2020-07-06  8:45       ` Hannes Reinecke
2020-07-06  9:26         ` John Garry
2020-07-06  9:40           ` Hannes Reinecke
2020-07-06 19:19       ` Kashyap Desai
2020-07-07  7:58         ` John Garry
2020-07-07 14:45           ` Kashyap Desai
2020-07-07 16:17             ` John Garry
2020-07-09 19:01               ` Kashyap Desai
2020-07-10  8:10                 ` John Garry
2020-07-13  7:55                   ` Kashyap Desai
2020-07-13  8:42                     ` John Garry
2020-07-19 19:07                       ` Kashyap Desai
2020-07-20  7:23                       ` Kashyap Desai
2020-07-20  9:18                         ` John Garry
2020-07-21  1:13                         ` Ming Lei
2020-07-21  6:53                           ` Kashyap Desai
2020-07-22  4:12                             ` Ming Lei
2020-07-22  5:30                               ` Kashyap Desai
2020-07-22  8:04                                 ` Ming Lei
2020-07-22  9:32                                   ` John Garry
2020-07-23 14:07                                     ` Ming Lei
2020-07-23 17:29                                       ` John Garry
2020-07-24  2:47                                         ` Ming Lei
2020-07-28  7:54                                           ` John Garry
2020-07-28  8:45                                             ` Ming Lei
2020-07-29  5:25                                               ` Kashyap Desai
2020-07-29 15:36                                                 ` Ming Lei
2020-07-29 18:31                                                   ` Kashyap Desai
2020-08-04  8:36                                                     ` Ming Lei
2020-08-04  9:27                                                       ` Kashyap Desai
2020-08-05  8:40                                                         ` Ming Lei
2020-08-06 10:25                                                           ` Kashyap Desai
2020-08-06 13:38                                                             ` Ming Lei
2020-08-06 14:37                                                               ` Kashyap Desai
2020-08-06 15:29                                                                 ` Ming Lei
2020-08-08 19:05                                                                   ` Kashyap Desai
2020-08-09  2:16                                                                     ` Ming Lei
2020-08-10 16:38                                                                       ` Kashyap Desai [this message]
2020-08-11  8:09                                                                         ` John Garry
2020-08-04 17:00                                               ` John Garry
2020-08-05  2:56                                                 ` Ming Lei
2020-07-28  8:01                                   ` Kashyap Desai
2020-07-08 11:31         ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
2020-07-14 13:16   ` John Garry
2020-07-14 13:31     ` John Garry
2020-07-14 18:16       ` Don.Brace
2020-07-15  7:28         ` John Garry
2020-07-14 14:02     ` Hannes Reinecke
2020-08-18  8:33       ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
2020-07-14  7:37   ` John Garry
2020-07-14  7:41     ` Hannes Reinecke
2020-07-14  7:52       ` John Garry
2020-07-14  8:06         ` Ming Lei
2020-07-14  9:53           ` John Garry
2020-07-14 10:14             ` Ming Lei
2020-07-14 10:43               ` Hannes Reinecke
2020-07-14 10:19             ` Hannes Reinecke
2020-07-14 10:35               ` John Garry
2020-07-14 10:44               ` Ming Lei
2020-07-14 10:52                 ` John Garry
2020-07-14 12:04                   ` Ming Lei
2020-08-03 20:39         ` Don.Brace
2020-08-04  9:27           ` John Garry
2020-08-04 15:18             ` Don.Brace
2020-08-05 11:21               ` John Garry
2020-08-14 21:04                 ` Don.Brace
2020-08-17  8:00                   ` John Garry
2020-08-17 18:39                     ` Don.Brace
2020-08-18  7:14                       ` Hannes Reinecke
2020-07-16 16:14     ` Don.Brace
2020-07-16 19:45     ` Don.Brace
2020-07-17 10:11       ` John Garry
2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
2020-06-11  9:35   ` John Garry
2020-06-12 18:47     ` Kashyap Desai
2020-06-15  2:13       ` Ming Lei
2020-06-15  6:57         ` Kashyap Desai
2020-06-16  1:00           ` Ming Lei
2020-06-17 11:26             ` Kashyap Desai
2020-06-22  6:24               ` Hannes Reinecke
2020-06-23  0:55                 ` Ming Lei
2020-06-23 11:50                   ` Kashyap Desai
2020-06-23 12:11                   ` Kashyap Desai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6f8790811e7a3238f2b0fa35fbb816bc@mail.gmail.com \
    --to=kashyap.desai@broadcom.com \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=chenxiang66@hisilicon.com \
    --cc=don.brace@microsemi.com \
    --cc=esc.storagedev@microsemi.com \
    --cc=hare@suse.com \
    --cc=hch@lst.de \
    --cc=jejb@linux.ibm.com \
    --cc=john.garry@huawei.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=megaraidlinux.pdl@broadcom.com \
    --cc=ming.lei@redhat.com \
    --cc=shivasharan.srikanteshwara@broadcom.com \
    --cc=sumit.saxena@broadcom.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.