From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9CFA0C433DF for ; Sun, 9 Aug 2020 02:17:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5EE38206B2 for ; Sun, 9 Aug 2020 02:17:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="QW083Wsz" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726050AbgHICRE (ORCPT ); Sat, 8 Aug 2020 22:17:04 -0400 Received: from us-smtp-1.mimecast.com ([207.211.31.81]:38675 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726009AbgHICRD (ORCPT ); Sat, 8 Aug 2020 22:17:03 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1596939421; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=mYXQdmzaZ6N4GIvtqoMuI82wQrzXh7PjWvL5d//vkVc=; b=QW083Wsz1eZDTeClEHzpYPpnFuJkNbvCnzY7Fpwh/+UXq1PWUL+/HpQ1GwKoGO07Gf/GTp cYXm+O+RKs1jKtJicM1BnWDnOKFKyZyKhREgvez1cDCgnmZGzzUOOXl4ikZPL4yCMvGhdr DQgwnmArvPnqEAu38d78dnmqKx8Bqhk= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-479-6dESzTRyPlCj7ViM1aZSmg-1; Sat, 08 Aug 2020 22:16:57 -0400 X-MC-Unique: 6dESzTRyPlCj7ViM1aZSmg-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 5AC751009600; Sun, 9 Aug 2020 02:16:54 +0000 (UTC) Received: from T590 (ovpn-12-63.pek2.redhat.com [10.72.12.63]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 19B9B10023A0; Sun, 9 Aug 2020 02:16:38 +0000 (UTC) Date: Sun, 9 Aug 2020 10:16:33 +0800 From: Ming Lei To: Kashyap Desai Cc: John Garry , axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com, don.brace@microsemi.com, Sumit Saxena , bvanassche@acm.org, hare@suse.com, hch@lst.de, Shivasharan Srikanteshwara , linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, esc.storagedev@microsemi.com, chenxiang66@hisilicon.com, "PDL,MEGARAIDLINUX" Subject: Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ Message-ID: <20200809021633.GA2134904@T590> References: <20200729153648.GA1698748@T590> <7f94eaf2318cc26ceb64bde88d59d5e2@mail.gmail.com> <20200804083625.GA1958244@T590> <20200805084031.GA1995289@T590> <5adffdf805179428bdd0dd6c293a4f7d@mail.gmail.com> <20200806133819.GA2046861@T590> <20200806152939.GA2062348@T590> <3f35b0f67c73c8c4996fdad80eb6d963@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3f35b0f67c73c8c4996fdad80eb6d963@mail.gmail.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Sun, Aug 09, 2020 at 12:35:21AM +0530, Kashyap Desai wrote: > > On Thu, Aug 06, 2020 at 08:07:38PM +0530, Kashyap Desai wrote: > > > > > Hi Ming - > > > > > > > > > > There is still some race which is not handled. Take a case of IO > > > > > is not able to get budget and it has already marked > flag. > > > > > flag will be seen non-zero in completion path and > > > > > completion path will attempt h/w queue run. (But this particular > > > > > IO is still not in s/w queue.). > > > > > Attempt of running h/w queue from completion path will not flush > > > > > any IO since there is no IO in s/w queue. > > > > > > > > Then where is the IO to be submitted in case of running out of > budget? > > > > > > Typical race in your latest patch is - (Lets consider command A,B and > > > C) Command A did not receive budget. Command B completed (which was > > > already > > > > Command A doesn't get budget, and A is still in sw/scheduler queue > because > > we try to acquire budget before dequeuing request from sw/scheduler > queue, > > see __blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx(). > > > > Not consider direct issue, because the hw queue will be run explicitly > when > > not getting budget, see __blk_mq_try_issue_directly. > > > > Not consider command A being added to hctx->dispatch too, because blk-mq > > will re-run the queue, see blk_mq_dispatch_rq_list(). > > Ming - > > After going through your comment (I noted your comment and thanks for > correcting my understanding.) and block layer code, I realize that it is a > different race condition. My previous explanation was not accurate. > I debug further and figure out what is actually happening - Consider below > scenario/sequence - > > Thread -1 - Detected budget contention. Set restarts = 1. > Thread -2 - old restarts = 1. start hw queue. > Thread -3 - old restarts = 1. start hw queue. > Thread -2 - move restarts = 0. > In my testing, I noticed that both thread-2 and thread-3 started h/w queue > but there was no work for them to do. It is possible because some other > context of h/w queue run might have done that job. It should be true, because there is other run queue somewhere, such as blk-mq's restart or delay run queue. > It means, IO of thread-1 is already posted. OK. > Thread -4 - Detected budget contention. Set restart = 1 (because thread-2 > has move restarts=0). OK. > Thread -3 - move restarts = 0 (because this thread see old value = 1 but > that is actually updated one more time by thread-4 and theread-4 actually > wanted to run h/w queues). IO of Thread-4 will not be scheduled. Right. > > We have to make sure that completion IO path do atomic_cmpxchng of > restarts flag before running the h/w queue. Below code change - (main fix > is sequence of atomic_cmpxchg and blk_mq_run_hw_queues) fix the issue. > > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -594,8 +594,27 @@ static bool scsi_end_request(struct request *req, > blk_status_t error, > if (scsi_target(sdev)->single_lun || > !list_empty(&sdev->host->starved_list)) > kblockd_schedule_work(&sdev->requeue_work); > - else > - blk_mq_run_hw_queues(q, true); > + else { > + /* > + * smp_mb() implied in either rq->end_io or > blk_mq_free_request > + * is for ordering writing .device_busy in > scsi_device_unbusy() > + * and reading sdev->restarts. > + */ > + int old = atomic_read(&sdev->restarts); > + > + if (old) { > + /* > + * ->restarts has to be kept as non-zero if there > is > + * new budget contention comes. > + */ > + atomic_cmpxchg(&sdev->restarts, old, 0); > + > + /* run the queue after restarts flag is updated > + * to avoid race condition with .get_budget > + */ > + blk_mq_run_hw_queues(sdev->request_queue, true); > + } > + } > I think the above change is right, and this patter is basically same with SCHED_RESTART used in blk_mq_sched_restart(). BTW, could you run your function & performance test against the following new version? Then I can include your test result in commit log for moving on. >From 06993ddf5c5dbe0e772cc38342919eb61a57bc50 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Wed, 5 Aug 2020 16:35:53 +0800 Subject: [PATCH] scsi: core: only re-run queue in scsi_end_request() if device queue is busy Now the request queue is run in scsi_end_request() unconditionally if both target queue and host queue is ready. We should have re-run request queue only after this device queue becomes busy for restarting this LUN only. Recently Long Li reported that cost of run queue may be very heavy in case of high queue depth. So improve this situation by only running the request queue when this LUN is busy. Cc: Jens Axboe Cc: Ewan D. Milne Cc: Kashyap Desai Cc: Hannes Reinecke Cc: Bart Van Assche Cc: Damien Le Moal Cc: Long Li Reported-by: Long Li Signed-off-by: Ming Lei --- V4: - not clear .restarts in get_budget(), instead clearing it after re-run queue is done; Kashyap figured out we have to update ->restarts before re-run queue in scsi_run_queue_async(). V3: - add one smp_mb() in scsi_mq_get_budget() and comment V2: - commit log change, no any code change - add reported-by tag Signed-off-by: Ming Lei --- drivers/scsi/scsi_lib.c | 51 +++++++++++++++++++++++++++++++++++--- include/scsi/scsi_device.h | 1 + 2 files changed, 49 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index c866a4f33871..d083250f9518 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -552,8 +552,27 @@ static void scsi_run_queue_async(struct scsi_device *sdev) if (scsi_target(sdev)->single_lun || !list_empty(&sdev->host->starved_list)) kblockd_schedule_work(&sdev->requeue_work); - else - blk_mq_run_hw_queues(sdev->request_queue, true); + else { + /* + * smp_mb() implied in either rq->end_io or blk_mq_free_request + * is for ordering writing .device_busy in scsi_device_unbusy() + * and reading sdev->restarts. + */ + int old = atomic_read(&sdev->restarts); + + if (old) { + /* + * ->restarts has to be kept as non-zero if there is + * new budget contention comes. + * + * No need to run queue when either another re-run + * queue wins in updating ->restarts or one new budget + * contention comes. + */ + if (atomic_cmpxchg(&sdev->restarts, old, 0) == old) + blk_mq_run_hw_queues(sdev->request_queue, true); + } + } } /* Returns false when no more bytes to process, true if there are more */ @@ -1612,8 +1631,34 @@ static void scsi_mq_put_budget(struct request_queue *q) static bool scsi_mq_get_budget(struct request_queue *q) { struct scsi_device *sdev = q->queuedata; + int ret = scsi_dev_queue_ready(q, sdev); + + if (ret) + return true; + + atomic_inc(&sdev->restarts); - return scsi_dev_queue_ready(q, sdev); + /* + * Order writing .restarts and reading .device_busy, and make sure + * .restarts is visible to scsi_end_request(). Its pair is implied by + * __blk_mq_end_request() in scsi_end_request() for ordering + * writing .device_busy in scsi_device_unbusy() and reading .restarts. + * + */ + smp_mb__after_atomic(); + + /* + * If all in-flight requests originated from this LUN are completed + * before setting .restarts, sdev->device_busy will be observed as + * zero, then blk_mq_delay_run_hw_queues() will dispatch this request + * soon. Otherwise, completion of one of these request will observe + * the .restarts flag, and the request queue will be run for handling + * this request, see scsi_end_request(). + */ + if (unlikely(atomic_read(&sdev->device_busy) == 0 && + !scsi_device_blocked(sdev))) + blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY); + return false; } static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index bc5909033d13..1a5c9a3df6d6 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -109,6 +109,7 @@ struct scsi_device { atomic_t device_busy; /* commands actually active on LLDD */ atomic_t device_blocked; /* Device returned QUEUE_FULL. */ + atomic_t restarts; spinlock_t list_lock; struct list_head starved_entry; unsigned short queue_depth; /* How deep of a queue we want */ -- 2.25.2 Thanks, Ming