linux-kernel.vger.kernel.org archive mirror
From: Paolo Valente <paolo.valente@linaro.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Doug Anderson <dianders@chromium.org>,
	Ming Lei <ming.lei@redhat.com>,
	"James E.J. Bottomley" <jejb@linux.ibm.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-block <linux-block@vger.kernel.org>,
	Guenter Roeck <groeck@chromium.org>,
	linux-scsi@vger.kernel.org, Salman Qazi <sqazi@google.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] scsi: core: Fix stall if two threads request budget at the same time
Date: Wed, 1 Apr 2020 09:49:40 +0200	[thread overview]
Message-ID: <1D4B63FC-FA4B-4C73-B70F-6639CC41D3A6@linaro.org> (raw)
In-Reply-To: <02968c1d-bd3a-af9c-77e7-23a9d9aa9af4@kernel.dk>



> Il giorno 1 apr 2020, alle ore 03:21, Jens Axboe <axboe@kernel.dk> ha scritto:
> 
> On 3/31/20 5:51 PM, Doug Anderson wrote:
>> Hi,
>> 
>> On Tue, Mar 31, 2020 at 11:26 AM Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> On 3/31/20 12:07 PM, Paolo Valente wrote:
>>>>> Il giorno 31 mar 2020, alle ore 03:41, Ming Lei <ming.lei@redhat.com> ha scritto:
>>>>> 
>>>>> On Mon, Mar 30, 2020 at 07:49:06AM -0700, Douglas Anderson wrote:
>>>>>> It is possible for two threads to be running
>>>>>> blk_mq_do_dispatch_sched() at the same time with the same "hctx".
>>>>>> This is because there can be more than one caller to
>>>>>> __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
>>>>>> prevent more than one thread from entering.
>>>>>> 
>>>>>> If more than one thread is running blk_mq_do_dispatch_sched() at the
>>>>>> same time with the same "hctx", they may have contention acquiring
>>>>>> budget.  The blk_mq_get_dispatch_budget() can eventually translate
>>>>>> into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
>>>>>> uncommon) then only one of the two threads will be the one to
>>>>>> increment "device_busy" to 1 and get the budget.
>>>>>> 
>>>>>> The losing thread will break out of blk_mq_do_dispatch_sched() and
>>>>>> will stop dispatching requests.  The assumption is that when more
>>>>>> budget is available later (when existing transactions finish) the
>>>>>> queue will be kicked again, perhaps in scsi_end_request().
>>>>>> 
>>>>>> The winning thread now has budget and can go on to call
>>>>>> dispatch_request().  If dispatch_request() returns NULL here then we
>>>>>> have a potential problem.  Specifically we'll now call
>>>>> 
>>>>> I guess this problem is BFQ-specific. There are definitely
>>>>> requests in the BFQ queue for this hctx. However, it looks like
>>>>> this request is only available to the other, losing thread, and it
>>>>> won't be retrieved by the winning thread via
>>>>> e->type->ops.dispatch_request().
>>>>> 
>>>>> Just wondering why BFQ is implemented in this way?
>>>>> 
>>>> 
>>>> BFQ inherited this powerful-but-broken scheme from CFQ, a long
>>>> time ago.
>>>> 
>>>> In more detail: if BFQ has at least one non-empty internal queue,
>>>> then it says, of course, that there is work to do.  But if the
>>>> currently in-service queue is empty, and is expected to receive new
>>>> I/O, then BFQ plugs I/O dispatch to enforce service guarantees for
>>>> the in-service queue, i.e., BFQ responds NULL to a dispatch request.
>>> 
>>> What BFQ is doing is fine, IFF it always ensures that the queue is run
>>> at some later time, if it returns "yep I have work" yet returns NULL
>>> when attempting to retrieve that work. Generally this should happen from
>>> subsequent IO completion, or whatever else condition will resolve the
>>> issue that is currently preventing dispatch of that request. Last resort
>>> would be a timer, but that can happen if you're slicing your scheduling
>>> somehow.
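[Editor's note: the invariant Jens describes can be sketched as a small state machine: if the scheduler claimed work but dispatch returned NULL, something must remember to re-run the queue once the blocking condition clears (typically an I/O completion). All struct and field names here are invented for illustration; this is not the blk-mq code.]

```c
#include <stdbool.h>

struct hctx_sketch {
	bool needs_restart;	/* has_work/dispatch disagreed; re-run owed */
	int  pending;		/* requests the scheduler knows about */
	bool plugged;		/* scheduler is withholding dispatch */
};

static bool has_work(const struct hctx_sketch *h)
{
	return h->pending > 0;
}

static bool dispatch(struct hctx_sketch *h)
{
	if (h->plugged) {
		/* scheduler answers NULL: record that a re-run is owed */
		h->needs_restart = true;
		return false;
	}
	if (h->pending > 0) {
		h->pending--;
		return true;
	}
	return false;
}

/* Called when the blocking condition resolves (e.g. I/O completion).
 * Returns true if the caller must kick the queue again. */
static bool on_completion(struct hctx_sketch *h)
{
	h->plugged = false;
	if (h->needs_restart) {
		h->needs_restart = false;
		return true;
	}
	return false;
}
```

The key point is that the re-run is guaranteed by the completion path, not by polling: a single kick at the right moment, rather than the "series of periodic kicks" Paolo objects to below.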
>> 
>> I've been poking more at this today trying to understand why the idle
>> timer that Paolo says is in BFQ isn't doing what it should be doing.
>> I've been continuing to put most of my stream-of-consciousness at
>> <https://crbug.com/1061950> to avoid so much spamming of this thread.
>> In the trace I looked at most recently it looks like BFQ does try to
>> ensure that the queue is run at a later time, but at least in this
>> trace the later time is not late enough.  Specifically the quick
>> summary of my recent trace:
>> 
>> 28977309us - PID 2167 got the budget.
>> 28977518us - BFQ told PID 2167 that there was nothing to dispatch.
>> 28977702us - BFQ idle timer fires.
>> 28977725us - We start to try to dispatch as a result of BFQ's idle timer.
>> 28977732us - Dispatching that was a result of BFQ's idle timer can't get
>>             budget and thus does nothing.
>> 28977780us - PID 2167 put the budget and exits since there was nothing
>>             to dispatch.
>> 
>> This is only one particular trace, but in this case BFQ did attempt to
>> rerun the queue after it returned NULL, but that ran almost
>> immediately after it returned NULL and thus ran into the race.  :(
> 
> OK, and then it doesn't trigger again? It's key that it keeps doing this
> timeout and re-dispatch if it fails, not just once.
> 

The goal of BFQ's timer is to make BFQ switch from non-work-conserving
to work-conserving mode, just because not doing so would cause a
stall.  In contrast, it sounds a little weird that an I/O scheduler
systematically kicks I/O periodically (how can BFQ know when no more
kicking is needed?).  IOW, it doesn't seem very robust that blk-mq may
need a series of periodic kicks to finally restart, like a flooded
engine.

Compared with this solution, I'd still prefer one where BFQ doesn't
trigger this blk-mq stall at all.

Paolo

> But BFQ really should be smarter here. It's the same caller etc that
> asks whether it has work and whether it can dispatch, yet the answer is
> different. That's just kind of silly, and it'd make more sense if BFQ
> actually implemented the ->has_work() as a "would I actually dispatch
> for this guy, now".
> 
>>>> It would be very easy to change bfq_has_work so that it returns false
>>>> in case the in-service queue is empty, even if there is I/O
>>>> backlogged.  My only concern is: since everything has worked with the
>>>> current scheme for probably 15 years, are we sure that everything is
>>>> still ok after we change this scheme?
>>> 
>>> You're comparing apples to oranges, CFQ never worked within the blk-mq
>>> scheduling framework.
>>> 
>>> That said, I don't think such a change is needed. If we currently have a
>>> hang due to this discrepancy between has_work and gets_work, then it
>>> sounds like we're not always re-running the queue as we should. From the
>>> original patch, the budget putting is not something the scheduler is
>>> involved with. Do we just need to ensure that if we put budget without
>>> having dispatched a request, we need to kick off dispatching again?
>> 
>> By this you mean a change like this in blk_mq_do_dispatch_sched()?
>> 
>>  if (!rq) {
>>    blk_mq_put_dispatch_budget(hctx);
>> +    ret = true;
>>    break;
>>  }
>> 
>> I'm pretty sure that would fix the problems and I'd be happy to test,
>> but it feels like a heavy hammer.  IIUC we're essentially going to go
>> into a polling loop and keep calling has_work() and dispatch_request()
>> over and over again until has_work() returns false or we manage to
>> dispatch something...
> 
> We obviously have to be careful not to introduce a busy-loop, where we
> just keep scheduling dispatch, only to fail.
> 
> -- 
> Jens Axboe



Thread overview: 15+ messages
2020-03-30 14:49 [PATCH 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
2020-03-30 14:49 ` [PATCH 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
2020-03-30 14:49 ` [PATCH 2/2] scsi: core: Fix stall if two threads request budget at the same time Douglas Anderson
2020-03-31  1:41   ` Ming Lei
2020-03-31  2:15     ` Doug Anderson
2020-03-31  2:58       ` Ming Lei
2020-03-31  4:05         ` Doug Anderson
2020-03-31 18:07     ` Paolo Valente
2020-03-31 18:26       ` Jens Axboe
2020-03-31 23:51         ` Doug Anderson
2020-04-01  1:21           ` Jens Axboe
2020-04-01  7:49             ` Paolo Valente [this message]
2020-04-02 15:52               ` Doug Anderson
2020-04-01  2:04           ` Ming Lei
2020-04-01  2:32             ` Doug Anderson
