From: Doug Anderson <dianders@chromium.org>
To: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>, Ming Lei <ming.lei@redhat.com>,
	"James E.J. Bottomley" <jejb@linux.ibm.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-block <linux-block@vger.kernel.org>,
	Guenter Roeck <groeck@chromium.org>,
	linux-scsi@vger.kernel.org, Salman Qazi <sqazi@google.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] scsi: core: Fix stall if two threads request budget at the same time
Date: Thu, 2 Apr 2020 08:52:59 -0700
Message-ID: <CAD=FV=XX+S==bz1OQFYyQxNmsPdMQoJ7AJSs++D9dP_r0Rc93A@mail.gmail.com>
In-Reply-To: <1D4B63FC-FA4B-4C73-B70F-6639CC41D3A6@linaro.org>

Hi,

On Wed, Apr 1, 2020 at 12:48 AM Paolo Valente <paolo.valente@linaro.org> wrote:
>
> >> 28977309us - PID 2167 got the budget.
> >> 28977518us - BFQ told PID 2167 that there was nothing to dispatch.
> >> 28977702us - BFQ idle timer fires.
> >> 28977725us - We start to try to dispatch as a result of BFQ's idle timer.
> >> 28977732us - Dispatching that was a result of BFQ's idle timer can't get
> >>             budget and thus does nothing.
> >> 28977780us - PID 2167 put the budget and exits since there was nothing
> >>             to dispatch.
> >>
> >> This is only one particular trace, but in this case BFQ did attempt to
> >> rerun the queue after it returned NULL, but that ran almost
> >> immediately after it returned NULL and thus ran into the race.  :(
> >
> > OK, and then it doesn't trigger again? It's key that it keeps doing this
> > timeout and re-dispatch if it fails, not just once.
> >
>
> The goal of BFQ's timer is to make BFQ switch from non-work-conserving
> to work-conserving mode, just because not doing so would cause a
> stall.  In contrast, it sounds a little weird that an I/O scheduler
> systematically kicks I/O periodically (how can BFQ know when no more
> kicking is needed?).  IOW, it doesn't seem very robust that blk-mq may
> need a series of periodic kicks to finally restart, like a flooded
> engine.
>
> Compared with this solution, I'd still prefer one where BFQ doesn't
> trigger this blk-mq stall at all.

I spent more time thinking about this / testing things.  Probably the
best summary of my thoughts can be found at
<https://crbug.com/1061950#c79>.  The quick summary is that I believe
the problem is that BFQ has faith that, when it calls
blk_mq_run_hw_queues(), it will eventually be called back at least
once to dispatch.  That doesn't always happen due to the race we're
trying to solve here.  If we fix the race / make
blk_mq_run_hw_queues() reliable, then I don't think there's a need for
BFQ to implement some type of timeout/retry mechanism.
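
To make the race concrete, here is roughly the pattern involved
(simplified sketch, not the actual blk-mq code paths;
get_dispatch_budget() / put_dispatch_budget() are stand-in names for
the real budget helpers):

static bool dispatch_one(struct blk_mq_hw_ctx *hctx)
{
	struct elevator_queue *e = hctx->queue->elevator;
	struct request *rq;

	if (!get_dispatch_budget(hctx))		/* stand-in helper */
		return false;	/* someone else holds the budget */

	rq = e->type->ops.dispatch_request(hctx);
	if (!rq) {
		/*
		 * The scheduler had nothing at this instant.  If it
		 * becomes ready in this window and its kick via
		 * blk_mq_run_hw_queues() runs right now, the kick
		 * fails to get budget and does nothing.  We then put
		 * the budget and exit, and the request is stranded
		 * until something else happens to run the queue.
		 */
		put_dispatch_budget(hctx);	/* stand-in helper */
		return false;
	}
	/* ... hand rq to the driver ... */
	return true;
}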


> > But BFQ really should be smarter here. It's the same caller etc that
> > asks whether it has work and whether it can dispatch, yet the answer is
> > different. That's just kind of silly, and it'd make more sense if BFQ
> > actually implemented the ->has_work() as a "would I actually dispatch
> > for this guy, now".

I prototyped this and I think it would solve the problem (though I
haven't had time to do extensive testing yet).  It certainly makes
BFQ's has_work() more expensive in some cases, but it might be worth
it.  Someone set up to do benchmarking would need to say for sure.
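
Roughly, the idea was something like the below (untested sketch, not
the exact prototype; would_bfq_dispatch_now() is a hypothetical
side-effect-free variant of the selection logic, since
bfq_select_queue() itself expires and switches in-service queues as
it runs):

static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	bool has_work;

	if (!list_empty_careful(&bfqd->dispatch))
		return true;

	/*
	 * The expensive part: instead of the cheap "is any bfq_queue
	 * busy?" check, take bfqd->lock and ask whether dispatch
	 * would actually hand out a request right now.
	 */
	spin_lock_irq(&bfqd->lock);
	has_work = would_bfq_dispatch_now(bfqd);	/* hypothetical */
	spin_unlock_irq(&bfqd->lock);

	return has_work;
}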

However, I think I've figured out an inexpensive / lightweight
solution that means we can let has_work() be inexact.  It's mostly the
same as this patch but implemented at the blk-mq layer (not the SCSI
layer) and doesn't add a spinlock.  I'll post a v2 and you can see if
you hate it or if it looks OK.  You can find it at:

https://lore.kernel.org/r/20200402085050.v2.2.I28278ef8ea27afc0ec7e597752a6d4e58c16176f@changeid
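
The gist (hand-wavy sketch only, see the patch above for the real
thing; BLK_MQ_S_NEED_DISPATCH is a made-up flag name): remember that
someone failed to get budget, and have whoever releases the budget
without dispatching re-run the queue.

/* On the side that loses the budget race: */
if (!blk_mq_get_dispatch_budget(hctx)) {
	set_bit(BLK_MQ_S_NEED_DISPATCH, &hctx->state);
	/*
	 * Re-check: the holder may have put the budget between our
	 * failed attempt and the set_bit(), in which case nobody
	 * would ever see the flag.
	 */
	if (!blk_mq_get_dispatch_budget(hctx))
		return;	/* holder will see the flag and rerun */
	clear_bit(BLK_MQ_S_NEED_DISPATCH, &hctx->state);
}

/* On the side that puts the budget without dispatching anything: */
blk_mq_put_dispatch_budget(hctx);
if (test_and_clear_bit(BLK_MQ_S_NEED_DISPATCH, &hctx->state))
	blk_mq_run_hw_queues(hctx->queue, true);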

-Doug

Thread overview: 15+ messages
2020-03-30 14:49 [PATCH 0/2] blk-mq: Fix two causes of IO stalls found in reboot testing Douglas Anderson
2020-03-30 14:49 ` [PATCH 1/2] blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick Douglas Anderson
2020-03-30 14:49 ` [PATCH 2/2] scsi: core: Fix stall if two threads request budget at the same time Douglas Anderson
2020-03-31  1:41   ` Ming Lei
2020-03-31  2:15     ` Doug Anderson
2020-03-31  2:58       ` Ming Lei
2020-03-31  4:05         ` Doug Anderson
2020-03-31 18:07     ` Paolo Valente
2020-03-31 18:26       ` Jens Axboe
2020-03-31 23:51         ` Doug Anderson
2020-04-01  1:21           ` Jens Axboe
2020-04-01  7:49             ` Paolo Valente
2020-04-02 15:52               ` Doug Anderson [this message]
2020-04-01  2:04           ` Ming Lei
2020-04-01  2:32             ` Doug Anderson
