From: Ming Lei <tom.leiming@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.garry@huawei.com>, Jens Axboe <axboe@kernel.dk>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	linux-block <linux-block@vger.kernel.org>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	Linux SCSI List <linux-scsi@vger.kernel.org>,
	Bart Van Assche <bvanassche@acm.org>,
	Hannes Reinecke <hare@suse.com>,
	Keith Busch <keith.busch@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Don Brace <don.brace@microsemi.com>,
	Kashyap Desai <kashyap.desai@broadcom.com>,
	Sathya Prakash <sathya.prakash@broadcom.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH V2 5/5] blk-mq: Wait for hctx inflight requests on CPU unplug
Date: Thu, 30 May 2019 12:11:17 +0800
Message-ID: <CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj9J5bS828Ng@mail.gmail.com>
In-Reply-To: <20190530022810.GA16730@ming.t460p>

On Thu, May 30, 2019 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, May 29, 2019 at 05:10:38PM +0100, John Garry wrote:
> >
> > > >
> > > > And we should be careful to handle the multiple reply queue case, given the queue
> > > > shouldn't be stopped or quiesced because other reply queues are still active.
> > > >
> > > > The new CPUHP state for blk-mq should be invoked after the to-be-offline
> > > > CPU is quiesced and before it becomes offline.
> > >
> > > Hi John,
> > >
> >
> > Hi Ming,
> >
> > > Thinking about this issue further: so far, one doable solution is to expose
> > > reply queues as blk-mq hw queues, as done by the following patchset:
> > >
> > > https://lore.kernel.org/linux-block/20180205152035.15016-1-ming.lei@redhat.com/
> >
> > I thought that this patchset had fundamental issues, in terms of working for
> > all types of hosts. FYI, I did the backport of the latest hisi_sas_v3 to v4.15
>
> Could you explain a bit what the fundamental issues are for all types of
> hosts?
>
> It is just for hosts with multiple reply queues, such as hisi_sas v3,
> megaraid_sas, mpt3sas and hpsa.
>
> > with this patchset (as you may have noticed in my git send mistake), but we
> > have not had a chance to test it yet.
> >
> > On a related topic, we did test exposing reply queues as blk-mq hw queues
> > and generating the host-wide tag internally in the LLDD with sbitmap, and
> > unfortunately we experienced a significant performance hit, like 2300K
> > -> 1800K IOPS for 4K reads.
> >
> > We need to test this further. I don't understand why we get such a big hit.
>
> In theory, the performance regression shouldn't exist; it happens because
> blk_mq_queue_tag_busy_iter() iterates over the same shared tags multiple
> times (once per hw queue), which can be fixed easily.
>
> >
> > >
> > > In that approach, global host-wide tags are shared across all blk-mq hw queues.
> > >
> > > Also, we can remove all the reply_map stuff in drivers, and then solve the
> > > problem of draining in-flight requests during CPU unplug in a generic way.
> >
> > So you're saying that removing this reply queue stuff can make the solution
> > to the problem more generic, but do you have an idea of the overall
> > solution?
>
> 1) convert reply queues into blk-mq hw queues first
>
> 2) then all drivers are in the same position w.r.t. handling requests vs.
> unplugging a CPU (shutting down its managed IRQ)
>
> The current handling in blk_mq_hctx_notify_dead() is actually wrong:
> at that time, all CPUs mapped to the hctx are dead, yet blk_mq_run_hw_queue()
> still dispatches requests on the driver's hw queue, those DEAD CPUs are
> invisible to the driver, and finally the interrupts for these requests
> on the hctx are lost.
>
> Frankly speaking, the second problem above is still hard to solve.
>
> 1) take_cpu_down() shuts down managed IRQs first, then runs the teardown
> callbacks for states in [CPUHP_AP_ONLINE, CPUHP_AP_OFFLINE) on the
> to-be-offline CPU
>
> 2) However, all runnable tasks are removed from the CPU in the teardown
> callback for CPUHP_AP_SCHED_STARTING, which runs after the managed IRQs
> are shut down. That said, it is hard to avoid new requests being queued
> to the hctx once all of its CPUs are DEAD.
>
> 3) we don't support freezing the queue for a specific hctx yet, and that
> approach may not be accepted anyway because of the extra cost in the fast path
>
> 4) once a request is allocated, it should be submitted to the driver no
> matter whether CPU hotplug happens or not. Or should we free it and
> re-allocate a new request on a proper sw/hw queue?

That looks doable: we may steal the bios from the old in-queue request, then
re-submit them via generic_make_request(), and finally free the old request,
but RQF_DONTPREP has to be addressed via a new callback.
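
Something like the minimal sketch below, assuming a hypothetical
blk_mq_resubmit_request() helper; blk_steal_bios(), blk_mq_free_request()
and generic_make_request() are existing interfaces, and the RQF_DONTPREP
callback is the missing piece:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Sketch only: re-queue a not-yet-started request by stealing its
 * bios and feeding them back to the block layer, where they will be
 * mapped to a live sw/hw queue.
 */
static void blk_mq_resubmit_request(struct request *rq)
{
	struct bio_list list;
	struct bio *bio;

	bio_list_init(&list);
	blk_steal_bios(&list, rq);	/* detach the bios from the old request */

	/*
	 * If the driver already prepared the request (RQF_DONTPREP), a
	 * new callback would have to release driver-private resources
	 * here, before the request itself is freed.
	 */
	blk_mq_free_request(rq);

	while ((bio = bio_list_pop(&list)))
		generic_make_request(bio);	/* re-submitted from a live CPU */
}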

So the overall solution for waiting for requests vs. CPU hotplug follows;
it is done in two stages (a rough code sketch comes after the two stages
below):

1) in the teardown callback of the new CPUHP state CPUHP_BLK_MQ_PREP, which
is run before CPUHP_AP_ONLINE_IDLE; at that time the CPU and its managed
IRQ are still alive:

- stop the hctx
- wait for the in-flight requests from this hctx until all are completed

2) in the teardown callback of CPUHP_BLK_MQ_DEAD, which is run after the
CPU is dead:

- dequeue the requests queued in this hctx's sw queues or scheduler queue
- steal the bios from each dequeued request, and re-submit them via
generic_make_request()
- free the dequeued request; driver resources need to be freed via a new
callback for RQF_DONTPREP (it looks like only SCSI needs it)
- restart this hctx
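
In cpuhotplug terms, the two stages could be wired up roughly as below.
Sketch only: CPUHP_BLK_MQ_PREP is the proposed new state,
blk_mq_hctx_drain_inflight() and blk_mq_hctx_resubmit_queued() are
hypothetical helpers standing in for the per-stage steps, and the
existing hctx->cpuhp_dead node is reused for brevity (in practice the
prep state would need its own hlist_node):

#include <linux/blk-mq.h>
#include <linux/cpuhotplug.h>

/* stage 1: the to-be-offline CPU and its managed IRQ are still alive */
static int blk_mq_hctx_notify_prep(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry(node, struct blk_mq_hw_ctx, cpuhp_dead);

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	blk_mq_stop_hw_queue(hctx);
	blk_mq_hctx_drain_inflight(hctx);	/* wait for all completions */
	return 0;
}

/* stage 2: run after the CPU is dead */
static int blk_mq_hctx_handle_dead(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry(node, struct blk_mq_hw_ctx, cpuhp_dead);

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	blk_mq_hctx_resubmit_queued(hctx);	/* dequeue + steal bios */
	blk_mq_start_hw_queue(hctx);
	return 0;
}

static int __init blk_mq_cpuhp_init(void)
{
	/* one instance per hctx is added later via cpuhp_state_add_instance() */
	cpuhp_setup_state_multi(CPUHP_BLK_MQ_PREP, "block/mq:prep",
				NULL, blk_mq_hctx_notify_prep);
	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead",
				NULL, blk_mq_hctx_handle_dead);
	return 0;
}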


Thanks,
Ming Lei

