From: Ming Lei <tom.leiming@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.garry@huawei.com>, Jens Axboe <axboe@kernel.dk>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	linux-block <linux-block@vger.kernel.org>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	Linux SCSI List <linux-scsi@vger.kernel.org>,
	Bart Van Assche <bvanassche@acm.org>,
	Hannes Reinecke <hare@suse.com>,
	Keith Busch <keith.busch@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Don Brace <don.brace@microsemi.com>,
	Kashyap Desai <kashyap.desai@broadcom.com>,
	Sathya Prakash <sathya.prakash@broadcom.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH V2 5/5] blk-mq: Wait for hctx inflight requests on CPU unplug
Date: Thu, 30 May 2019 12:11:17 +0800
Message-ID: <CACVXFVN729SgFQGUgmu1iN7P6Mv5+puE78STz8hj9J5bS828Ng@mail.gmail.com>
In-Reply-To: <20190530022810.GA16730@ming.t460p>

On Thu, May 30, 2019 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, May 29, 2019 at 05:10:38PM +0100, John Garry wrote:
> >
> > > >
> > > > And we should be careful to handle the multiple reply queue case, given the queue
> > > > shouldn't be stopped or quiesced because other reply queues are still active.
> > > >
> > > > The new CPUHP state for blk-mq should be invoked after the to-be-offline
> > > > CPU is quiesced and before it becomes offline.
> > >
> > > Hi John,
> > >
> >
> > Hi Ming,
> >
> > > Thinking about this issue further: so far, one doable solution is to expose
> > > reply queues as blk-mq hw queues, as done by the following patchset:
> > >
> > > https://lore.kernel.org/linux-block/20180205152035.15016-1-ming.lei@redhat.com/
> >
> > I thought that this patchset had fundamental issues, in terms of working for
> > all types of hosts. FYI, I did the backport of the latest hisi_sas_v3 to v4.15
>
> Could you explain a bit what the fundamental issues are for all types of
> hosts?
>
> It is just for hosts with multiple reply queues, such as hisi_sas v3,
> megaraid_sas, mpt3sas and hpsa.
>
> > with this patchset (as you may have noticed in my git send mistake), but we
> > have not had a chance to test it yet.
> >
> > On a related topic, we did test exposing reply queues as blk-mq hw queues
> > and generating the host-wide tag internally in the LLDD with sbitmap, and
> > unfortunately we experienced a significant performance hit, like 2300K
> > -> 1800K IOPS for 4K reads.
> >
> > We need to test this further. I don't understand why we get such a big hit.
>
> In theory, the performance regression shouldn't exist; it happens because
> blk_mq_queue_tag_busy_iter() iterates over the same shared tags multiple
> times (once per hw queue), which can be fixed easily.
>
> >
> > >
> > > In that approach, global host-wide tags are shared across all blk-mq hw queues.
> > >
> > > Also, we can remove all the reply_map stuff in drivers, and then solve the
> > > problem of draining in-flight requests during CPU unplug in a generic way.
> >
> > So you're saying that removing this reply queue stuff can make the solution
> > to the problem more generic, but do you have an idea of the overall
> > solution?
>
> 1) convert reply queues into blk-mq hw queues first
>
> 2) then all drivers are in the same position w.r.t. handling requests vs.
> unplugging a CPU (shutting down its managed IRQ)
>
> The current handling in blk_mq_hctx_notify_dead() is actually wrong:
> at that time, all CPUs mapped to the hctx are dead, yet blk_mq_run_hw_queue()
> still dispatches requests on the driver's hw queue, those DEAD CPUs are
> invisible to the driver, and finally the interrupts for these requests
> on the hctx are lost.
>
> Frankly speaking, the second problem above is still hard to solve.
>
> 1) take_cpu_down() shuts down managed IRQs first, then runs the teardown
> callbacks for states in [CPUHP_AP_ONLINE, CPUHP_AP_OFFLINE) on the
> to-be-offline CPU
>
> 2) However, all runnable tasks are removed from the CPU in the teardown
> callback for CPUHP_AP_SCHED_STARTING, which runs after the managed IRQs
> are shut down. That said, it is hard to avoid new requests being queued
> to the hctx once all of its CPUs are DEAD.
>
> 3) we don't support freezing the queue for a specific hctx yet, and that
> approach may not be accepted anyway because of the extra cost in the fast path
>
> 4) once a request is allocated, it should be submitted to the driver no
> matter whether CPU hotplug happens or not. Or should we free it and
> re-allocate a new request on a proper sw/hw queue?

That looks doable: we may steal the bios from the old in-queue request, then
re-submit them via generic_make_request(), and finally free the old request,
but RQF_DONTPREP has to be addressed via a new callback.
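
Something like the minimal sketch below, assuming a hypothetical
blk_mq_resubmit_request() helper; blk_steal_bios(), blk_mq_free_request()
and generic_make_request() are existing interfaces, and the RQF_DONTPREP
callback is the missing piece:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Sketch only: re-queue a not-yet-started request by stealing its
 * bios and feeding them back to the block layer, where they will be
 * mapped to a live sw/hw queue.
 */
static void blk_mq_resubmit_request(struct request *rq)
{
	struct bio_list list;
	struct bio *bio;

	bio_list_init(&list);
	blk_steal_bios(&list, rq);	/* detach the bios from the old request */

	/*
	 * If the driver already prepared the request (RQF_DONTPREP), a
	 * new callback would have to release driver-private resources
	 * here, before the request itself is freed.
	 */
	blk_mq_free_request(rq);

	while ((bio = bio_list_pop(&list)))
		generic_make_request(bio);	/* re-submitted from a live CPU */
}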

So the overall solution for waiting for requests vs. CPU hotplug follows;
it is done in two stages (a rough code sketch comes after the two stages
below):

1) in the teardown callback of the new CPUHP state CPUHP_BLK_MQ_PREP, which
is run before CPUHP_AP_ONLINE_IDLE; at that time the CPU and its managed
IRQ are still alive:

- stop the hctx
- wait for the in-flight requests from this hctx until all are completed

2) in the teardown callback of CPUHP_BLK_MQ_DEAD, which is run after the
CPU is dead:

- dequeue the requests queued in this hctx's sw queues or scheduler queue
- steal the bios from each dequeued request, and re-submit them via
generic_make_request()
- free the dequeued request; driver resources need to be freed via a new
callback for RQF_DONTPREP (it looks like only SCSI needs it)
- restart this hctx
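
In cpuhotplug terms, the two stages could be wired up roughly as below.
Sketch only: CPUHP_BLK_MQ_PREP is the proposed new state,
blk_mq_hctx_drain_inflight() and blk_mq_hctx_resubmit_queued() are
hypothetical helpers standing in for the per-stage steps, and the
existing hctx->cpuhp_dead node is reused for brevity (in practice the
prep state would need its own hlist_node):

#include <linux/blk-mq.h>
#include <linux/cpuhotplug.h>

/* stage 1: the to-be-offline CPU and its managed IRQ are still alive */
static int blk_mq_hctx_notify_prep(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry(node, struct blk_mq_hw_ctx, cpuhp_dead);

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	blk_mq_stop_hw_queue(hctx);
	blk_mq_hctx_drain_inflight(hctx);	/* wait for all completions */
	return 0;
}

/* stage 2: run after the CPU is dead */
static int blk_mq_hctx_handle_dead(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry(node, struct blk_mq_hw_ctx, cpuhp_dead);

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	blk_mq_hctx_resubmit_queued(hctx);	/* dequeue + steal bios */
	blk_mq_start_hw_queue(hctx);
	return 0;
}

static int __init blk_mq_cpuhp_init(void)
{
	/* one instance per hctx is added later via cpuhp_state_add_instance() */
	cpuhp_setup_state_multi(CPUHP_BLK_MQ_PREP, "block/mq:prep",
				NULL, blk_mq_hctx_notify_prep);
	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead",
				NULL, blk_mq_hctx_handle_dead);
	return 0;
}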


Thanks,
Ming Lei

