Re: [PATCH] blk-mq: Wait for for hctx inflight requests on CPU unplug

From: Ming Lei <ming.lei@redhat.com>
To: John Garry <john.garry@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Bart Van Assche <bvanassche@acm.org>,
	Hannes Reinecke <hare@suse.com>,
	Keith Busch <keith.busch@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Kashyap Desai <kashyap.desai@broadcom.com>,
	chenxiang <chenxiang66@hisilicon.com>
Subject: Re: [PATCH] blk-mq: Wait for for hctx inflight requests on CPU unplug
Date: Wed, 22 May 2019 18:01:41 +0800	[thread overview]
Message-ID: <20190522100139.GA16877@ming.t460p> (raw)
In-Reply-To: <ce014369-4bf2-55fe-3c0f-3a46d3a016dc@huawei.com>

On Wed, May 22, 2019 at 10:06:16AM +0100, John Garry wrote:
> > > > 
> > > > +static int blk_mq_hctx_notify_prepare(unsigned int cpu, struct hlist_node *node)
> > > > +{
> > > > +	struct blk_mq_hw_ctx	*hctx;
> > > > +	struct blk_mq_tags	*tags;
> > > > +
> > > > +	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> > > > +	tags = hctx->tags;
> > > > +
> > > > +	if (tags)
> > > > +		clear_bit(BLK_MQ_TAGS_DRAINED, &tags->flags);
> > > > +
> > > 
> > > Hi Ming,
> > > 
> > > Thanks for the effort here.
> > > 
> > > I would like to make an assertion on a related topic, which I hope you can
> > > comment on:
> > > 
> > > For this drain mechanism to work, the blk_mq_hw_ctx’s (and related cpu
> > > masks) for a request queue are required to match the hw queues used in the
> > > LLDD (if using managed interrupts).
> > > 
> > > In others words, a SCSI LLDD needs to expose all hw queues for this to work.
> > > 
> > > The reason I say this is because if the LLDD does not expose the hw queues
> > > and manages them internally - as some SCSI LLDDs do - yet uses managed
> > > interrupts to spread the hw queue MSI vectors across all CPUs, then we still
> > > only have a single blk_mq_hw_ctx per rq with a cpumask covering all cpus,
> > > which is not what we would want.
> > 
> 
> Hi Ming,
> 
> > Good catch!
> > 
> > This drain mechanism won't address the issue for these SCSI LLDDs in which:
> > 
> > 	1) blk_mq_hw_ctx serves as submission hw queue
> > 
> > 	2) one private reply queue serves as completion queue, for which one
> > 	MSI vector with cpumask is setup via pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY).
> > 
> > What we should only drain is the completion queue if all its mapped
> > CPUs are offline.
> > 
> > Looks you suggest to expose all completion(reply) queues as 'struct blk_mq_hw_ctx',
> > which may involve in another more hard problem:  how to split the single
> > hostwide tags into each reply queue.
> 
> Yes, and this is what I expecting to hear Re. hostwide tags.
> 
> I'd rather not work towards that
> > direction because:
> > 
> > 1) it is very hard to partition global resources into several parts,
> > especially it is hard to make every part happy.
> > 
> > 2) sbitmap is smart/efficient enough for this global allocation
> > 
> > 3) no obvious improvement is obtained from the resource partition, according
> > to previous experiment result done by Kashyap.
> 
> I'd like to also do the test.
> 
> However I would need to forward port the patchset, which no longer cleanly
> applies (I was referring to this https://lore.kernel.org/linux-block/20180205152035.15016-1-ming.lei@redhat.com/).
> Any help with that would be appreciated.

The queue type change causes patches not applied any more.

Could you just test the patch against v4.15 and see if there is any
improvement?

Even if it may improve performance on hisi_sas, I still suggest to not
use that approach to solve the issue for draining in-flight requests when
all CPUs of one hw queue becomes offline, since this way might hurt
performance on other drivers.

> 
> > 
> > I think we could implement the drain mechanism in the following way:
> > 
> > 1) if 'struct blk_mq_hw_ctx' serves as completion queue, use the
> > approach in the patch
> 
> Maybe the gain of exposing multiple queues+managed interrupts outweighs the
> loss in the LLDD of having to generate this unique tag with sbitmap; I know

The unique tag has zero cost, see blk_mq_unique_tag().

> that we did not use sbitmap ever in the LLDD for generating the tag when
> testing previously. However I'm still not too hopeful.
> 
> > 
> > 2) otherwise:
> > - introduce one callbcack of .prep_queue_dead(hctx, down_cpu) to
> > 'struct blk_mq_ops'
> 
> This would not be allowed to block, right?

It is allowed to block in CPU hotplug handler.

> 
> > 
> > - call .prep_queue_dead from blk_mq_hctx_notify_dead()
> > 
> > 3) inside .prep_queue_dead():
> > - the driver checks if all mapped CPU on the completion queue is offline
> > - if yes, wait for in-flight requests originated from all CPUs mapped to
> > this completion queue, and it can be implemented as one block layer API
> 
> That could work. However I think that someone may ask why the LLDD just
> doesn't register for the CPU hotplug event itself (which I would really
> rather avoid), instead of being relayed the info from the block layer.

.prep_queue_dead() is run from blk-mq's CPU hotplug handler.

I also think of abstracting completion queue in blk-mq for hpsa,
hisi_sas_v3_hw and  mpt3sas, but that can't cover to drain device's internal
command, so looks it is inevitable for us to introduce driver callback.

Thanks,
Ming