linux-block.vger.kernel.org archive mirror
From: Ming Lei <ming.lei@redhat.com>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-block@vger.kernel.org, John Garry <john.garry@huawei.com>,
	Hannes Reinecke <hare@suse.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: blk-mq: improvement CPU hotplug (simplified version) v3
Date: Fri, 22 May 2020 10:39:23 +0800	[thread overview]
Message-ID: <20200522023923.GC755458@T590> (raw)
In-Reply-To: <7accb5b2-6c7d-0e0d-56df-d06e8d9ac5af@acm.org>

On Thu, May 21, 2020 at 12:15:52PM -0700, Bart Van Assche wrote:
> On 2020-05-20 21:33, Ming Lei wrote:
> > No.
> > 
> > If vector 3 is for covering hw queue 12 ~ 15, the vector shouldn't be
> > shutdown when cpu 14 is offline.
> > 
> > Also I am pretty sure that we don't do this way with managed IRQ. And
> > non-managed IRQ will be migrated to other online cpus during cpu offline,
> > so not an issue at all. See migrate_one_irq().
> 
> Thanks for the pointer to migrate_one_irq().
> 
> However, I'm not convinced the above statement is correct. My
> understanding is that the block driver knows which interrupt vector has
> been associated with which hardware queue but the blk-mq core not. It
> seems to me that patch 6/6 of this series is based on the following
> assumptions:
> (a) That the interrupt that is associated with a hardware queue is
>     processed by one of the CPU's in hctx->cpumask.
> (b) That hardware queues do not share interrupt vectors.
> 
> I don't think that either assumption is correct.

What the patch tries to do is just:

- when the last CPU in hctx->cpumask is about to go offline, mark
this hctx as inactive, then drain any in-flight IO requests that
originated from this hctx

The correctness argument is that once we stop producing requests, we can
drain any in-flight requests before shutting down the last CPU of the hctx.
After that, the hctx is completely quiesced. Do you think this approach is
wrong? If so, please prove it.

So the correctness of patch 6/6 does not depend on those two assumptions,
does it?

This approach solves the request-timeout / never-completed issue in the
case where the managed interrupt's affinity is the same as the hw queue's
cpumask. I believe this is the normal usage, and most storage drivers use
managed interrupts in exactly this way. The motivation of this patch is to
fix that normal usage.

You may argue that two hw queues may share a single managed interrupt; that
is possible if the driver plays such a trick. But if the driver does play
that trick, it is the driver's responsibility to guarantee that the managed
irq won't be shut down while either of the two hctxs is still active, for
example by making sure that hctx1->cpumask | hctx2->cpumask is a subset of
this managed interrupt's affinity.
It is definitely a strange enough case, and this patch is not supposed to
cover it. But the patch won't break that case either. Also, just out of
curiosity: do you have such a case in-tree? And are you sure that driver
uses managed interrupts?

Again, there is no such problem with non-managed interrupts, because they
will be migrated to other online CPUs. This patchset is harmless for
non-managed interrupts, and it is still correct to quiesce a hctx after all
of its CPUs become offline from the blk-mq queue mapping's point of view,
because no requests can be produced any more.



Thanks,
Ming



Thread overview: 31+ messages
2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
2020-05-20 18:16   ` Bart Van Assche
2020-05-22  9:11   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
2020-05-20 18:22   ` Bart Van Assche
2020-05-22  9:13   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
2020-05-20 20:10   ` Bart Van Assche
2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
2020-05-22  9:17   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
2020-05-20 20:24   ` Bart Van Assche
2020-05-27  6:05     ` Christoph Hellwig
2020-05-22  9:18   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
2020-05-22  9:25   ` Hannes Reinecke
2020-05-25  9:20     ` Ming Lei
2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
2020-05-21  2:57   ` Ming Lei
2020-05-21  3:50     ` Bart Van Assche
2020-05-21  4:33       ` Ming Lei
2020-05-21 19:15         ` Bart Van Assche
2020-05-22  2:39           ` Ming Lei [this message]
2020-05-22 14:47             ` Keith Busch
2020-05-23  3:05               ` Ming Lei
2020-05-23 15:19             ` Bart Van Assche
2020-05-25  4:09               ` Ming Lei
2020-05-25 15:32                 ` Bart Van Assche
2020-05-25 16:38                   ` Keith Busch
2020-05-26  0:37                   ` Ming Lei
