From: Ming Lei <ming.lei@redhat.com>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-block@vger.kernel.org, John Garry <john.garry@huawei.com>,
	Hannes Reinecke <hare@suse.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: blk-mq: improvement CPU hotplug (simplified version) v3
Date: Thu, 21 May 2020 10:57:44 +0800	[thread overview]
Message-ID: <20200521025744.GC735749@T590> (raw)
In-Reply-To: <0cbc37cf-5439-c68c-3581-b3c436932388@acm.org>

On Wed, May 20, 2020 at 02:46:52PM -0700, Bart Van Assche wrote:
> On 2020-05-20 10:06, Christoph Hellwig wrote:
> > this series ensures I/O is quiesced before a cpu and thus the managed
> > interrupt handler is shut down.
> > 
> > This patchset tries to address the issue by the following approach:
> > 
> >  - before the last cpu in hctx->cpumask is going to offline, mark this
> >    hctx as inactive
> > 
> >  - disable preempt during allocating tag for request, and after tag is
> >    allocated, check if this hctx is inactive. If yes, give up the
> >    allocation and try remote allocation from online CPUs
> > 
> >  - before hctx becomes inactive, drain all allocated requests on this
> >    hctx
> 
> What is not clear to me is which assumptions about the relationship
> between interrupts and hardware queues this patch series is based on.
> Does this patch series perhaps only support a 1:1 mapping between
> interrupts and hardware queues?

No, it supports any mapping, but the issue can't be triggered with a 1:N
mapping, since that kind of hctx never becomes inactive.

> What if there are more hardware queues
> than interrupts? An example of a block driver that allocates multiple

It doesn't matter; see the comment below.

> hardware queues is the NVMeOF initiator driver. From the NVMeOF
> initiator driver function nvme_rdma_alloc_tagset() and for the code that
> refers to I/O queues:
> 
> 	set->nr_hw_queues = nctrl->queue_count - 1;
> 
> From nvme_rdma_alloc_io_queues():
> 
> 	nr_read_queues = min_t(unsigned int, ibdev->num_comp_vectors,
> 				min(opts->nr_io_queues,
> 				    num_online_cpus()));
> 	nr_default_queues =  min_t(unsigned int,
> 	 			ibdev->num_comp_vectors,
> 				min(opts->nr_write_queues,
> 					 num_online_cpus()));
> 	nr_poll_queues = min(opts->nr_poll_queues, num_online_cpus());
> 	nr_io_queues = nr_read_queues + nr_default_queues +
> 			 nr_poll_queues;
> 	[ ... ]
> 	ctrl->ctrl.queue_count = nr_io_queues + 1;
> 
> From nvmf_parse_options():
> 
> 	/* Set defaults */
> 	opts->nr_io_queues = num_online_cpus();
> 
> Can this e.g. result in 16 hardware queues being allocated for I/O even
> if the underlying RDMA adapter only supports four interrupt vectors?
> Does that mean that four hardware queues will be associated with each
> interrupt vector?

The patchset doesn't actually bind hctxs to interrupt vectors, which is to
say we don't care about the actual interrupt allocation.

> If the CPU to which one of these interrupt vectors has
> been assigned is hotplugged, does that mean that four hardware queues
> have to be quiesced instead of only one as is done in patch 6/6?

No, a hctx only becomes inactive after every CPU in hctx->cpumask is offline.
No matter how interrupt vectors are assigned to the hctx, requests should no
longer be dispatched to it after that point.


Thanks,
Ming


