From: Sagi Grimberg <sagi@grimberg.me>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	James Smart <james.smart@broadcom.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
Date: Mon, 18 Nov 2019 16:05:56 -0800
Message-ID: <016afdbc-9c63-4193-e64b-aad91ba5fcc1@grimberg.me>
In-Reply-To: <20191116071754.GB18194@ming.t460p>


> Hi Sagi,
> 
> On Fri, Nov 15, 2019 at 02:38:44PM -0800, Sagi Grimberg wrote:
>>
>>> Hi,
>>
>> Hey Ming,
>>
>>> Use blk_mq_alloc_request() for allocating the NVMe loop, fc, rdma and
>>> tcp connect requests, and select the transport queue at runtime for the
>>> connect request.
>>>
>>> Then kill blk_mq_alloc_request_hctx().
>>
>> Is it really so wrong to have an API to allocate a tag that belongs to
>> a specific queue? Why must the tag allocation always correlate to the
>> running CPU? It's true that NVMe is the only consumer of this at the
>> moment, but does this mean that the interface should be removed because
>> it has one (or rather four) consumer(s)?
> 
> blk-mq now uses a static queue mapping between CPUs and hw queues. Given
> that CPU hotplug may happen at any time, the specified hw queue may become
> inactive at any time.
> 
> The CPU-to-hw-queue mapping is a core part of the blk-mq model, which
> relies heavily on the fact that whether a hw queue is active depends on
> the request's submission CPU. And we can always retrieve an active hw
> queue as long as there is at least one online CPU.
> 
> An IO request is always mapped to the proper hw queue via the submission
> CPU and queue type.
> 
> So blk_mq_alloc_request_hctx() is really at odds with the blk-mq model
> described above.
> 
> Actually those 4 consumers are just one single type of usage, submitting
> the connect command; it seems no one has explained this special usage
> before. And the driver can handle it well enough without this interface,
> just as this patch does, can't it?
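
For reference, here is roughly how the fabrics connect path ends up on a
specific hw queue today. This is a simplified sketch of what the ~v5.4
nvme_alloc_request() does, not verbatim code, and it assumes I/O queue qid
maps to hw queue qid - 1:

/*
 * Simplified sketch: for qid == NVME_QID_ANY the request is mapped from
 * the submitting CPU as usual; for a fabrics connect on I/O queue 'qid'
 * the tag is pinned to hw queue qid - 1 via blk_mq_alloc_request_hctx().
 */
struct request *nvme_alloc_request(struct request_queue *q,
		struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid)
{
	unsigned int op = nvme_is_write(cmd) ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN;

	if (qid == NVME_QID_ANY)
		return blk_mq_alloc_request(q, op, flags);

	return blk_mq_alloc_request_hctx(q, op, flags, qid - 1);
}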

Does removing the cpumask_and with cpu_online_mask fix your test?

This check is wrong to begin with, because nothing guarantees that it still
holds right after it is made.
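
To make that concrete, this is roughly the check in question inside
blk_mq_alloc_request_hctx() (paraphrased from memory of the ~v5.4 code,
not verbatim):

	struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[hctx_idx];
	struct blk_mq_ctx *ctx;
	unsigned int cpu;

	if (!blk_mq_hw_queue_mapped(hctx))
		return ERR_PTR(-EXDEV);

	/* Fails unless some CPU mapped to this hctx is online right now... */
	cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
	if (cpu >= nr_cpu_ids)
		return ERR_PTR(-EXDEV);

	/* ...but nothing stops that CPU from going offline right after this. */
	ctx = __blk_mq_get_ctx(q, cpu);

The suggestion is to drop the cpu_online_mask restriction (e.g. just take
cpumask_first(hctx->cpumask)) and let the caller deal with a queue whose
CPUs happen to be offline.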

This is a much simpler fix that does not create this churn local to
every driver. Also, I don't like the assumptions about tag reservations
that the drivers are making locally (that the connect will have tag 0
for example). All this makes this look like a hack.
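
For context, this is roughly how the reserved connect tag works today in
the fabrics drivers (a sketch of the ~v5.4 setup, not the RFC code itself):

	/* Each fabrics driver reserves one tag per queue for connect: */
	set->reserved_tags = 1;		/* fabric connect */

	/* ...and the connect request is allocated from the reserved pool: */
	req = nvme_alloc_request(ctrl->connect_q, &cmd,
				 BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED, qid);

Nothing in that interface promises which tag number the reserved request
gets, so driver-local logic built on "connect == tag 0" relies on an
allocator implementation detail rather than an API guarantee.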

There is also the question of what happens when we want to make connects
parallel, which is not the case at the moment...

>> I would instead suggest simply removing the constraint that
>> blk_mq_alloc_request_hctx() will fail if the first CPU in the mask
>> is not in cpu_online_mask. The caller of this would know and
>> be able to handle it.
> 
> Of course, this usage is very special, and different from normal
> IO or passthrough requests.
> 
> The caller of this still needs to rely on blk-mq for dispatching this
> request:
> 
> 1) blk-mq needs to run the hw queue in round-robin style, and a different
> CPU is selected from the active CPU mask for running the hw queue.
> 
> 2) Most blk-mq drivers have switched to managed IRQs, which may be shut
> down even though there are still in-flight requests not yet completed on
> the hw queue. We need to fix this issue. One solution is to free the
> request and remap the bios onto a proper active hw queue according to
> the new submission CPU.
> 
> 3) a warning will be triggered when dispatching a request on an inactive
> hw queue
> 
> If we are going to support this special usage, a lot of blk-mq needs to
> change to cover such a special case.
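
As an illustration of point 1), the hw queue run work is spread over the
online CPUs mapped to the hctx, roughly like this (a simplified sketch in
the spirit of blk_mq_hctx_next_cpu(), not the actual code; the helper name
is made up):

	/* Pick the next online CPU mapped to this hctx, round-robin. */
	static int next_cpu_for_hctx(struct blk_mq_hw_ctx *hctx)
	{
		int next;

		next = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
					cpu_online_mask);
		if (next >= nr_cpu_ids)
			next = cpumask_first_and(hctx->cpumask, cpu_online_mask);

		hctx->next_cpu = next;
		return next;
	}

A request sitting on an hctx whose mapped CPUs are all offline leaves this
with nothing to pick, which is where the special connect usage collides
with the model described above.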

I'm starting to think we may need to get the connect out of block layer
execution if it's such a big problem... It's a real shame if that is
the case...

>> To me it feels like this approach is fundamentally wrong. IMO, having
>> the driver select a different queue than the one the tag naturally
>> belongs to feels like a backwards design.
> 
> The root cause is that the connect command needs to be submitted via
> one specific queue, which is fundamentally not compatible with blk-mq's
> queue mapping model.
> 
> Given that the connect command is so special for nvme, I'd suggest letting
> the driver handle it completely, since blk-mq is supposed to serve generic IO.

It's one thing to handle it locally, and another to hack queue_rq to queue
on a different queue than the one the block layer determined... If this
model is fundamentally broken with how the block layer dispatches
requests, then we need a different solution.

