linux-block.vger.kernel.org archive mirror
From: Sagi Grimberg <sagi@grimberg.me>
To: Jens Axboe <axboe@kernel.dk>, Ming Lei <ming.lei@redhat.com>
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>,
	linux-block@vger.kernel.org, Ming Lin <mlin@kernel.org>,
	Chao Leng <lengchao@huawei.com>
Subject: Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface
Date: Mon, 27 Jul 2020 20:29:43 -0700	[thread overview]
Message-ID: <0af89fcf-3505-acb1-6c91-1fff8e53b146@grimberg.me> (raw)
In-Reply-To: <baede23a-94c1-1494-bcca-964e1396f253@kernel.dk>


>>>>>>> +static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
>>>>>>> +{
>>>>>>> +	struct blk_mq_hw_ctx *hctx;
>>>>>>> +	unsigned int i;
>>>>>>> +
>>>>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>>>>> +
>>>>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>> +		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
>>>>>>> +		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
>>>>>>> +		if (!hctx->rcu_sync)
>>>>>>> +			continue;
>>>>>>
>>>>>> This approach of quiesce/unquiesce tagset is good abstraction.
>>>>>>
>>>>>> Just one more thing: please allocate an rcu_sync array separately,
>>>>>> since the hctx isn't supposed to store scratch data.
>>>>>
>>>>> I'd be all for not stuffing this in the hctx, but how would that work?
>>>>> The only thing I can think of that would work reliably is batching the
>>>>> queue+wait into units of N. We could potentially have many thousands of
>>>>> queues, and it could get iffy (and/or unreliable) in terms of allocation
>>>>> size. Looks like rcu_synchronize is 48-bytes on my local install, and it
>>>>> doesn't take a lot of devices at current CPU counts to make an alloc
>>>>> covering all of it huge. Let's say 64 threads, and 32 devices, then
>>>>> we're already at 64*32*48 bytes which is an order 5 allocation. Not
>>>>> friendly, and not going to be reliable when you need it. And if we start
>>>>> batching in reasonable counts, then we're _almost_ back to doing a queue
>>>>> or two at a time... 32 * 48 is 1536 bytes, so we could only do two at
>>>>> a time for single-page allocations.
>>>>
>>>> We can convert this to order-0 allocations by adding one extra indirection array.
>>>
>>> I guess that could work, and would just be one extra alloc + free if we
>>> still retain the batch. That'd take it to 16 devices (at 32 CPUs) per
>>> round, potentially way less of course if we have more CPUs. So still
>>> somewhat limiting, rather than do all at once.
>>
>> With the approach used in blk_mq_alloc_rqs(), each allocated page can be
>> added to a list, so the indirect array can be avoided. It then becomes
>> possible to allocate for any number of queues/devices, since each
>> allocation is just a single page made only when one is needed; no
>> pre-calculation is required.
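[A userspace sketch of the page-list idea Ming describes: page-sized chunks are chained on a list, each holding as many fixed-size sync slots as fit, so only order-0 allocations are ever made and no indirection array is needed. The struct names and the 48-byte slot size are illustrative stand-ins, not the kernel's actual layout:]

```c
#include <stdlib.h>

#define PAGE_SIZE 4096UL

struct rcu_sync_stub { char pad[48]; };	/* stand-in for struct rcu_synchronize */

/* One page-sized chunk; chunks are chained so no indirect array is needed. */
struct sync_page {
	struct sync_page *next;
	unsigned int used;
	struct rcu_sync_stub syncs[];
};

#define SYNCS_PER_PAGE \
	((PAGE_SIZE - sizeof(struct sync_page)) / sizeof(struct rcu_sync_stub))

/* Hand out the next free slot, allocating a fresh order-0 page on demand. */
static struct rcu_sync_stub *alloc_sync(struct sync_page **head)
{
	struct sync_page *p = *head;

	if (!p || p->used == SYNCS_PER_PAGE) {
		p = calloc(1, PAGE_SIZE);
		if (!p)
			return NULL;
		p->next = *head;
		*head = p;
	}
	return &p->syncs[p->used++];
}

static void free_syncs(struct sync_page *head)
{
	while (head) {
		struct sync_page *next = head->next;

		free(head);
		head = next;
	}
}
```

[The failure path stays simple: an allocation failure mid-loop just means fewer slots were handed out, and free_syncs() tears down whatever was chained so far.]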
> 
> As long as we watch the complexity, I don't think we need to go overboard
> here at the risk of adding issues in the failure path.

No, we don't. I'd prefer not to. And if this turns out to be that bad,
we can convert it to a more complicated page vector later.

I'll move forward with this approach.


Thread overview: 40+ messages
2020-07-27 23:10 [PATCH v5 0/2] improve nvme quiesce time for large amount of namespaces Sagi Grimberg
2020-07-27 23:10 ` [PATCH v5 1/2] blk-mq: add tagset quiesce interface Sagi Grimberg
2020-07-27 23:32   ` Keith Busch
2020-07-28  0:12     ` Sagi Grimberg
2020-07-28  1:40   ` Ming Lei
2020-07-28  1:51     ` Jens Axboe
2020-07-28  2:17       ` Ming Lei
2020-07-28  2:23         ` Jens Axboe
2020-07-28  2:28           ` Ming Lei
2020-07-28  2:32             ` Jens Axboe
2020-07-28  3:29               ` Sagi Grimberg [this message]
2020-07-28  3:25     ` Sagi Grimberg
2020-07-28  7:18   ` Christoph Hellwig
2020-07-28  7:48     ` Sagi Grimberg
2020-07-28  9:16     ` Ming Lei
2020-07-28  9:24       ` Sagi Grimberg
2020-07-28  9:33         ` Ming Lei
2020-07-28  9:37           ` Sagi Grimberg
2020-07-28  9:43             ` Sagi Grimberg
2020-07-28 10:10               ` Ming Lei
2020-07-28 10:57                 ` Christoph Hellwig
2020-07-28 14:13                 ` Paul E. McKenney
2020-07-28 10:58             ` Christoph Hellwig
2020-07-28 16:25               ` Sagi Grimberg
2020-07-28 13:54         ` Paul E. McKenney
2020-07-28 23:46           ` Sagi Grimberg
2020-07-29  0:31             ` Paul E. McKenney
2020-07-29  0:43               ` Sagi Grimberg
2020-07-29  0:59                 ` Keith Busch
2020-07-29  4:39                   ` Sagi Grimberg
2020-08-07  9:04                     ` Chao Leng
2020-08-07  9:24                       ` Ming Lei
2020-08-07  9:35                         ` Chao Leng
2020-07-29  4:10                 ` Paul E. McKenney
2020-07-29  4:37                   ` Sagi Grimberg
2020-07-27 23:10 ` [PATCH v5 2/2] nvme: use blk_mq_[un]quiesce_tagset Sagi Grimberg
2020-07-28  0:54   ` Sagi Grimberg
2020-07-28  3:21     ` Chao Leng
2020-07-28  3:34       ` Sagi Grimberg
2020-07-28  3:51         ` Chao Leng
