linux-block.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>,
	linux-block@vger.kernel.org, Ming Lin <mlin@kernel.org>,
	Chao Leng <lengchao@huawei.com>
Subject: Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface
Date: Mon, 27 Jul 2020 20:32:53 -0600
Message-ID: <baede23a-94c1-1494-bcca-964e1396f253@kernel.dk>
In-Reply-To: <20200728022802.GC1305646@T590>

On 7/27/20 8:28 PM, Ming Lei wrote:
> On Mon, Jul 27, 2020 at 08:23:15PM -0600, Jens Axboe wrote:
>> On 7/27/20 8:17 PM, Ming Lei wrote:
>>> On Mon, Jul 27, 2020 at 07:51:16PM -0600, Jens Axboe wrote:
>>>> On 7/27/20 7:40 PM, Ming Lei wrote:
>>>>> On Mon, Jul 27, 2020 at 04:10:21PM -0700, Sagi Grimberg wrote:
>>>>>> Drivers with shared tagsets may need to quiesce a potentially large
>>>>>> number of request queues that all share a single tagset (e.g. nvme).
>>>>>> Add an interface to quiesce all the queues on a given tagset. This
>>>>>> interface is useful because it can speed up the quiesce by doing it
>>>>>> in parallel.
>>>>>>
>>>>>> For tagsets that have BLK_MQ_F_BLOCKING set, we issue call_srcu for
>>>>>> all hctxs in parallel, such that all of them wait for the same RCU
>>>>>> grace period with a per-hctx heap-allocated rcu_synchronize. For
>>>>>> tagsets that don't have BLK_MQ_F_BLOCKING set, a single
>>>>>> synchronize_rcu is sufficient.
>>>>>>
>>>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>>>> ---
>>>>>>  block/blk-mq.c         | 66 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>  include/linux/blk-mq.h |  4 +++
>>>>>>  2 files changed, 70 insertions(+)
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index abcf590f6238..c37e37354330 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -209,6 +209,42 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>>>>>>  
>>>>>> +static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
>>>>>> +{
>>>>>> +	struct blk_mq_hw_ctx *hctx;
>>>>>> +	unsigned int i;
>>>>>> +
>>>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>>>> +
>>>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>>>> +		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
>>>>>> +		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
>>>>>> +		if (!hctx->rcu_sync)
>>>>>> +			continue;
>>>>>
>>>>> This approach of quiescing/unquiescing the whole tagset is a good
>>>>> abstraction.
>>>>>
>>>>> Just one more thing: please allocate an rcu_sync array instead,
>>>>> because the hctx isn't supposed to store scratch data like this.
>>>>
>>>> I'd be all for not stuffing this in the hctx, but how would that work?
>>>> The only thing I can think of that would work reliably is batching the
>>>> queue+wait into units of N. We could potentially have many thousands of
>>>> queues, and it could get iffy (and/or unreliable) in terms of allocation
>>>> size. Looks like rcu_synchronize is 48 bytes on my local install, and it
>>>> doesn't take a lot of devices at current CPU counts to make an alloc
>>>> covering all of it huge. Say 64 threads and 32 devices: then we're
>>>> already at 64*32*48 bytes, which is an order-5 allocation. Not
>>>> friendly, and not going to be reliable when you need it. And if we
>>>> start batching in reasonable counts, then we're _almost_ back to doing
>>>> a queue or two at a time... 32 * 48 is 1536 bytes, so we could only do
>>>> two at a time with single page allocations.
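
(Expanding my own math above; the 48 bytes is just what my local build
shows for sizeof(struct rcu_synchronize), so treat it as illustrative:

	64 hctxs/queue * 32 queues * 48 bytes = 98304 bytes = 24 pages,
	which rounds up to an order-5 allocation: 32 contiguous pages,
	or 128KB.

And per order-0 page: 4096 / (32 * 48) = 2 queues' worth of entries.)
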
>>>
>>> We could convert that to order-0 allocations with one extra indirect array.
>>
>> I guess that could work, and it would just be one extra alloc + free if
>> we still retain the batch. That'd take it to 16 devices (at 32 CPUs) per
>> round, potentially way less of course if we have more CPUs. So still
>> somewhat limiting, rather than doing it all at once.
> 
> With the approach used in blk_mq_alloc_rqs(), each allocated page can be
> added to a list, so the indirect array can be dropped. That makes it
> possible to handle any number of queues/devices, since each allocation
> is just a single page done only when it is actually needed, and no
> pre-calculation is required.

As long as we watch the complexity, I don't think we need to go overboard
here at the risk of adding issues in the failure path. But yes, we could
use the same trick I did in blk_mq_alloc_rqs() and just allocate pages as
we go.
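
Roughly along these lines, as an untested sketch: carve the
rcu_synchronize entries out of order-0 pages that are chained on a
list, so there's no indirect array at all. The quiesce_page naming
below is made up for illustration and is not from the patch:

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/rcupdate_wait.h>	/* struct rcu_synchronize */

/* page header; entries are carved from the rest of the page */
struct quiesce_page {
	struct list_head lru;
	unsigned int used;
};

static struct rcu_synchronize *quiesce_alloc_one(struct list_head *pages)
{
	const unsigned int per_page =
		(PAGE_SIZE - sizeof(struct quiesce_page)) /
		sizeof(struct rcu_synchronize);
	struct quiesce_page *qp;

	qp = list_first_entry_or_null(pages, struct quiesce_page, lru);
	if (!qp || qp->used == per_page) {
		/* order-0 only; if this fails, the caller can fall back
		 * to a plain synchronous quiesce for that queue */
		qp = (void *)__get_free_page(GFP_KERNEL);
		if (!qp)
			return NULL;
		qp->used = 0;
		list_add(&qp->lru, pages);
	}
	return (struct rcu_synchronize *)(qp + 1) + qp->used++;
}

The caller declares the list with LIST_HEAD(), grabs one entry per
hctx to hand to call_srcu(), waits for all the completions, and then
frees everything with a single walk of the page list.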

-- 
Jens Axboe



Thread overview: 40+ messages
2020-07-27 23:10 [PATCH v5 0/2] improve nvme quiesce time for large amount of namespaces Sagi Grimberg
2020-07-27 23:10 ` [PATCH v5 1/2] blk-mq: add tagset quiesce interface Sagi Grimberg
2020-07-27 23:32   ` Keith Busch
2020-07-28  0:12     ` Sagi Grimberg
2020-07-28  1:40   ` Ming Lei
2020-07-28  1:51     ` Jens Axboe
2020-07-28  2:17       ` Ming Lei
2020-07-28  2:23         ` Jens Axboe
2020-07-28  2:28           ` Ming Lei
2020-07-28  2:32             ` Jens Axboe [this message]
2020-07-28  3:29               ` Sagi Grimberg
2020-07-28  3:25     ` Sagi Grimberg
2020-07-28  7:18   ` Christoph Hellwig
2020-07-28  7:48     ` Sagi Grimberg
2020-07-28  9:16     ` Ming Lei
2020-07-28  9:24       ` Sagi Grimberg
2020-07-28  9:33         ` Ming Lei
2020-07-28  9:37           ` Sagi Grimberg
2020-07-28  9:43             ` Sagi Grimberg
2020-07-28 10:10               ` Ming Lei
2020-07-28 10:57                 ` Christoph Hellwig
2020-07-28 14:13                 ` Paul E. McKenney
2020-07-28 10:58             ` Christoph Hellwig
2020-07-28 16:25               ` Sagi Grimberg
2020-07-28 13:54         ` Paul E. McKenney
2020-07-28 23:46           ` Sagi Grimberg
2020-07-29  0:31             ` Paul E. McKenney
2020-07-29  0:43               ` Sagi Grimberg
2020-07-29  0:59                 ` Keith Busch
2020-07-29  4:39                   ` Sagi Grimberg
2020-08-07  9:04                     ` Chao Leng
2020-08-07  9:24                       ` Ming Lei
2020-08-07  9:35                         ` Chao Leng
2020-07-29  4:10                 ` Paul E. McKenney
2020-07-29  4:37                   ` Sagi Grimberg
2020-07-27 23:10 ` [PATCH v5 2/2] nvme: use blk_mq_[un]quiesce_tagset Sagi Grimberg
2020-07-28  0:54   ` Sagi Grimberg
2020-07-28  3:21     ` Chao Leng
2020-07-28  3:34       ` Sagi Grimberg
2020-07-28  3:51         ` Chao Leng
