Re: [PATCH v3 1/2] blk-mq: add async quiesce interface

From: Sagi Grimberg <sagi@grimberg.me>
To: Ming Lei <ming.lei@redhat.com>
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Chao Leng <lengchao@huawei.com>
Subject: Re: [PATCH v3 1/2] blk-mq: add async quiesce interface
Date: Mon, 27 Jul 2020 11:36:08 -0700	[thread overview]
Message-ID: <2c2ae567-6953-5b7f-2fa1-a65e287b5a9d@grimberg.me> (raw)
In-Reply-To: <20200727020803.GC1129253@T590>

>>>> +void blk_mq_quiesce_queue_async(struct request_queue *q)
>>>> +{
>>>> +	struct blk_mq_hw_ctx *hctx;
>>>> +	unsigned int i;
>>>> +
>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>> +
>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>> +		init_completion(&hctx->rcu_sync.completion);
>>>> +		init_rcu_head(&hctx->rcu_sync.head);
>>>> +		if (hctx->flags & BLK_MQ_F_BLOCKING)
>>>> +			call_srcu(hctx->srcu, &hctx->rcu_sync.head,
>>>> +				wakeme_after_rcu);
>>>> +		else
>>>> +			call_rcu(&hctx->rcu_sync.head,
>>>> +				wakeme_after_rcu);
>>>> +	}
>>>
>>> Looks not necessary to do anything in case of !BLK_MQ_F_BLOCKING, and single
>>> synchronize_rcu() is OK for all hctx during waiting.
>>
>> That's true, but I want a single interface for both. v2 had exactly
>> that, but I decided that this approach is better.
> 
> Not sure one new interface is needed, and one simple way is to:
> 
> 1) call blk_mq_quiesce_queue_nowait() for each request queue
> 
> 2) wait in driver specific way
> 
> Or just wondering why nvme doesn't use set->tag_list to retrieve NS,
> then you may add per-tagset APIs for the waiting.

Because it puts assumptions on how quiesce works, which is something
I'd like to avoid because I think its cleaner, what do others think?
Jens? Christoph?

>> Also, having the driver call a single synchronize_rcu isn't great
> 
> Too many drivers are using synchronize_rcu():
> 
> 	$ git grep -n synchronize_rcu ./drivers/ | wc
> 	    186     524   11384

Wasn't talking about the usage of synchronize_rcu, was referring to
the hidden assumption that quiesce is an rcu driven operation.

>> layering (as quiesce can possibly use a different mechanism in the future).
> 
> What is the different mechanism?

Nothing specific, just said that having drivers assume that quiesce is
synchronizing rcu or srcu is not great.

>> So drivers assumptions like:
>>
>>          /*
>>           * SCSI never enables blk-mq's BLK_MQ_F_BLOCKING flag so
>>           * calling synchronize_rcu() once is enough.
>>           */
>>          WARN_ON_ONCE(shost->tag_set.flags & BLK_MQ_F_BLOCKING);
>>
>>          if (!ret)
>>                  synchronize_rcu();
>>
>> Are not great...
> 
> Both rcu read lock/unlock and synchronize_rcu is global interface, then
> it is reasonable to avoid unnecessary synchronize_rcu().

Again, the fact that quiesce translates to synchronize rcu/srcu based
on the underlying tagset is implicit.

>>>> +}
>>>> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async);
>>>> +
>>>> +void blk_mq_quiesce_queue_async_wait(struct request_queue *q)
>>>> +{
>>>> +	struct blk_mq_hw_ctx *hctx;
>>>> +	unsigned int i;
>>>> +
>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>> +		wait_for_completion(&hctx->rcu_sync.completion);
>>>> +		destroy_rcu_head(&hctx->rcu_sync.head);
>>>> +	}
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async_wait);
>>>> +
>>>>    /**
>>>>     * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
>>>>     * @q: request queue.
>>>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>>>> index 23230c1d031e..5536e434311a 100644
>>>> --- a/include/linux/blk-mq.h
>>>> +++ b/include/linux/blk-mq.h
>>>> @@ -5,6 +5,7 @@
>>>>    #include <linux/blkdev.h>
>>>>    #include <linux/sbitmap.h>
>>>>    #include <linux/srcu.h>
>>>> +#include <linux/rcupdate_wait.h>
>>>>    struct blk_mq_tags;
>>>>    struct blk_flush_queue;
>>>> @@ -170,6 +171,7 @@ struct blk_mq_hw_ctx {
>>>>    	 */
>>>>    	struct list_head	hctx_list;
>>>> +	struct rcu_synchronize	rcu_sync;
>>> The above struct takes at least 5 words, and I'd suggest to avoid it,
>>> and the hctx->srcu should be re-used for waiting BLK_MQ_F_BLOCKING.
>>> Meantime !BLK_MQ_F_BLOCKING doesn't need it.
>>
>> It is at the end and contains exactly what is needed to synchronize. Not
> 
> The sync is simply single global synchronize_rcu(), and why bother to add
> extra >=40bytes for each hctx.

We can use the heap for this, but it will slow down the operation. Not
sure if this is really meaningful given that it is in the end of the
struct...

We cannot use the stack, because we do the wait asynchronously.

>> sure what you mean by reuse hctx->srcu?
> 
> You already reuses hctx->srcu, but not see reason to add extra rcu_synchronize
> to each hctx for just simulating one single synchronize_rcu().

That is my preference, I don't want nvme or other drivers to take a
different route for blocking vs. non-bloking based on