linux-block.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>,
	linux-block@vger.kernel.org, Ming Lin <mlin@kernel.org>,
	Chao Leng <lengchao@huawei.com>
Subject: Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface
Date: Mon, 27 Jul 2020 20:32:53 -0600
Message-ID: <baede23a-94c1-1494-bcca-964e1396f253@kernel.dk>
In-Reply-To: <20200728022802.GC1305646@T590>

On 7/27/20 8:28 PM, Ming Lei wrote:
> On Mon, Jul 27, 2020 at 08:23:15PM -0600, Jens Axboe wrote:
>> On 7/27/20 8:17 PM, Ming Lei wrote:
>>> On Mon, Jul 27, 2020 at 07:51:16PM -0600, Jens Axboe wrote:
>>>> On 7/27/20 7:40 PM, Ming Lei wrote:
>>>>> On Mon, Jul 27, 2020 at 04:10:21PM -0700, Sagi Grimberg wrote:
>>>>>> Drivers with shared tagsets may need to quiesce a potentially large
>>>>>> number of request queues that all share a single tagset (e.g. nvme).
>>>>>> Add an interface to quiesce all the queues on a given tagset. This
>>>>>> interface is useful because it can speed up the quiesce by doing it
>>>>>> in parallel.
>>>>>>
>>>>>> For tagsets that have BLK_MQ_F_BLOCKING set, we issue call_srcu for
>>>>>> all hctxs in parallel, such that all of them wait for the same RCU
>>>>>> grace period with a per-hctx heap-allocated rcu_synchronize. For
>>>>>> tagsets that don't have BLK_MQ_F_BLOCKING set, a single
>>>>>> synchronize_rcu is sufficient.
>>>>>>
>>>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>>>> ---
>>>>>>  block/blk-mq.c         | 66 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>  include/linux/blk-mq.h |  4 +++
>>>>>>  2 files changed, 70 insertions(+)
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index abcf590f6238..c37e37354330 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -209,6 +209,42 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>>>>>>  
>>>>>> +static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
>>>>>> +{
>>>>>> +	struct blk_mq_hw_ctx *hctx;
>>>>>> +	unsigned int i;
>>>>>> +
>>>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>>>> +
>>>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>>>> +		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
>>>>>> +		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
>>>>>> +		if (!hctx->rcu_sync)
>>>>>> +			continue;
>>>>>
>>>>> This approach of quiescing/unquiescing the whole tagset is a good
>>>>> abstraction.
>>>>>
>>>>> Just one more thing: please allocate an rcu_sync array instead,
>>>>> because the hctx isn't supposed to store scratch data like this.
>>>>
>>>> I'd be all for not stuffing this in the hctx, but how would that work?
>>>> The only thing I can think of that would work reliably is batching the
>>>> queue+wait into units of N. We could potentially have many thousands of
>>>> queues, and it could get iffy (and/or unreliable) in terms of allocation
>>>> size. Looks like rcu_synchronize is 48 bytes on my local install, and it
>>>> doesn't take a lot of devices at current CPU counts to make an alloc
>>>> covering all of it huge. Say 64 threads and 32 devices: then we're
>>>> already at 64*32*48 bytes, which is an order-5 allocation. Not
>>>> friendly, and not going to be reliable when you need it. And if we
>>>> start batching in reasonable counts, then we're _almost_ back to doing
>>>> a queue or two at a time... 32 * 48 is 1536 bytes, so we could only do
>>>> two at a time with single page allocations.
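
(Expanding my own math above; the 48 bytes is just what my local build
shows for sizeof(struct rcu_synchronize), so treat it as illustrative:

	64 hctxs/queue * 32 queues * 48 bytes = 98304 bytes = 24 pages,
	which rounds up to an order-5 allocation: 32 contiguous pages,
	or 128KB.

And per order-0 page: 4096 / (32 * 48) = 2 queues' worth of entries.)
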
>>>
>>> We could convert that to order-0 allocations with one extra indirect array.
>>
>> I guess that could work, and it would just be one extra alloc + free if
>> we still retain the batch. That'd take it to 16 devices (at 32 CPUs) per
>> round, potentially way less of course if we have more CPUs. So still
>> somewhat limiting, rather than doing it all at once.
> 
> With the approach used in blk_mq_alloc_rqs(), each allocated page can be
> added to a list, so the indirect array can be dropped. That makes it
> possible to handle any number of queues/devices, since each allocation
> is just a single page done only when it is actually needed, and no
> pre-calculation is required.

As long as we watch the complexity, I don't think we need to go overboard
here at the risk of adding issues in the failure path. But yes, we could
use the same trick I did in blk_mq_alloc_rqs() and just allocate pages as
we go.
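
Roughly along these lines, as an untested sketch: carve the
rcu_synchronize entries out of order-0 pages that are chained on a
list, so there's no indirect array at all. The quiesce_page naming
below is made up for illustration and is not from the patch:

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/rcupdate_wait.h>	/* struct rcu_synchronize */

/* page header; entries are carved from the rest of the page */
struct quiesce_page {
	struct list_head lru;
	unsigned int used;
};

static struct rcu_synchronize *quiesce_alloc_one(struct list_head *pages)
{
	const unsigned int per_page =
		(PAGE_SIZE - sizeof(struct quiesce_page)) /
		sizeof(struct rcu_synchronize);
	struct quiesce_page *qp;

	qp = list_first_entry_or_null(pages, struct quiesce_page, lru);
	if (!qp || qp->used == per_page) {
		/* order-0 only; if this fails, the caller can fall back
		 * to a plain synchronous quiesce for that queue */
		qp = (void *)__get_free_page(GFP_KERNEL);
		if (!qp)
			return NULL;
		qp->used = 0;
		list_add(&qp->lru, pages);
	}
	return (struct rcu_synchronize *)(qp + 1) + qp->used++;
}

The caller declares the list with LIST_HEAD(), grabs one entry per
hctx to hand to call_srcu(), waits for all the completions, and then
frees everything with a single walk of the page list.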

-- 
Jens Axboe



Thread overview: 40+ messages
2020-07-27 23:10 [PATCH v5 0/2] improve nvme quiesce time for large amount of namespaces Sagi Grimberg
2020-07-27 23:10 ` [PATCH v5 1/2] blk-mq: add tagset quiesce interface Sagi Grimberg
2020-07-27 23:32   ` Keith Busch
2020-07-28  0:12     ` Sagi Grimberg
2020-07-28  1:40   ` Ming Lei
2020-07-28  1:51     ` Jens Axboe
2020-07-28  2:17       ` Ming Lei
2020-07-28  2:23         ` Jens Axboe
2020-07-28  2:28           ` Ming Lei
2020-07-28  2:32             ` Jens Axboe [this message]
2020-07-28  3:29               ` Sagi Grimberg
2020-07-28  3:25     ` Sagi Grimberg
2020-07-28  7:18   ` Christoph Hellwig
2020-07-28  7:48     ` Sagi Grimberg
2020-07-28  9:16     ` Ming Lei
2020-07-28  9:24       ` Sagi Grimberg
2020-07-28  9:33         ` Ming Lei
2020-07-28  9:37           ` Sagi Grimberg
2020-07-28  9:43             ` Sagi Grimberg
2020-07-28 10:10               ` Ming Lei
2020-07-28 10:57                 ` Christoph Hellwig
2020-07-28 14:13                 ` Paul E. McKenney
2020-07-28 10:58             ` Christoph Hellwig
2020-07-28 16:25               ` Sagi Grimberg
2020-07-28 13:54         ` Paul E. McKenney
2020-07-28 23:46           ` Sagi Grimberg
2020-07-29  0:31             ` Paul E. McKenney
2020-07-29  0:43               ` Sagi Grimberg
2020-07-29  0:59                 ` Keith Busch
2020-07-29  4:39                   ` Sagi Grimberg
2020-08-07  9:04                     ` Chao Leng
2020-08-07  9:24                       ` Ming Lei
2020-08-07  9:35                         ` Chao Leng
2020-07-29  4:10                 ` Paul E. McKenney
2020-07-29  4:37                   ` Sagi Grimberg
2020-07-27 23:10 ` [PATCH v5 2/2] nvme: use blk_mq_[un]quiesce_tagset Sagi Grimberg
2020-07-28  0:54   ` Sagi Grimberg
2020-07-28  3:21     ` Chao Leng
2020-07-28  3:34       ` Sagi Grimberg
2020-07-28  3:51         ` Chao Leng
