From: Sagi Grimberg <sagi@grimberg.me>
To: Ming Lei <ming.lei@redhat.com>
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Chao Leng <lengchao@huawei.com>
Subject: Re: [PATCH v3 1/2] blk-mq: add async quiesce interface
Date: Mon, 27 Jul 2020 11:36:08 -0700	[thread overview]
Message-ID: <2c2ae567-6953-5b7f-2fa1-a65e287b5a9d@grimberg.me> (raw)
In-Reply-To: <20200727020803.GC1129253@T590>


>>>> +void blk_mq_quiesce_queue_async(struct request_queue *q)
>>>> +{
>>>> +	struct blk_mq_hw_ctx *hctx;
>>>> +	unsigned int i;
>>>> +
>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>> +
>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>> +		init_completion(&hctx->rcu_sync.completion);
>>>> +		init_rcu_head(&hctx->rcu_sync.head);
>>>> +		if (hctx->flags & BLK_MQ_F_BLOCKING)
>>>> +			call_srcu(hctx->srcu, &hctx->rcu_sync.head,
>>>> +				wakeme_after_rcu);
>>>> +		else
>>>> +			call_rcu(&hctx->rcu_sync.head,
>>>> +				wakeme_after_rcu);
>>>> +	}
>>>
>>> It looks unnecessary to do anything in the !BLK_MQ_F_BLOCKING case; a
>>> single synchronize_rcu() covers all such hctxs during the wait.
>>
>> That's true, but I want a single interface for both. v2 had exactly
>> that, but I decided that this approach is better.
> 
> Not sure a new interface is needed; one simple way is to:
> 
> 1) call blk_mq_quiesce_queue_nowait() for each request queue
> 
> 2) wait in a driver-specific way
> 
> Or, I wonder why nvme doesn't use set->tag_list to retrieve the
> namespaces; then you could add per-tagset APIs for the waiting.
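
If I read that suggestion right, it would look roughly like the sketch
below (hypothetical: the helper name and locking are my assumptions,
not code from this series; only blk_mq_quiesce_queue_nowait() and the
tag_list iteration exist today):

	/*
	 * Hypothetical sketch of the per-tagset alternative suggested
	 * above: quiesce every queue in the tagset, then wait once.
	 */
	static void nvme_quiesce_tagset(struct blk_mq_tag_set *set)
	{
		struct request_queue *q;

		mutex_lock(&set->tag_list_lock);
		list_for_each_entry(q, &set->tag_list, tag_set_list)
			blk_mq_quiesce_queue_nowait(q);
		mutex_unlock(&set->tag_list_lock);

		/*
		 * "Wait in a driver-specific way": nvme never sets
		 * BLK_MQ_F_BLOCKING, so one global grace period is enough.
		 */
		synchronize_rcu();
	}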

Because an approach like that bakes assumptions about how quiesce works
into every driver, which I'd like to avoid; I think keeping the wait
behind a block-layer interface is cleaner. What do others think?
Jens? Christoph?
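
For reference, the driver-side usage I'm after is essentially what
patch 2/2 of this series does, as far as I recall; the sketch below
uses field names from memory and may differ slightly from the actual
patch:

	/*
	 * Sketch of the intended nvme usage: fan out the quiesce to all
	 * namespace queues, then wait once per queue, so all the grace
	 * periods overlap instead of running back to back.
	 */
	void nvme_stop_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		down_read(&ctrl->namespaces_rwsem);
		list_for_each_entry(ns, &ctrl->namespaces, list)
			blk_mq_quiesce_queue_async(ns->queue);
		list_for_each_entry(ns, &ctrl->namespaces, list)
			blk_mq_quiesce_queue_async_wait(ns->queue);
		up_read(&ctrl->namespaces_rwsem);
	}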

>> Also, having the driver call a single synchronize_rcu isn't great
> 
> Many drivers already use synchronize_rcu():
> 
> 	$ git grep -n synchronize_rcu ./drivers/ | wc
> 	    186     524   11384

I wasn't talking about the usage of synchronize_rcu(); I was referring
to the hidden assumption that quiesce is an RCU-driven operation.

>> layering (as quiesce can possibly use a different mechanism in the future).
> 
> What is the different mechanism?

Nothing specific; I'm just saying it's not great to have drivers assume
that quiesce means synchronizing RCU or SRCU.

>> So driver assumptions like:
>>
>>          /*
>>           * SCSI never enables blk-mq's BLK_MQ_F_BLOCKING flag so
>>           * calling synchronize_rcu() once is enough.
>>           */
>>          WARN_ON_ONCE(shost->tag_set.flags & BLK_MQ_F_BLOCKING);
>>
>>          if (!ret)
>>                  synchronize_rcu();
>>
>> Are not great...
> 
> Both rcu read lock/unlock and synchronize_rcu() are global interfaces,
> so it is reasonable to avoid an unnecessary synchronize_rcu().

Again, the fact that quiesce translates to synchronizing RCU or SRCU
depending on the underlying tagset is implicit.
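
That implicit translation is exactly what the existing synchronous path
already hides; from memory of the current code (paraphrased, not part
of this patch), blk_mq_quiesce_queue() does roughly:

	void blk_mq_quiesce_queue(struct request_queue *q)
	{
		struct blk_mq_hw_ctx *hctx;
		unsigned int i;
		bool rcu = false;

		blk_mq_quiesce_queue_nowait(q);

		/* the rcu-vs-srcu choice stays inside blk-mq, per hctx */
		queue_for_each_hw_ctx(q, hctx, i) {
			if (hctx->flags & BLK_MQ_F_BLOCKING)
				synchronize_srcu(hctx->srcu);
			else
				rcu = true;
		}
		if (rcu)
			synchronize_rcu();
	}

Callers never see which mechanism is used, and that is the property I
want to keep with the async interface.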

>>>> +}
>>>> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async);
>>>> +
>>>> +void blk_mq_quiesce_queue_async_wait(struct request_queue *q)
>>>> +{
>>>> +	struct blk_mq_hw_ctx *hctx;
>>>> +	unsigned int i;
>>>> +
>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>> +		wait_for_completion(&hctx->rcu_sync.completion);
>>>> +		destroy_rcu_head(&hctx->rcu_sync.head);
>>>> +	}
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_async_wait);
>>>> +
>>>>    /**
>>>>     * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
>>>>     * @q: request queue.
>>>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>>>> index 23230c1d031e..5536e434311a 100644
>>>> --- a/include/linux/blk-mq.h
>>>> +++ b/include/linux/blk-mq.h
>>>> @@ -5,6 +5,7 @@
>>>>    #include <linux/blkdev.h>
>>>>    #include <linux/sbitmap.h>
>>>>    #include <linux/srcu.h>
>>>> +#include <linux/rcupdate_wait.h>
>>>>    struct blk_mq_tags;
>>>>    struct blk_flush_queue;
>>>> @@ -170,6 +171,7 @@ struct blk_mq_hw_ctx {
>>>>    	 */
>>>>    	struct list_head	hctx_list;
>>>> +	struct rcu_synchronize	rcu_sync;
>>> The above struct takes at least 5 words, and I'd suggest avoiding it;
>>> hctx->srcu should be reused for waiting in the BLK_MQ_F_BLOCKING case,
>>> while !BLK_MQ_F_BLOCKING doesn't need it at all.
>>
>> It is at the end and contains exactly what is needed to synchronize. Not
> 
> The sync is simply a single global synchronize_rcu(), so why bother
> adding an extra >=40 bytes to each hctx?

We could allocate it on the heap instead, but that slows down the
operation. Not sure the extra size really matters given that the field
sits at the end of the struct...

We cannot use the stack, because we do the wait asynchronously.
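
A heap-based variant would look something like this sketch
(hypothetical: the pointer member, function name, and error handling
are my assumptions), and note that quiesce would suddenly grow a
failure mode:

	/*
	 * Hypothetical: allocate the synchronization state per hctx
	 * instead of embedding it in struct blk_mq_hw_ctx.
	 */
	static int blk_mq_quiesce_queue_async_heap(struct request_queue *q)
	{
		struct blk_mq_hw_ctx *hctx;
		unsigned int i;

		blk_mq_quiesce_queue_nowait(q);

		queue_for_each_hw_ctx(q, hctx, i) {
			struct rcu_synchronize *rs;

			rs = kmalloc(sizeof(*rs), GFP_KERNEL);
			if (!rs)
				return -ENOMEM;	/* new failure mode */
			init_completion(&rs->completion);
			init_rcu_head(&rs->head);
			hctx->rcu_sync = rs;	/* now a pointer member */
			if (hctx->flags & BLK_MQ_F_BLOCKING)
				call_srcu(hctx->srcu, &rs->head,
						wakeme_after_rcu);
			else
				call_rcu(&rs->head, wakeme_after_rcu);
		}
		return 0;
	}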

>> sure what you mean by reuse hctx->srcu?
> 
> You already reuse hctx->srcu, but I don't see a reason to add an extra
> rcu_synchronize to each hctx just to simulate a single
> synchronize_rcu().

That is my preference; I don't want nvme or other drivers to take a
different route for blocking vs. non-blocking based on

Thread overview:
2020-07-26  0:22 [PATCH v3 0/2] improve quiesce time for large amount of namespaces Sagi Grimberg
2020-07-26  0:23 ` [PATCH v3 1/2] blk-mq: add async quiesce interface Sagi Grimberg
2020-07-26  9:31   ` Ming Lei
2020-07-26 16:27     ` Sagi Grimberg
2020-07-27  2:08       ` Ming Lei
2020-07-27  3:33         ` Chao Leng
2020-07-27  3:50           ` Ming Lei
2020-07-27  5:55             ` Chao Leng
2020-07-27  6:32               ` Ming Lei
2020-07-27 18:40                 ` Sagi Grimberg
2020-07-27 18:38             ` Sagi Grimberg
2020-07-27 18:36         ` Sagi Grimberg [this message]
2020-07-27 20:37           ` Jens Axboe
2020-07-27 21:00             ` Sagi Grimberg
2020-07-27 21:05               ` Jens Axboe
2020-07-27 21:21                 ` Keith Busch
2020-07-27 21:30                   ` Jens Axboe
2020-07-28  1:09               ` Ming Lei
2020-07-26  0:23 ` [PATCH v3 2/2] nvme: improve quiesce time for large amount of namespaces Sagi Grimberg