* [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Jeff Moyer @ 2015-10-23 14:57 UTC
  To: axboe, Jason Luo; +Cc: linux-kernel, Guru Anbalagane, Feng Jin, tj, Ming Lei

Hi,

Zhangqing Luo reported long boot times on a system with thousands of
LUNs when scsi-mq was enabled.  He narrowed the problem down to
blk_mq_add_queue_tag_set, where every queue is frozen in order to set
the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
added before it in sequence, which involves waiting for an RCU grace
period for each one.  We don't need to do this.  After the second queue
is added, only new queues need to be initialized with the shared tag.
We can do that by percolating the flag up to the blk_mq_tag_set, since
new hctxs already inherit the tag_set's flags.  That's what the below
patch does.
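
For a rough sense of scale, here is a back-of-the-envelope model of the
old behavior (standalone C, purely illustrative; the LUN count below is
a made-up example, not a number from Zhangqing's report):

#include <stdio.h>

int main(void)
{
	unsigned long nr_luns = 4096;	/* hypothetical LUN count */
	unsigned long freezes = 0;
	unsigned long k;

	/* queue k freezes the k-1 queues added before it */
	for (k = 1; k <= nr_luns; k++)
		freezes += k - 1;

	/* prints "8386560 total queue freezes"; each freeze waits
	 * for at least one RCU grace period */
	printf("%lu total queue freezes\n", freezes);
	return 0;
}

With the flag percolated up to the tag set, every queue is touched only
once, when the set first transitions to shared, so the total work is
linear in the number of LUNs instead of quadratic.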

The re-check in blk_mq_add_queue_tag_set is done because we could have
two queues racing to be the second queue added for the tag set.  In such
a case, it's possible that not all of the hctxs within the loser will be
initialized.  Because this initialization is done sequentially, we can
simply check for the BLK_MQ_F_TAG_SHARED flag in the first hctx.

This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
support for shared tags maps).

Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7785ae9..8b4c484 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1860,27 +1860,26 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
+static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct request_queue *q;
-	bool shared;
 	int i;
 
-	if (set->tag_list.next == set->tag_list.prev)
-		shared = false;
-	else
-		shared = true;
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (shared)
+			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+		else
+			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+	}
+}
+
+static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
+{
+	struct request_queue *q;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
-
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (shared)
-				hctx->flags |= BLK_MQ_F_TAG_SHARED;
-			else
-				hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
-		}
+		queue_set_hctx_shared(q, shared);
 		blk_mq_unfreeze_queue(q);
 	}
 }
@@ -1891,7 +1890,13 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 
 	mutex_lock(&set->tag_list_lock);
 	list_del_init(&q->tag_set_list);
-	blk_mq_update_tag_set_depth(set);
+
+	if (set->tag_list.next == set->tag_list.prev) {
+		/* just transitioned to unshared */
+		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		/* update existing queue */
+		blk_mq_update_tag_set_depth(set, false);
+	}
 	mutex_unlock(&set->tag_list_lock);
 }
 
@@ -1902,7 +1907,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 
 	mutex_lock(&set->tag_list_lock);
 	list_add_tail(&q->tag_set_list, &set->tag_list);
-	blk_mq_update_tag_set_depth(set);
+
+	if (set->tag_list.next != set->tag_list.prev) {
+		/*
+		 * Only update the tag set state if the state has
+		 * actually changed.
+		 */
+		if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
+			/* just transitioned to shared tags */
+			set->flags |= BLK_MQ_F_TAG_SHARED;
+			blk_mq_update_tag_set_depth(set, true);
+		} else {
+			/* ensure we didn't race with another addition */
+			struct blk_mq_hw_ctx *hctx = queue_first_hw_ctx(q);
+			if ((hctx->flags & BLK_MQ_F_TAG_SHARED) !=
+							BLK_MQ_F_TAG_SHARED)
+				queue_set_hctx_shared(q, true);
+		}
+	}
 	mutex_unlock(&set->tag_list_lock);
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5e7d43a..12ffc40 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -254,6 +254,9 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
 	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
 	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
 
+#define queue_first_hw_ctx(q)						\
+	(q)->queue_hw_ctx[0]
+
 #define blk_ctx_sum(q, sum)						\
 ({									\
 	struct blk_mq_ctx *__x;						\


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: jason @ 2015-10-28  9:45 UTC
  To: Jeff Moyer, axboe; +Cc: linux-kernel, Guru Anbalagane, Feng Jin, tj, Ming Lei

Hi,
     Does anyone have time to look into this patch?  I've tested it, and it works.

/Thanks.


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Ming Lei @ 2015-10-29  1:19 UTC
  To: Jeff Moyer
  Cc: Jens Axboe, Jason Luo, Linux Kernel Mailing List,
	Guru Anbalagane, Feng Jin, Tejun Heo

On Fri, Oct 23, 2015 at 10:57 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi,
>
> Zhangqing Luo reported long boot times on a system with thousands of
> LUNs when scsi-mq was enabled.  He narrowed the problem down to
> blk_mq_add_queue_tag_set, where every queue is frozen in order to set
> the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
> added before it in sequence, which involves waiting for an RCU grace
> period for each one.  We don't need to do this.  After the second queue
> is added, only new queues need to be initialized with the shared tag.
> We can do that by percolating the flag up to the blk_mq_tag_set, since
> new hctxs already inherit the tag_set's flags.  That's what the below
> patch does.

Looks like the idea is correct.

>
> The re-check in blk_mq_add_queue_tag_set is done because we could have
> two queues racing to be the second queue added for the tag set.  In such

That isn't possible because of set->tag_list_lock.

> a case, it's possible that not all of the hctxs within the loser will be
> initialized.  Because this initialization is done sequentially, we can
> simply check for the BLK_MQ_F_TAG_SHARED flag in the first hctx.
>
> This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
> support for shared tags maps).
>
> Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7785ae9..8b4c484 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1860,27 +1860,26 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>         }
>  }
>
> -static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
> +static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>  {
>         struct blk_mq_hw_ctx *hctx;
> -       struct request_queue *q;
> -       bool shared;
>         int i;
>
> -       if (set->tag_list.next == set->tag_list.prev)
> -               shared = false;
> -       else
> -               shared = true;
> +       queue_for_each_hw_ctx(q, hctx, i) {
> +               if (shared)
> +                       hctx->flags |= BLK_MQ_F_TAG_SHARED;
> +               else
> +                       hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
> +       }
> +}
> +
> +static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
> +{
> +       struct request_queue *q;
>
>         list_for_each_entry(q, &set->tag_list, tag_set_list) {
>                 blk_mq_freeze_queue(q);
> -
> -               queue_for_each_hw_ctx(q, hctx, i) {
> -                       if (shared)
> -                               hctx->flags |= BLK_MQ_F_TAG_SHARED;
> -                       else
> -                               hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
> -               }
> +               queue_set_hctx_shared(q, shared);
>                 blk_mq_unfreeze_queue(q);
>         }
>  }
> @@ -1891,7 +1890,13 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>
>         mutex_lock(&set->tag_list_lock);
>         list_del_init(&q->tag_set_list);
> -       blk_mq_update_tag_set_depth(set);
> +
> +       if (set->tag_list.next == set->tag_list.prev) {
> +               /* just transitioned to unshared */
> +               set->flags &= ~BLK_MQ_F_TAG_SHARED;
> +               /* update existing queue */
> +               blk_mq_update_tag_set_depth(set, false);
> +       }
>         mutex_unlock(&set->tag_list_lock);
>  }
>
> @@ -1902,7 +1907,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>
>         mutex_lock(&set->tag_list_lock);
>         list_add_tail(&q->tag_set_list, &set->tag_list);
> -       blk_mq_update_tag_set_depth(set);
> +
> +       if (set->tag_list.next != set->tag_list.prev) {
> +               /*
> +                * Only update the tag set state if the state has
> +                * actually changed.
> +                */
> +               if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
> +                       /* just transitioned to shared tags */
> +                       set->flags |= BLK_MQ_F_TAG_SHARED;
> +                       blk_mq_update_tag_set_depth(set, true);
> +               } else {
> +                       /* ensure we didn't race with another addition */
> +                       struct blk_mq_hw_ctx *hctx = queue_first_hw_ctx(q);
> +                       if ((hctx->flags & BLK_MQ_F_TAG_SHARED) !=
> +                                                       BLK_MQ_F_TAG_SHARED)
> +                               queue_set_hctx_shared(q, true);

The above check isn't necessary because the queue can't be up now, so
it is OK to just call 'queue_set_hctx_shared(q, true)' directly.

> +               }
> +       }
>         mutex_unlock(&set->tag_list_lock);
>  }
>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 5e7d43a..12ffc40 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -254,6 +254,9 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
>         for ((i) = 0; (i) < (hctx)->nr_ctx &&                           \
>              ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
>
> +#define queue_first_hw_ctx(q)                                          \
> +       (q)->queue_hw_ctx[0]
> +
>  #define blk_ctx_sum(q, sum)                                            \
>  ({                                                                     \
>         struct blk_mq_ctx *__x;                                         \



-- 
Ming Lei


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Jeff Moyer @ 2015-10-29 14:20 UTC
  To: Ming Lei
  Cc: Jens Axboe, Jason Luo, Linux Kernel Mailing List,
	Guru Anbalagane, Feng Jin, Tejun Heo

Hi Ming,

Thanks for taking a look.

Ming Lei <tom.leiming@gmail.com> writes:

> On Fri, Oct 23, 2015 at 10:57 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Hi,
>>
>> Zhangqing Luo reported long boot times on a system with thousands of
>> LUNs when scsi-mq was enabled.  He narrowed the problem down to
>> blk_mq_add_queue_tag_set, where every queue is frozen in order to set
>> the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
>> added before it in sequence, which involves waiting for an RCU grace
>> period for each one.  We don't need to do this.  After the second queue
>> is added, only new queues need to be initialized with the shared tag.
>> We can do that by percolating the flag up to the blk_mq_tag_set, since
>> new hctxs already inherit the tag_set's flags.  That's what the below
>> patch does.
>
> Looks like the idea is correct.
>
>>
>> The re-check in blk_mq_add_queue_tag_set is done because we could have
>> two queues racing to be the second queue added for the tag set.  In such
>
> That isn't possible because of set->tag_list_lock.

Yes, it is possible.  blk_mq_init_allocated_queue calls
blk_mq_init_hw_queues without the set->tag_list_lock.  Consider a single
queue is present for the tag_set, and thus the BLK_MQ_F_TAG_SHARED flag
is not set.  Then, two more async scan events are running on separate
CPUs.  Each will go through this loop in blk_mq_init_hw_queues:

queue_for_each_hw_ctx(q, hctx, i) {
        if (blk_mq_init_hctx(q, set, hctx, i))
                break;
}

and blk_mq_init_hctx copies the flags:
        hctx->flags = set->flags;

At this point, neither queue's hctxs have the shared flag set.  Next,
both will race to get the tag_list_lock for the tag_set inside of
blk_mq_add_queue_tag_set.  Only one will win and mark the initial
queue's hctx's as shared (as well as its own).  Then, when the second
queue gets the lock, it will find that the shared tag is already set,
and assume that it doesn't have to do anything.  But, because its
hctx->flags were copied before the BLK_MQ_F_TAG_SHARED flag was set, it
now will operate as if it's the only queue in the tag set.
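
Schematically, the interleaving looks like this (my sketch, not code
from the tree):

/*
 * CPU A (adds queue A)              CPU B (adds queue B)
 * ------------------------------    ------------------------------
 * blk_mq_init_hctx():
 *   hctx->flags = set->flags;
 *   (TAG_SHARED not set yet)
 *                                   blk_mq_init_hctx():
 *                                     hctx->flags = set->flags;
 *                                     (TAG_SHARED not set yet)
 * blk_mq_add_queue_tag_set():
 *   mutex_lock(&set->tag_list_lock);
 *   list now has two queues, so
 *   set TAG_SHARED in set->flags
 *   and in the hctxs of queue A
 *   and of the original queue
 *   mutex_unlock(&set->tag_list_lock);
 *                                   blk_mq_add_queue_tag_set():
 *                                     mutex_lock(&set->tag_list_lock);
 *                                     TAG_SHARED already set in
 *                                     set->flags; without the
 *                                     re-check, queue B's hctxs
 *                                     would stay unshared
 */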

So, yes, there is a race, and that extra check is necessary.  Hopefully
I've explained it well enough.  If not, let me know and I'll try again.

Cheers,
Jeff


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Ming Lei @ 2015-10-29 14:55 UTC
  To: Jeff Moyer
  Cc: Jens Axboe, Jason Luo, Linux Kernel Mailing List,
	Guru Anbalagane, Feng Jin, Tejun Heo

On Thu, Oct 29, 2015 at 10:20 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi Ming,
>
> Thanks for taking a look.
>
> Ming Lei <tom.leiming@gmail.com> writes:
>
>> On Fri, Oct 23, 2015 at 10:57 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Hi,
>>>
>>> Zhangqing Luo reported long boot times on a system with thousands of
>>> LUNs when scsi-mq was enabled.  He narrowed the problem down to
>>> blk_mq_add_queue_tag_set, where every queue is frozen in order to set
>>> the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
>>> added before it in sequence, which involves waiting for an RCU grace
>>> period for each one.  We don't need to do this.  After the second queue
>>> is added, only new queues need to be initialized with the shared tag.
>>> We can do that by percolating the flag up to the blk_mq_tag_set, since
>>> new hctxs already inherit the tag_set's flags.  That's what the below
>>> patch does.
>>
>> Looks like the idea is correct.
>>
>>>
>>> The re-check in blk_mq_add_queue_tag_set is done because we could have
>>> two queues racing to be the second queue added for the tag set.  In such
>>
>> That isn't possible because of set->tag_list_lock.
>
> Yes, it is possible.  blk_mq_init_allocated_queue calls
> blk_mq_init_hw_queues without the set->tag_list_lock.  Suppose a single
> queue is present for the tag_set, and thus the BLK_MQ_F_TAG_SHARED flag
> is not set.  Then, two more async scan events are running on separate
> CPUs.  Each will go through this loop in blk_mq_init_hw_queues:
>
> queue_for_each_hw_ctx(q, hctx, i) {
>         if (blk_mq_init_hctx(q, set, hctx, i))
>                 break;
> }
>
> and blk_mq_init_hctx copies the flags:
>         hctx->flags = set->flags;

Looks like we should have cleared the TAG_SHARED flag during
blk_mq_init_hctx() and just let blk_mq_update_tag_set_depth()
deal with it; then the race can be avoided.

>
> At this point, neither queue's hctxs have the shared flag set.  Next,
> both will race to get the tag_list_lock for the tag_set inside of
> blk_mq_add_queue_tag_set.  Only one will win and mark the initial
> queue's hctx's as shared (as well as its own).  Then, when the second
> queue gets the lock, it will find that the shared tag is already set,
> and assume that it doesn't have to do anything.  But, because its

As I suggested, we can always set it when TAG_SHARED
is set in set->flags, because we know the queue isn't ready yet at that
time.

> hctx->flags were copied before the BLK_MQ_F_TAG_SHARED flag was set, it
> now will operate as if it's the only queue in the tag set.

The shared flag should be set here because the flag has been set in set->flags,
so there's no such issue, is there?


-- 
Ming Lei


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Jeff Moyer @ 2015-10-29 15:18 UTC
  To: Ming Lei
  Cc: Jens Axboe, Jason Luo, Linux Kernel Mailing List,
	Guru Anbalagane, Feng Jin, Tejun Heo

Ming Lei <tom.leiming@gmail.com> writes:

> Looks like we should have cleared the TAG_SHARED flag during
> blk_mq_init_hctx() and just let blk_mq_update_tag_set_depth()
> deal with it; then the race can be avoided.

The whole point of the patch set is to propagate the flag up to the tag
set so that we can avoid iterating all hctxs in all queues.

>> At this point, neither queue's hctxs have the shared flag set.  Next,
>> both will race to get the tag_list_lock for the tag_set inside of
>> blk_mq_add_queue_tag_set.  Only one will win and mark the initial
>> queue's hctx's as shared (as well as its own).  Then, when the second
>> queue gets the lock, it will find that the shared tag is already set,
>> and assume that it doesn't have to do anything.  But, because its
>
> As I suggested, we can always set it when TAG_SHARED
> is set in set->flags, because we know the queue isn't ready yet at that
> time.

I see.  You are suggesting that I just get rid of the conditional.  We
could do that, but you will get the exact same result as what I posted.
I'm not sure why you would prefer that over the explicit check.  With
the patch I posted, we can avoid walking the list of hctxs a second
time.

Anyway, here's a patch that I think implements your suggestion.  I
prefer the original, but this should achieve the same exact result.
Let me know if I've misunderstood.

Cheers,
Jeff

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85f0143..7bf717a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1860,27 +1860,26 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
+static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct request_queue *q;
-	bool shared;
 	int i;
 
-	if (set->tag_list.next == set->tag_list.prev)
-		shared = false;
-	else
-		shared = true;
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (shared)
+			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+		else
+			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+	}
+}
+
+static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
+{
+	struct request_queue *q;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
-
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (shared)
-				hctx->flags |= BLK_MQ_F_TAG_SHARED;
-			else
-				hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
-		}
+		queue_set_hctx_shared(q, shared);
 		blk_mq_unfreeze_queue(q);
 	}
 }
@@ -1891,7 +1890,13 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 
 	mutex_lock(&set->tag_list_lock);
 	list_del_init(&q->tag_set_list);
-	blk_mq_update_tag_set_depth(set);
+
+	if (set->tag_list.next == set->tag_list.prev) {
+		/* just transitioned to unshared */
+		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		/* update existing queue */
+		blk_mq_update_tag_set_depth(set, false);
+	}
 	mutex_unlock(&set->tag_list_lock);
 }
 
@@ -1902,7 +1907,21 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 
 	mutex_lock(&set->tag_list_lock);
 	list_add_tail(&q->tag_set_list, &set->tag_list);
-	blk_mq_update_tag_set_depth(set);
+
+	if (set->tag_list.next != set->tag_list.prev) {
+		/*
+		 * Only update the tag set state if the state has
+		 * actually changed.
+		 */
+		if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
+			/* just transitioned to shared tags */
+			set->flags |= BLK_MQ_F_TAG_SHARED;
+			blk_mq_update_tag_set_depth(set, true);
+		} else {
+			/* ensure we didn't race with another addition */
+			queue_set_hctx_shared(q, true);
+		}
+	}
 	mutex_unlock(&set->tag_list_lock);
 }
 


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Ming Lei @ 2015-10-30  0:37 UTC
  To: Jeff Moyer
  Cc: Jens Axboe, Jason Luo, Linux Kernel Mailing List,
	Guru Anbalagane, Feng Jin, Tejun Heo

On Thu, Oct 29, 2015 at 11:18 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Ming Lei <tom.leiming@gmail.com> writes:
>
>> Looks like we should have cleared the TAG_SHARED flag during
>> blk_mq_init_hctx() and just let blk_mq_update_tag_set_depth()
>> deal with it; then the race can be avoided.
>
> The whole point of the patch set is to propagate the flag up to the tag
> set so that we can avoid iterating all hctxs in all queues.

Yes, the idea is correct; the difference is that I don't see any
race of the kind your original patch tried to address.

>
>>> At this point, neither queue's hctxs have the shared flag set.  Next,
>>> both will race to get the tag_list_lock for the tag_set inside of
>>> blk_mq_add_queue_tag_set.  Only one will win and mark the initial
>>> queue's hctx's as shared (as well as its own).  Then, when the second
>>> queue gets the lock, it will find that the shared tag is already set,
>>> and assume that it doesn't have to do anything.  But, because its
>>
>> As I suggested, we can always set it when TAG_SHARED
>> is set in set->flags, because we know the queue isn't ready yet at that
>> time.
>
> I see.  You are suggesting that I just get rid of the conditional.  We
> could do that, but you will get the exact same result as what I posted.
> I'm not sure why you would prefer that over the explicit check.  With
> the patch I posted, we can avoid walking the list of hctxs a second
> time.
>
> Anyway, here's a patch that I think implements your suggestion.  I
> prefer the original, but this should achieve the same exact result.
> Let me know if I've misunderstood.

Yes, this one is better; the big difference is that there isn't a
so-called race, and the new one is cleaner.

>
> Cheers,
> Jeff
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 85f0143..7bf717a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1860,27 +1860,26 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>         }
>  }

The following change should be put into blk_mq_init_hctx() too.

     hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;

and the idea is that we always figure out the shared flag in the
following blk_mq_update_tag_set_depth().
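
In diff form, something like this (my sketch; the hunk context is
approximate, but the removed line is the one quoted from
blk_mq_init_hctx() earlier in the thread):

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ static int blk_mq_init_hctx(struct request_queue *q,
-	hctx->flags = set->flags;
+	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;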

>
> -static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
> +static void queue_set_hctx_shared(struct request_queue *q, bool shared)
>  {
>         struct blk_mq_hw_ctx *hctx;
> -       struct request_queue *q;
> -       bool shared;
>         int i;
>
> -       if (set->tag_list.next == set->tag_list.prev)
> -               shared = false;
> -       else
> -               shared = true;
> +       queue_for_each_hw_ctx(q, hctx, i) {
> +               if (shared)
> +                       hctx->flags |= BLK_MQ_F_TAG_SHARED;
> +               else
> +                       hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
> +       }
> +}
> +
> +static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
> +{
> +       struct request_queue *q;
>
>         list_for_each_entry(q, &set->tag_list, tag_set_list) {
>                 blk_mq_freeze_queue(q);
> -
> -               queue_for_each_hw_ctx(q, hctx, i) {
> -                       if (shared)
> -                               hctx->flags |= BLK_MQ_F_TAG_SHARED;
> -                       else
> -                               hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
> -               }
> +               queue_set_hctx_shared(q, shared);
>                 blk_mq_unfreeze_queue(q);
>         }
>  }
> @@ -1891,7 +1890,13 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>
>         mutex_lock(&set->tag_list_lock);
>         list_del_init(&q->tag_set_list);
> -       blk_mq_update_tag_set_depth(set);
> +
> +       if (set->tag_list.next == set->tag_list.prev) {
> +               /* just transitioned to unshared */
> +               set->flags &= ~BLK_MQ_F_TAG_SHARED;
> +               /* update existing queue */
> +               blk_mq_update_tag_set_depth(set, false);
> +       }
>         mutex_unlock(&set->tag_list_lock);
>  }
>
> @@ -1902,7 +1907,21 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>
>         mutex_lock(&set->tag_list_lock);
>         list_add_tail(&q->tag_set_list, &set->tag_list);
> -       blk_mq_update_tag_set_depth(set);
> +
> +       if (set->tag_list.next != set->tag_list.prev) {
> +               /*
> +                * Only update the tag set state if the state has
> +                * actually changed.
> +                */
> +               if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
> +                       /* just transitioned to shared tags */
> +                       set->flags |= BLK_MQ_F_TAG_SHARED;
> +                       blk_mq_update_tag_set_depth(set, true);
> +               } else {
> +                       /* ensure we didn't race with another addition */

The above comment should be removed; I don't see any race now.
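
i.e. (my sketch of the resulting hunk):

 		} else {
-			/* ensure we didn't race with another addition */
 			queue_set_hctx_shared(q, true);
 		}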

> +                       queue_set_hctx_shared(q, true);
> +               }
> +       }
>         mutex_unlock(&set->tag_list_lock);
>  }
>

This one is a bit cleaner than the original one, and the
queue_first_hw_ctx() trick can be avoided.

-- 
Ming Lei


* Re: [patch] blk-mq: avoid excessive boot delays with large lun counts
From: Jeff Moyer @ 2015-10-30 15:55 UTC
  To: Ming Lei
  Cc: Jens Axboe, Jason Luo, Linux Kernel Mailing List,
	Guru Anbalagane, Feng Jin, Tejun Heo

Ming Lei <tom.leiming@gmail.com> writes:

> The following change should be put into blk_mq_init_hctx() too.
>
>      hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
>
> and the idea is that we always figure out the shared flag in the
> following blk_mq_update_tag_set_depth().

OK, this is the part I was missing from your earlier mails.  That makes
sense; I'll update the patch and repost.

Thanks, Ming!

-Jeff

