* [PATCH V2 0/2] fixes for the updating nr_hw_queues
@ 2018-08-17  3:54 Jianchao Wang
  2018-08-17  3:54 ` [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping Jianchao Wang
  2018-08-17  3:54 ` [PATCH V2 2/2] blk-mq: sync the update nr_hw_queues with part_in_flight Jianchao Wang
  0 siblings, 2 replies; 11+ messages in thread
From: Jianchao Wang @ 2018-08-17  3:54 UTC (permalink / raw)
  To: axboe; +Cc: tom.leiming, bart.vanassche, keith.busch, linux-block, linux-kernel

Hi Jens

Two fixes for updating nr_hw_queues.

The first patch fixes the following scenario:
an io scheduler (kyber) depends on the mapping between ctx and hctx.
When nr_hw_queues is updated, the io scheduler's init_hctx is invoked
before the mapping is adapted correctly, which can cause a panic in
kyber.
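
For illustration, a minimal sketch (not the actual kyber code) of how an
io scheduler's init_hctx hook can depend on that mapping:

/*
 * Illustrative only: per-ctx data is sized from the ctx<->hctx mapping.
 * If this runs before blk_mq_map_swqueue has adapted the mapping,
 * hctx->nr_ctx is stale and later per-ctx accesses run off the end of
 * the array.
 */
static int sched_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
	hctx->sched_data = kmalloc_array_node(hctx->nr_ctx, sizeof(void *),
					      GFP_KERNEL, hctx->numa_node);
	if (!hctx->sched_data)
		return -ENOMEM;
	return 0;
}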

The second patch fixes the following scenario:
part_in_flight/rw invokes blk_mq_in_flight/rw to account for the
in-flight requests, accessing queue_hw_ctx and nr_hw_queues without
any protection. When an nr_hw_queues update and blk_mq_in_flight/rw
run concurrently, a panic comes up.
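
A simplified picture of the race (illustration only):

/*
 * CPU0: accounting                     CPU1: __blk_mq_update_nr_hw_queues
 *
 * blk_mq_in_flight
 *   blk_mq_queue_tag_busy_iter
 *     reads q->nr_hw_queues and
 *     q->queue_hw_ctx[i]
 *                                      blk_mq_realloc_hw_ctxs
 *                                        frees/reallocates hctxs and
 *                                        updates q->nr_hw_queues
 *     dereferences a stale hctx
 *       -> use after free, panic
 */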


V2:
 - remove blk_mq_sched_init/exit_hctx in patch 1.
 - use q_usage_counter instead of adding a new member, as suggested
   by Ming, in patch 2.
 - comment modifications.

Jianchao Wang (2):
blk-mq: init hctx sched after update ctx and hctx mapping
blk-mq: sync the update nr_hw_queues with part_in_flight

 block/blk-mq-sched.c   | 44 ----------------------------------
 block/blk-mq-sched.h   |  5 ----
 block/blk-mq.c         | 65 +++++++++++++++++++++++++++++++++++++++++++-------
 block/blk.h            |  2 ++
 block/elevator.c       | 20 +++++++++-------
 include/linux/blkdev.h |  3 +++
 6 files changed, 74 insertions(+), 65 deletions(-)

Thanks
Jianchao


* [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-17  3:54 [PATCH V2 0/2] fixes for the updating nr_hw_queues Jianchao Wang
@ 2018-08-17  3:54 ` Jianchao Wang
  2018-08-17  9:33   ` Ming Lei
  2018-08-17  3:54 ` [PATCH V2 2/2] blk-mq: sync the update nr_hw_queues with part_in_flight Jianchao Wang
  1 sibling, 1 reply; 11+ messages in thread
From: Jianchao Wang @ 2018-08-17  3:54 UTC (permalink / raw)
  To: axboe; +Cc: tom.leiming, bart.vanassche, keith.busch, linux-block, linux-kernel

Currently, when nr_hw_queues is updated, the io scheduler's init_hctx
is invoked before the mapping between ctx and hctx has been adapted
correctly by blk_mq_map_swqueue. The io scheduler's init_hctx (kyber)
may depend on this mapping, get a wrong result and finally panic.
A simple way to fix this is to switch the io scheduler to none before
updating nr_hw_queues, and then switch it back after the update.
To achieve this, we add a new member elv_type in request_queue to
save the original elevator and adapt and export elevator_switch_mq.
blk_mq_sched_init_/exit_hctx are also removed since nobody uses
them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq-sched.c   | 44 --------------------------------------------
 block/blk-mq-sched.h   |  5 -----
 block/blk-mq.c         | 36 ++++++++++++++++++++++++++++--------
 block/blk.h            |  2 ++
 block/elevator.c       | 20 ++++++++++++--------
 include/linux/blkdev.h |  3 +++
 6 files changed, 45 insertions(+), 65 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index cf9c66c..29bfe80 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -462,50 +462,6 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q)
 		blk_mq_sched_free_tags(set, hctx, i);
 }
 
-int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
-			   unsigned int hctx_idx)
-{
-	struct elevator_queue *e = q->elevator;
-	int ret;
-
-	if (!e)
-		return 0;
-
-	ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
-	if (ret)
-		return ret;
-
-	if (e->type->ops.mq.init_hctx) {
-		ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
-		if (ret) {
-			blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
-			return ret;
-		}
-	}
-
-	blk_mq_debugfs_register_sched_hctx(q, hctx);
-
-	return 0;
-}
-
-void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
-			    unsigned int hctx_idx)
-{
-	struct elevator_queue *e = q->elevator;
-
-	if (!e)
-		return;
-
-	blk_mq_debugfs_unregister_sched_hctx(hctx);
-
-	if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
-		e->type->ops.mq.exit_hctx(hctx, hctx_idx);
-		hctx->sched_data = NULL;
-	}
-
-	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
-}
-
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f93..4e028ee 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -28,11 +28,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
-int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
-			   unsigned int hctx_idx);
-void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
-			    unsigned int hctx_idx);
-
 static inline bool
 blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5efd789..de7027f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2147,8 +2147,6 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_request)
 		set->ops->exit_request(set, hctx->fq->flush_rq, hctx_idx);
 
-	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
-
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
@@ -2216,12 +2214,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
 		goto free_bitmap;
 
-	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
-		goto exit_hctx;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
 	if (!hctx->fq)
-		goto sched_exit_hctx;
+		goto exit_hctx;
 
 	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
 		goto free_fq;
@@ -2235,8 +2230,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
  free_fq:
 	kfree(hctx->fq);
- sched_exit_hctx:
-	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
@@ -2913,6 +2906,25 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue(q);
 
+	/*
+	 * Switch the io scheduler to none to clean up its data; it will be
+	 * switched back after the mapping between cpu and hw queues is updated.
+	 */
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		if (!q->elevator) {
+			q->elv_type = NULL;
+			continue;
+		}
+		q->elv_type = q->elevator->type;
+		mutex_lock(&q->sysfs_lock);
+		/*
+		 * elevator_release will put it.
+		 */
+		__module_get(q->elv_type->elevator_owner);
+		elevator_switch_mq(q, NULL);
+		mutex_unlock(&q->sysfs_lock);
+	}
+
 	set->nr_hw_queues = nr_hw_queues;
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -2920,6 +2932,14 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_queue_reinit(q);
 	}
 
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		if (!q->elv_type)
+			continue;
+
+		mutex_lock(&q->sysfs_lock);
+		elevator_switch_mq(q, q->elv_type);
+		mutex_unlock(&q->sysfs_lock);
+	}
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_unfreeze_queue(q);
 }
diff --git a/block/blk.h b/block/blk.h
index d4d67e9..0c9bc8d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -234,6 +234,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 
 int elevator_init(struct request_queue *);
 int elevator_init_mq(struct request_queue *q);
+int elevator_switch_mq(struct request_queue *q,
+			      struct elevator_type *new_e);
 void elevator_exit(struct request_queue *, struct elevator_queue *);
 int elv_register_queue(struct request_queue *q);
 void elv_unregister_queue(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index fa828b5..5ea6e7d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -933,16 +933,13 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
-static int elevator_switch_mq(struct request_queue *q,
+int elevator_switch_mq(struct request_queue *q,
 			      struct elevator_type *new_e)
 {
 	int ret;
 
 	lockdep_assert_held(&q->sysfs_lock);
 
-	blk_mq_freeze_queue(q);
-	blk_mq_quiesce_queue(q);
-
 	if (q->elevator) {
 		if (q->elevator->registered)
 			elv_unregister_queue(q);
@@ -968,8 +965,6 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
-	blk_mq_unquiesce_queue(q);
-	blk_mq_unfreeze_queue(q);
 	return ret;
 }
 
@@ -1021,8 +1016,17 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 	lockdep_assert_held(&q->sysfs_lock);
 
-	if (q->mq_ops)
-		return elevator_switch_mq(q, new_e);
+	if (q->mq_ops) {
+		blk_mq_freeze_queue(q);
+		blk_mq_quiesce_queue(q);
+
+		err = elevator_switch_mq(q, new_e);
+
+		blk_mq_unquiesce_queue(q);
+		blk_mq_unfreeze_queue(q);
+
+		return err;
+	}
 
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d6869e0..ee930c4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -437,6 +437,9 @@ struct request_queue {
 	struct list_head	queue_head;
 	struct request		*last_merge;
 	struct elevator_queue	*elevator;
+
+	/* used when update nr_hw_queues */
+	struct elevator_type	*elv_type;
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
-- 
2.7.4



* [PATCH V2 2/2] blk-mq: sync the update nr_hw_queues with part_in_flight
  2018-08-17  3:54 [PATCH V2 0/2] fixes for the updating nr_hw_queues Jianchao Wang
  2018-08-17  3:54 ` [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping Jianchao Wang
@ 2018-08-17  3:54 ` Jianchao Wang
  2018-08-17  9:36   ` Ming Lei
  1 sibling, 1 reply; 11+ messages in thread
From: Jianchao Wang @ 2018-08-17  3:54 UTC (permalink / raw)
  To: axboe; +Cc: tom.leiming, bart.vanassche, keith.busch, linux-block, linux-kernel

For blk-mq, part_in_flight/rw invokes blk_mq_in_flight/rw to account
for the in-flight requests. It accesses queue_hw_ctx and nr_hw_queues
without any protection, so when an nr_hw_queues update and
blk_mq_in_flight/rw run concurrently, a panic comes up.

Before nr_hw_queues is updated, the queue is frozen, so we can use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. Also, since both the
check and blk_mq_queue_tag_busy_iter run under an rcu critical section,
__blk_mq_update_nr_hw_queues can ensure the zeroed q_usage_counter
is globally visible.
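
This works because blk_mq_freeze_queue already waits for the counter to
drop to zero; roughly (simplified from the existing code):

void blk_freeze_queue_start(struct request_queue *q)
{
	if (atomic_inc_return(&q->mq_freeze_depth) == 1) {
		/* switch q_usage_counter to atomic mode; it can now hit zero */
		percpu_ref_kill(&q->q_usage_counter);
		if (q->mq_ops)
			blk_mq_run_hw_queues(q, false);
	}
}

void blk_mq_freeze_queue_wait(struct request_queue *q)
{
	/* the queue stays frozen while q_usage_counter reads zero */
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}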

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index de7027f..9ec98bd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -112,7 +112,22 @@ void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 	struct mq_inflight mi = { .part = part, .inflight = inflight, };
 
 	inflight[0] = inflight[1] = 0;
+
+	/*
+	 * __blk_mq_update_nr_hw_queues updates nr_hw_queues and queue_hw_ctx
+	 * after freezing the queue, so we can use q_usage_counter to avoid
+	 * racing with it. __blk_mq_update_nr_hw_queues uses synchronize_rcu
+	 * to ensure that all users of blk_mq_in_flight leave the critical
+	 * section and see the zeroed q_usage_counter.
+	 */
+	rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+		rcu_read_unlock();
+		return;
+	}
+
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	rcu_read_unlock();
 }
 
 static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
@@ -131,7 +146,18 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 	struct mq_inflight mi = { .part = part, .inflight = inflight, };
 
 	inflight[0] = inflight[1] = 0;
+
+	/*
+	 * See the comment in blk_mq_in_flight.
+	 */
+	rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+		rcu_read_unlock();
+		return;
+	}
+
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
+	rcu_read_unlock();
 }
 
 void blk_freeze_queue_start(struct request_queue *q)
@@ -2905,7 +2931,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue(q);
-
+	/*
+	 * Sync with blk_mq_in_flight and blk_mq_in_flight_rw
+	 */
+	synchronize_rcu();
 	/*
 	 * switch io scheduler to NULL to clean up the data in it.
 	 * will get it back after update mapping between cpu and hw queues.
-- 
2.7.4



* Re: [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-17  3:54 ` [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping Jianchao Wang
@ 2018-08-17  9:33   ` Ming Lei
  2018-08-20  1:29     ` jianchao.wang
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-08-17  9:33 UTC (permalink / raw)
  To: Jianchao Wang
  Cc: Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List

On Fri, Aug 17, 2018 at 11:54 AM, Jianchao Wang
<jianchao.w.wang@oracle.com> wrote:
> Currently, when nr_hw_queues is updated, the io scheduler's init_hctx
> is invoked before the mapping between ctx and hctx has been adapted
> correctly by blk_mq_map_swqueue. The io scheduler's init_hctx (kyber)
> may depend on this mapping, get a wrong result and finally panic.
> A simple way to fix this is to switch the io scheduler to none before
> updating nr_hw_queues, and then switch it back after the update.
> To achieve this, we add a new member elv_type in request_queue to
> save the original elevator and adapt and export elevator_switch_mq.
> blk_mq_sched_init_/exit_hctx are also removed since nobody uses
> them any more.
>
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> ---
>  block/blk-mq-sched.c   | 44 --------------------------------------------
>  block/blk-mq-sched.h   |  5 -----
>  block/blk-mq.c         | 36 ++++++++++++++++++++++++++++--------
>  block/blk.h            |  2 ++
>  block/elevator.c       | 20 ++++++++++++--------
>  include/linux/blkdev.h |  3 +++
>  6 files changed, 45 insertions(+), 65 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index cf9c66c..29bfe80 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -462,50 +462,6 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q)
>                 blk_mq_sched_free_tags(set, hctx, i);
>  }
>
> -int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> -                          unsigned int hctx_idx)
> -{
> -       struct elevator_queue *e = q->elevator;
> -       int ret;
> -
> -       if (!e)
> -               return 0;
> -
> -       ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
> -       if (ret)
> -               return ret;
> -
> -       if (e->type->ops.mq.init_hctx) {
> -               ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
> -               if (ret) {
> -                       blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
> -                       return ret;
> -               }
> -       }
> -
> -       blk_mq_debugfs_register_sched_hctx(q, hctx);
> -
> -       return 0;
> -}
> -
> -void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> -                           unsigned int hctx_idx)
> -{
> -       struct elevator_queue *e = q->elevator;
> -
> -       if (!e)
> -               return;
> -
> -       blk_mq_debugfs_unregister_sched_hctx(hctx);
> -
> -       if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
> -               e->type->ops.mq.exit_hctx(hctx, hctx_idx);
> -               hctx->sched_data = NULL;
> -       }
> -
> -       blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
> -}
> -
>  int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>  {
>         struct blk_mq_hw_ctx *hctx;
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 0cb8f93..4e028ee 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -28,11 +28,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
>  int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
>  void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
>
> -int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> -                          unsigned int hctx_idx);
> -void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> -                           unsigned int hctx_idx);
> -
>  static inline bool
>  blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
>  {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5efd789..de7027f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2147,8 +2147,6 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>         if (set->ops->exit_request)
>                 set->ops->exit_request(set, hctx->fq->flush_rq, hctx_idx);
>
> -       blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
> -
>         if (set->ops->exit_hctx)
>                 set->ops->exit_hctx(hctx, hctx_idx);
>
> @@ -2216,12 +2214,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
>             set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
>                 goto free_bitmap;
>
> -       if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
> -               goto exit_hctx;
> -
>         hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
>         if (!hctx->fq)
> -               goto sched_exit_hctx;
> +               goto exit_hctx;
>
>         if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
>                 goto free_fq;
> @@ -2235,8 +2230,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
>
>   free_fq:
>         kfree(hctx->fq);
> - sched_exit_hctx:
> -       blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
>   exit_hctx:
>         if (set->ops->exit_hctx)
>                 set->ops->exit_hctx(hctx, hctx_idx);
> @@ -2913,6 +2906,25 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>         list_for_each_entry(q, &set->tag_list, tag_set_list)
>                 blk_mq_freeze_queue(q);
>
> +       /*
> +        * Switch the io scheduler to none to clean up its data; it will be
> +        * switched back after the mapping between cpu and hw queues is updated.
> +        */
> +       list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +               if (!q->elevator) {
> +                       q->elv_type = NULL;
> +                       continue;
> +               }
> +               q->elv_type = q->elevator->type;
> +               mutex_lock(&q->sysfs_lock);
> +               /*
> +                * elevator_release will put it.
> +                */
> +               __module_get(q->elv_type->elevator_owner);

I understand that what elevator_release() drops is the reference taken in
elevator_get(), but what will be the matching put for the above __module_get()?

Thanks,
Ming Lei


* Re: [PATCH V2 2/2] blk-mq: sync the update nr_hw_queues with part_in_flight
  2018-08-17  3:54 ` [PATCH V2 2/2] blk-mq: sync the update nr_hw_queues with part_in_flight Jianchao Wang
@ 2018-08-17  9:36   ` Ming Lei
  2018-08-20  1:32     ` jianchao.wang
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-08-17  9:36 UTC (permalink / raw)
  To: Jianchao Wang
  Cc: Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List

On Fri, Aug 17, 2018 at 11:54 AM, Jianchao Wang
<jianchao.w.wang@oracle.com> wrote:
> For blk-mq, part_in_flight/rw invokes blk_mq_in_flight/rw to account
> for the in-flight requests. It accesses queue_hw_ctx and nr_hw_queues
> without any protection, so when an nr_hw_queues update and
> blk_mq_in_flight/rw run concurrently, a panic comes up.
>
> Before nr_hw_queues is updated, the queue is frozen, so we can use
> q_usage_counter to avoid the race. percpu_ref_is_zero is used here
> so that we will not miss any in-flight request. Also, since both the
> check and blk_mq_queue_tag_busy_iter run under an rcu critical section,
> __blk_mq_update_nr_hw_queues can ensure the zeroed q_usage_counter
> is globally visible.
>
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> ---
>  block/blk-mq.c | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index de7027f..9ec98bd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -112,7 +112,22 @@ void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
>         struct mq_inflight mi = { .part = part, .inflight = inflight, };
>
>         inflight[0] = inflight[1] = 0;
> +
> +       /*
> +        * __blk_mq_update_nr_hw_queues updates nr_hw_queues and queue_hw_ctx
> +        * after freezing the queue, so we can use q_usage_counter to avoid
> +        * racing with it. __blk_mq_update_nr_hw_queues uses synchronize_rcu
> +        * to ensure that all users of blk_mq_in_flight leave the critical
> +        * section and see the zeroed q_usage_counter.
> +        */
> +       rcu_read_lock();
> +       if (percpu_ref_is_zero(&q->q_usage_counter)) {
> +               rcu_read_unlock();
> +               return;
> +       }
> +
>         blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
> +       rcu_read_unlock();
>  }
>
>  static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
> @@ -131,7 +146,18 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
>         struct mq_inflight mi = { .part = part, .inflight = inflight, };
>
>         inflight[0] = inflight[1] = 0;
> +
> +       /*
> +        * See the comment in blk_mq_in_flight.
> +        */
> +       rcu_read_lock();
> +       if (percpu_ref_is_zero(&q->q_usage_counter)) {
> +               rcu_read_unlock();
> +               return;
> +       }
> +
>         blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
> +       rcu_read_unlock();

I'd suggest putting the rcu_* and percpu_ref_is_zero() calls into
blk_mq_queue_tag_busy_iter().
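
Something like this, perhaps (hypothetical sketch, untested):

void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
				void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/*
	 * Bail out if the queue is frozen: __blk_mq_update_nr_hw_queues
	 * may be tearing the hctxs down right now.
	 */
	rcu_read_lock();
	if (percpu_ref_is_zero(&q->q_usage_counter)) {
		rcu_read_unlock();
		return;
	}

	queue_for_each_hw_ctx(q, hctx, i) {
		struct blk_mq_tags *tags = hctx->tags;

		if (!blk_mq_hw_queue_mapped(hctx))
			continue;

		if (tags->nr_reserved_tags)
			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
	}
	rcu_read_unlock();
}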


Thanks,
Ming Lei


* Re: [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-17  9:33   ` Ming Lei
@ 2018-08-20  1:29     ` jianchao.wang
  2018-08-20  2:24       ` Ming Lei
  0 siblings, 1 reply; 11+ messages in thread
From: jianchao.wang @ 2018-08-20  1:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List

Hi Ming

On 08/17/2018 05:33 PM, Ming Lei wrote:
>> +               /*
>> +                * elevator_release will put it.
>> +                */
>> +               __module_get(q->elv_type->elevator_owner);
> I understand that what elevator_release() drops is the reference taken in
> elevator_get(), but what will be the matching put for the above __module_get()?


Sorry for my bad description.

The code path is:

elevator_release
  -> elevator_put(e->type)
    -> module_put(e->elevator_owner)

In the normal elevator switch path, elevator_get will hold a reference on
the elevator_owner module.
In this patch set, the elevator_type is saved directly. To prevent the io
scheduler module from being removed, we need to hold a reference on the module.

Thanks
Jianchao


* Re: [PATCH V2 2/2] blk-mq: sync the update nr_hw_queues with part_in_flight
  2018-08-17  9:36   ` Ming Lei
@ 2018-08-20  1:32     ` jianchao.wang
  0 siblings, 0 replies; 11+ messages in thread
From: jianchao.wang @ 2018-08-20  1:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List

Hi Ming

On 08/17/2018 05:36 PM, Ming Lei wrote:
> I'd suggest putting the rcu_* and percpu_ref_is_zero() calls into
> blk_mq_queue_tag_busy_iter().

Yes, that's fine with me. :)
I will change it in the next version.

Thanks
Jianchao


* Re: [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-20  1:29     ` jianchao.wang
@ 2018-08-20  2:24       ` Ming Lei
  2018-08-20  3:57         ` jianchao.wang
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-08-20  2:24 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List, Ming Lei

On Mon, Aug 20, 2018 at 9:29 AM jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
>
> Hi Ming
>
> On 08/17/2018 05:33 PM, Ming Lei wrote:
> >> +               /*
> >> +                * elevator_release will put it.
> >> +                */
> >> +               __module_get(q->elv_type->elevator_owner);
> > I understand that what elevator_release() drops is the reference taken in
> > elevator_get(), but what will be the matching put for the above __module_get()?
>
>
> Sorry for my bad description.
>
> The code path is:
>
> elevator_release
>   -> elevator_put(e->type)
>     -> module_put(e->elevator_owner)
>
> In normal elevator switch path, elevator_get will hold a reference counter of the
> elevator_owner.
> In this patch set, the elevator_type is saved directly. To prevent the io scheduler module
> is removed, we need to hold a reference of the module.

Yeah, I agree that the module reference needs to be held, but it needs to
be released too.

My concern is that the module reference taken in your patch is never
released. The module reference is a counter too, so the get and put
operations should be matched.


thanks,
Ming Lei


* Re: [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-20  2:24       ` Ming Lei
@ 2018-08-20  3:57         ` jianchao.wang
  2018-08-20  4:18           ` Ming Lei
  0 siblings, 1 reply; 11+ messages in thread
From: jianchao.wang @ 2018-08-20  3:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List, Ming Lei

Hi Ming

On 08/20/2018 10:24 AM, Ming Lei wrote:
>> The code path is:
>>
>> elevator_release
>>   -> elevator_put(e->type)
>>     -> module_put(e->elevator_owner)
>>
>> In the normal elevator switch path, elevator_get will hold a reference on
>> the elevator_owner module.
>> In this patch set, the elevator_type is saved directly. To prevent the io
>> scheduler module from being removed, we need to hold a reference on the module.
> Yeah, I agree that the module reference needs to be held, but it needs to
> be released too.
> 
> My concern is that the module reference taken in your patch is never
> released. The module reference is a counter too, so the get and put
> operations should be matched.

elevator_switch_mq
  -> elevator_exit
This puts a reference on the elevator_queue associated with the request_queue,
and elevator_release is invoked when the refcount of elevator_queue->kobj
reaches zero; elevator_release then puts the reference on the io scheduler
module.

The elevator_queue structure is allocated and freed every time we switch
the io scheduler, so elevator_release will always be invoked.

This is the put corresponding to the get in this patch.
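
In other words (simplified call chain, for illustration):

/*
 * __blk_mq_update_nr_hw_queues
 *     __module_get(q->elv_type->elevator_owner)   // the get in this patch
 *     elevator_switch_mq(q, NULL)
 *         elevator_exit(q, old)
 *             kobject_put(&old->kobj)
 *                 elevator_release         // puts the old elevator_get ref
 *     ...
 *     elevator_switch_mq(q, q->elv_type)   // new elevator_queue now owns
 *                                          // the reference taken above
 *
 * later, at teardown or the next elevator switch:
 *     elevator_exit(q, new)
 *         kobject_put(&new->kobj)
 *             elevator_release
 *                 elevator_put(e->type)
 *                     module_put(e->elevator_owner)   // the matching put
 */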

Thanks
Jianchao


* Re: [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-20  3:57         ` jianchao.wang
@ 2018-08-20  4:18           ` Ming Lei
  2018-08-20  5:21             ` jianchao.wang
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-08-20  4:18 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List

On Mon, Aug 20, 2018 at 11:57:52AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 08/20/2018 10:24 AM, Ming Lei wrote:
> >> The code path is:
> >>
> >> elevator_release
> >>   -> elevator_put(e->type)
> >>     -> module_put(e->elevator_owner)
> >>
> >> In the normal elevator switch path, elevator_get will hold a reference on
> >> the elevator_owner module.
> >> In this patch set, the elevator_type is saved directly. To prevent the io
> >> scheduler module from being removed, we need to hold a reference on the module.
> > Yeah, I agree that the module reference needs to be held, but it needs to
> > be released too.
> > 
> > My concern is that the module reference taken in your patch is never
> > released. The module reference is a counter too, so the get and put
> > operations should be matched.
> 
> elevator_switch_mq
>   -> elevator_exit
> This puts a reference on the elevator_queue associated with the request_queue,
> and elevator_release is invoked when the refcount of elevator_queue->kobj
> reaches zero; elevator_release then puts the reference on the io scheduler
> module.
> 
> The elevator_queue structure is allocated and freed every time we switch
> the io scheduler, so elevator_release will always be invoked.
> 
> This is the put corresponding to the get in this patch.

OK, got it, the added __module_get() is just like what elevator_get()
does from __elevator_change(). Sorry for the noise.

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming


* Re: [PATCH V2 1/2] blk-mq: init hctx sched after update ctx and hctx mapping
  2018-08-20  4:18           ` Ming Lei
@ 2018-08-20  5:21             ` jianchao.wang
  0 siblings, 0 replies; 11+ messages in thread
From: jianchao.wang @ 2018-08-20  5:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, Bart Van Assche, Keith Busch, linux-block,
	Linux Kernel Mailing List



On 08/20/2018 12:18 PM, Ming Lei wrote:
> On Mon, Aug 20, 2018 at 11:57:52AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 08/20/2018 10:24 AM, Ming Lei wrote:
>>>> The code path is:
>>>>
>>>> elevator_release
>>>>   -> elevator_put(e->type)
>>>>     -> module_put(e->elevator_owner)
>>>>
>>>> In the normal elevator switch path, elevator_get will hold a reference on
>>>> the elevator_owner module.
>>>> In this patch set, the elevator_type is saved directly. To prevent the io
>>>> scheduler module from being removed, we need to hold a reference on the module.
>>> Yeah, I agree that the module reference needs to be held, but it needs to
>>> be released too.
>>>
>>> My concern is that the module reference taken in your patch is never
>>> released. The module reference is a counter too, so the get and put
>>> operations should be matched.
>>
>> elevator_switch_mq
>>   -> elevator_exit
>> This puts a reference on the elevator_queue associated with the request_queue,
>> and elevator_release is invoked when the refcount of elevator_queue->kobj
>> reaches zero; elevator_release then puts the reference on the io scheduler
>> module.
>>
>> The elevator_queue structure is allocated and freed every time we switch
>> the io scheduler, so elevator_release will always be invoked.
>>
>> This is the put corresponding to the get in this patch.
> 
> OK, got it, the added __module_get() is just like what elevator_get()
> does from __elevator_change(). Sorry for the noise.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> 

Thanks very much for your review.

I will add more comments describing the added __module_get() in V3.

Thanks
Jianchao
 

> Thanks,
> Ming
> 

