linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling
@ 2015-09-26 17:09 Akinobu Mita
  2015-09-26 17:09 ` [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation Akinobu Mita
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Akinobu Mita, Jens Axboe, Ming Lei, Tejun Heo, Keith Busch,
	Christoph Hellwig, Wanpeng Li

This patchset addresses several race conditions on cpu hotplug handling
for blk-mq.  All problems can be reproduced with the following script.

while true; do
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
done &

while true; do
	modprobe -r null_blk
	modprobe null_blk queue_mode=2 irqmode=1
	sleep 0.1
done

* Changes from v3
- Rebased to the latest kernel
- Add Reviewed-by tags

* Changes from v2
- Add regression fix for hctx->tags->cpumask
- Remove the per-hctx BLK_MQ_F_SYSFS_UP flag and use a per-queue
  mq_sysfs_init_done flag instead, with appropriate locking, in order
  to track the lifetime of the 'mq' sysfs entry
- Add comments on non-trivial parts, as suggested by Ming

* Changes from v1
- Release q->mq_map in blk_mq_release()
- Fix deadlock when reading cpu_list
- Fix race between freeze and unfreeze

Akinobu Mita (7):
  blk-mq: avoid setting hctx->tags->cpumask before allocation
  blk-mq: fix sysfs registration/unregistration race
  blk-mq: Fix use-after-free of q->mq_map
  blk-mq: fix q->mq_usage_counter access race
  blk-mq: avoid inserting requests before establishing new mapping
  blk-mq: fix freeze queue race
  blk-mq: fix deadlock when reading cpu_list

 block/blk-core.c       |   1 +
 block/blk-mq-cpumap.c  |   9 +++--
 block/blk-mq-sysfs.c   |  36 ++++++++++++------
 block/blk-mq.c         | 100 +++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h         |   3 +-
 include/linux/blk-mq.h |   1 -
 include/linux/blkdev.h |   8 ++++
 7 files changed, 116 insertions(+), 42 deletions(-)

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
-- 
1.9.1



* [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-26 18:12   ` Christoph Hellwig
  2015-09-26 17:09 ` [PATCH v4 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Akinobu Mita, Keith Busch, Jens Axboe, Ming Lei, Christoph Hellwig

When an unmapped hw queue is remapped after the CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is set up in
blk_mq_map_swqueue(); otherwise it causes a null pointer dereference.
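
For illustration, the pre-patch ordering in blk_mq_map_swqueue() was
roughly the following (a simplified sketch of this era's blk-mq code,
not a verbatim excerpt; see the diff below for the authoritative change):

	queue_for_each_ctx(q, ctx, i) {
		hctx = q->mq_ops->map_queue(q, i);
		/*
		 * If this hw queue was unmapped before the topology
		 * change, hctx->tags is still NULL at this point ...
		 */
		cpumask_set_cpu(i, hctx->tags->cpumask);  /* NULL deref */
	}

	queue_for_each_hw_ctx(q, hctx, i) {
		/* ... because the tags are only (re)allocated here */
		if (!set->tags[i])
			set->tags[i] = blk_mq_init_rq_map(set, i);
		hctx->tags = set->tags[i];
	}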

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4..2fd7283 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,7 +1811,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1851,6 +1850,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		hctx->next_cpu = cpumask_first(hctx->cpumask);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
+
+	queue_for_each_ctx(q, ctx, i) {
+		if (!cpu_online(i))
+			continue;
+
+		hctx = q->mq_ops->map_queue(q, i);
+		cpumask_set_cpu(i, hctx->tags->cpumask);
+	}
 }
 
 static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
-- 
1.9.1



* [PATCH v4 2/7] blk-mq: fix sysfs registration/unregistration race
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
  2015-09-26 17:09 ` [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-29  6:51   ` Christoph Hellwig
  2015-09-26 17:09 ` [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map Akinobu Mita
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei, Christoph Hellwig

There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

null_add_dev
    --> blk_mq_init_queue
        --> blk_mq_init_allocated_queue
            --> add to 'all_q_list' (*)
    --> add_disk
        --> blk_register_queue
            --> blk_mq_register_disk (++)

null_del_dev
    --> del_gendisk
        --> blk_unregister_queue
            --> blk_mq_unregister_disk (--)
    --> blk_cleanup_queue
        --> blk_mq_free_queue
            --> del from 'all_q_list' (*)

blk_mq_queue_reinit
    --> blk_mq_sysfs_unregister (-)
    --> blk_mq_sysfs_register (+)

While the request queue is on 'all_q_list' (*), blk_mq_queue_reinit()
can be called for the queue at any time by the CPU hotplug callback.
But blk_mq_sysfs_unregister (-) and blk_mq_sysfs_register (+) in
blk_mq_queue_reinit must not be called before blk_mq_register_disk (++)
or after blk_mq_unregister_disk (--) has finished, because
'/sys/block/*/mq/' does not exist then.

There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
be used to track this sysfs state, but it only fixes this issue
partially.

In order to fix it completely, we need a per-queue flag instead of a
per-hctx flag, with appropriate locking.  So this introduces
q->mq_sysfs_init_done, which is properly protected by all_q_mutex.

Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex held.  Since hctx->nr_ctx is temporarily reset and then
updated in blk_mq_map_swqueue(), we must prevent blk_mq_register_hctx()
from seeing the temporary hctx->nr_ctx value during CPU hotplug
handling or while adding/deleting a gendisk.
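
With the patch applied, both paths serialize on all_q_mutex
(blk_mq_disable_hotplug() takes the same mutex internally), so the flag
cannot change underneath a reader.  Roughly (a sketch, not the full code):

	/* register path: blk_mq_register_disk() */
	blk_mq_disable_hotplug();	/* mutex_lock(&all_q_mutex) */
	/* add '/sys/block/*/mq/' and the per-hctx kobjects */
	q->mq_sysfs_init_done = true;
	blk_mq_enable_hotplug();	/* mutex_unlock(&all_q_mutex) */

	/* CPU hotplug path: blk_mq_queue_reinit() */
	mutex_lock(&all_q_mutex);
	blk_mq_sysfs_unregister(q);	/* no-op until mq_sysfs_init_done */
	/* ... remap sw queues ... */
	blk_mq_sysfs_register(q);	/* likewise a no-op */
	mutex_unlock(&all_q_mutex);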

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-sysfs.c   | 30 ++++++++++++++++++++++--------
 block/blk-mq.c         |  6 +++---
 include/linux/blk-mq.h |  1 -
 include/linux/blkdev.h |  2 ++
 4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 279c5d6..189f5ae 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -343,7 +343,7 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_ctx *ctx;
 	int i;
 
-	if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
+	if (!hctx->nr_ctx)
 		return;
 
 	hctx_for_each_ctx(hctx, ctx, i)
@@ -358,7 +358,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_ctx *ctx;
 	int i, ret;
 
-	if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
+	if (!hctx->nr_ctx)
 		return 0;
 
 	ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
@@ -381,6 +381,8 @@ void blk_mq_unregister_disk(struct gendisk *disk)
 	struct blk_mq_ctx *ctx;
 	int i, j;
 
+	blk_mq_disable_hotplug();
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		blk_mq_unregister_hctx(hctx);
 
@@ -395,6 +397,9 @@ void blk_mq_unregister_disk(struct gendisk *disk)
 	kobject_put(&q->mq_kobj);
 
 	kobject_put(&disk_to_dev(disk)->kobj);
+
+	q->mq_sysfs_init_done = false;
+	blk_mq_enable_hotplug();
 }
 
 static void blk_mq_sysfs_init(struct request_queue *q)
@@ -425,27 +430,30 @@ int blk_mq_register_disk(struct gendisk *disk)
 	struct blk_mq_hw_ctx *hctx;
 	int ret, i;
 
+	blk_mq_disable_hotplug();
+
 	blk_mq_sysfs_init(q);
 
 	ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
 	if (ret < 0)
-		return ret;
+		goto out;
 
 	kobject_uevent(&q->mq_kobj, KOBJ_ADD);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->flags |= BLK_MQ_F_SYSFS_UP;
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
 			break;
 	}
 
-	if (ret) {
+	if (ret)
 		blk_mq_unregister_disk(disk);
-		return ret;
-	}
+	else
+		q->mq_sysfs_init_done = true;
+out:
+	blk_mq_enable_hotplug();
 
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_mq_register_disk);
 
@@ -454,6 +462,9 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	if (!q->mq_sysfs_init_done)
+		return;
+
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
 }
@@ -463,6 +474,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i, ret = 0;
 
+	if (!q->mq_sysfs_init_done)
+		return ret;
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2fd7283..0262131 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2035,13 +2035,13 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		goto err_hctxs;
 
 	mutex_lock(&all_q_mutex);
-	list_add_tail(&q->all_q_node, &all_q_list);
-	mutex_unlock(&all_q_mutex);
 
+	list_add_tail(&q->all_q_node, &all_q_list);
 	blk_mq_add_queue_tag_set(set, q);
-
 	blk_mq_map_swqueue(q);
 
+	mutex_unlock(&all_q_mutex);
+
 	return q;
 
 err_hctxs:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 37d1602..b80ba45 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -145,7 +145,6 @@ enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
-	BLK_MQ_F_SYSFS_UP	= 1 << 3,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99da9eb..19c2e94 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -456,6 +456,8 @@ struct request_queue {
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
 	struct bio_set		*bio_split;
+
+	bool			mq_sysfs_init_done;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
1.9.1



* [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
  2015-09-26 17:09 ` [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation Akinobu Mita
  2015-09-26 17:09 ` [PATCH v4 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-29  6:52   ` Christoph Hellwig
  2015-09-26 17:09 ` [PATCH v4 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei, Christoph Hellwig

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
q->mq_map via blk_mq_update_queue_map() for all request queues in
all_q_list.  On the other hand, q->mq_map is released before the queue
is deleted from all_q_list.

So if a CPU hotplug event occurs in that window, an invalid memory
access can happen.  Fix it by releasing q->mq_map in blk_mq_release()
so that it happens later than the removal from all_q_list.
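
Schematically, the pre-patch window looks like this (a sketch; the two
columns run concurrently and indentation shows call nesting):

	/* teardown path                  CPU hotplug path                */
	blk_mq_free_queue(q)
	    kfree(q->mq_map)
	                                  blk_mq_queue_reinit_notify()
	                                      /* q is still on all_q_list */
	                                      blk_mq_update_queue_map(q->mq_map, ...)
	                                          /* use after free */
	    list_del_init(&q->all_q_node)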

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0262131..92648d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1925,6 +1925,9 @@ void blk_mq_release(struct request_queue *q)
 		kfree(hctx);
 	}
 
+	kfree(q->mq_map);
+	q->mq_map = NULL;
+
 	kfree(q->queue_hw_ctx);
 
 	/* ctx kobj stays in queue_ctx */
@@ -2070,11 +2073,6 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_free_hw_queues(q, set);
 
 	percpu_ref_exit(&q->mq_usage_counter);
-
-	kfree(q->mq_map);
-
-	q->mq_map = NULL;
-
 	mutex_lock(&all_q_mutex);
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
-- 
1.9.1



* [PATCH v4 4/7] blk-mq: fix q->mq_usage_counter access race
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (2 preceding siblings ...)
  2015-09-26 17:09 ` [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-29  6:53   ` Christoph Hellwig
  2015-09-26 17:09 ` [PATCH v4 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei, Christoph Hellwig

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
q->mq_usage_counter while freezing all request queues in all_q_list.
On the other hand, q->mq_usage_counter is deinitialized in
blk_mq_free_queue() before deleting the queue from all_q_list.

So if a CPU hotplug event occurs in that window, percpu_ref_kill() is
called on a q->mq_usage_counter which has already been marked dead,
and it triggers a warning.  Fix it by deleting the queue from
all_q_list before destroying q->mq_usage_counter.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 92648d8..3a39184 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2067,15 +2067,16 @@ void blk_mq_free_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
+	mutex_lock(&all_q_mutex);
+	list_del_init(&q->all_q_node);
+	mutex_unlock(&all_q_mutex);
+
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
 
 	percpu_ref_exit(&q->mq_usage_counter);
-	mutex_lock(&all_q_mutex);
-	list_del_init(&q->all_q_node);
-	mutex_unlock(&all_q_mutex);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-- 
1.9.1



* [PATCH v4 5/7] blk-mq: avoid inserting requests before establishing new mapping
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (3 preceding siblings ...)
  2015-09-26 17:09 ` [PATCH v4 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-29  6:54   ` Christoph Hellwig
  2015-09-26 17:09 ` [PATCH v4 6/7] blk-mq: fix freeze queue race Akinobu Mita
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei, Christoph Hellwig

Notifier callbacks for the CPU_ONLINE action can run on a CPU other
than the one which was just onlined.  So it is possible for a process
running on the just-onlined CPU to insert a request and run the hw
queue before the new mapping is established by
blk_mq_queue_reinit_notify().

This can cause a problem when the CPU is onlined for the first time
since the request queue was initialized.  At this point ctx->index_hw
for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
zero until blk_mq_queue_reinit_notify() is called by the notifier
callbacks for the CPU_ONLINE action.

For example, suppose there is a single hw queue (hctx) and two CPU
queues (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined,
a request is inserted into ctx1->rq_list, and bit 0 is set in the
pending bitmap, as ctx1->index_hw is still zero.

And then, while running the hw queue, flush_busy_ctxs() finds that
bit 0 is set in the pending bitmap and tries to retrieve the requests
on hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so
the request on ctx1->rq_list is ignored.

Fix it by ensuring that the new mapping is established before the
onlined CPU starts running.
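
In code form, the lost-request scenario above is roughly the following
(a sketch using this era's blk-mq names; the dispatch side is simplified):

	/* CPU1 was just onlined; the mapping is not re-established yet,
	 * so ctx1->index_hw was never assigned and is still zero */

	/* insert path, running on CPU1 */
	list_add_tail(&rq->queuelist, &ctx1->rq_list);
	blk_mq_hctx_mark_pending(hctx, ctx1);	/* sets bit ctx1->index_hw == 0 */

	/* dispatch path */
	flush_busy_ctxs(hctx, &rq_list);
	/* sees bit 0 set and drains hctx->ctxs[0]->rq_list, which is
	 * ctx0->rq_list; the request on ctx1->rq_list is never seen */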

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-cpumap.c |  9 ++++----
 block/blk-mq.c        | 59 +++++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h        |  3 ++-
 3 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1e28ddb..8764c24 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
+int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+			    const struct cpumask *online_mask)
 {
 	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
 	cpumask_var_t cpus;
@@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
 
 	cpumask_clear(cpus);
 	nr_cpus = nr_uniq_cpus = 0;
-	for_each_online_cpu(i) {
+	for_each_cpu(i, online_mask) {
 		nr_cpus++;
 		first_sibling = get_first_sibling(i);
 		if (!cpumask_test_cpu(first_sibling, cpus))
@@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
 
 	queue = 0;
 	for_each_possible_cpu(i) {
-		if (!cpu_online(i)) {
+		if (!cpumask_test_cpu(i, online_mask)) {
 			map[i] = 0;
 			continue;
 		}
@@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
 	if (!map)
 		return NULL;
 
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
+	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
 		return map;
 
 	kfree(map);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3a39184..a5dbd06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1789,7 +1789,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 	}
 }
 
-static void blk_mq_map_swqueue(struct request_queue *q)
+static void blk_mq_map_swqueue(struct request_queue *q,
+			       const struct cpumask *online_mask)
 {
 	unsigned int i;
 	struct blk_mq_hw_ctx *hctx;
@@ -1806,7 +1807,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	 */
 	queue_for_each_ctx(q, ctx, i) {
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
-		if (!cpu_online(i))
+		if (!cpumask_test_cpu(i, online_mask))
 			continue;
 
 		hctx = q->mq_ops->map_queue(q, i);
@@ -1852,7 +1853,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 
 	queue_for_each_ctx(q, ctx, i) {
-		if (!cpu_online(i))
+		if (!cpumask_test_cpu(i, online_mask))
 			continue;
 
 		hctx = q->mq_ops->map_queue(q, i);
@@ -2037,13 +2038,15 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (blk_mq_init_hw_queues(q, set))
 		goto err_hctxs;
 
+	get_online_cpus();
 	mutex_lock(&all_q_mutex);
 
 	list_add_tail(&q->all_q_node, &all_q_list);
 	blk_mq_add_queue_tag_set(set, q);
-	blk_mq_map_swqueue(q);
+	blk_mq_map_swqueue(q, cpu_online_mask);
 
 	mutex_unlock(&all_q_mutex);
+	put_online_cpus();
 
 	return q;
 
@@ -2080,13 +2083,14 @@ void blk_mq_free_queue(struct request_queue *q)
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q)
+static void blk_mq_queue_reinit(struct request_queue *q,
+				const struct cpumask *online_mask)
 {
 	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
 
 	blk_mq_sysfs_unregister(q);
 
-	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
+	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
 
 	/*
 	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
@@ -2094,7 +2098,7 @@ static void blk_mq_queue_reinit(struct request_queue *q)
 	 * involves free and re-allocate memory, worthy doing?)
 	 */
 
-	blk_mq_map_swqueue(q);
+	blk_mq_map_swqueue(q, online_mask);
 
 	blk_mq_sysfs_register(q);
 }
@@ -2103,16 +2107,43 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
 				      unsigned long action, void *hcpu)
 {
 	struct request_queue *q;
+	int cpu = (unsigned long)hcpu;
+	/*
+	 * New online cpumask which is going to be set in this hotplug event.
+	 * Declare this cpumask as global, as cpu-hotplug operations are invoked
+	 * one-by-one and dynamically allocating this could result in a failure.
+	 */
+	static struct cpumask online_new;
 
 	/*
-	 * Before new mappings are established, hotadded cpu might already
-	 * start handling requests. This doesn't break anything as we map
-	 * offline CPUs to first hardware queue. We will re-init the queue
-	 * below to get optimal settings.
+	 * Before hotadded cpu starts handling requests, new mappings must
+	 * be established.  Otherwise, these requests in hw queue might
+	 * never be dispatched.
+	 *
+	 * For example, there is a single hw queue (hctx) and two CPU queues
+	 * (ctx0 for CPU0, and ctx1 for CPU1).
+	 *
+	 * Now CPU1 is just onlined and a request is inserted into
+	 * ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
+	 * still zero.
+	 *
+	 * And then while running hw queue, flush_busy_ctxs() finds bit0 is
+	 * set in pending bitmap and tries to retrieve requests in
+	 * hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0,
+	 * so the request in ctx1->rq_list is ignored.
 	 */
-	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
-	    action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		cpumask_copy(&online_new, cpu_online_mask);
+		break;
+	case CPU_UP_PREPARE:
+		cpumask_copy(&online_new, cpu_online_mask);
+		cpumask_set_cpu(cpu, &online_new);
+		break;
+	default:
 		return NOTIFY_OK;
+	}
 
 	mutex_lock(&all_q_mutex);
 
@@ -2136,7 +2167,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
 	}
 
 	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_mq_queue_reinit(q);
+		blk_mq_queue_reinit(q, &online_new);
 
 	list_for_each_entry(q, &all_q_list, all_q_node)
 		blk_mq_unfreeze_queue(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6a48c4c..f4fea79 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
  * CPU -> queue mappings
  */
 extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
+extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+				   const struct cpumask *online_mask);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 
 /*
-- 
1.9.1



* [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (4 preceding siblings ...)
  2015-09-26 17:09 ` [PATCH v4 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-26 17:32   ` Tejun Heo
  2015-09-26 17:09 ` [PATCH v4 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
  2015-09-29 17:31 ` [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Jens Axboe
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Akinobu Mita, Jens Axboe, Ming Lei, Tejun Heo, Christoph Hellwig

There are several race conditions while freezing a queue.

When unfreezing a queue, there is a small window between decrementing
q->mq_freeze_depth to zero and the percpu_ref_reinit() call on
q->mq_usage_counter.  If another context calls
blk_mq_freeze_queue_start() in that window, q->mq_freeze_depth is
increased from zero to one and percpu_ref_kill() is called on a
q->mq_usage_counter which is already killed.  The percpu refcount must
be re-initialized before it is killed again.

Also, there is a race condition while switching to percpu mode.
percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time as the following scenario is possible:

1. q->mq_usage_counter is initialized in atomic mode.
   (atomic counter: 1)

2. After the disk registration, a process like systemd-udev starts
   accessing the disk, and successfully increases the refcount by
   percpu_ref_tryget_live() in blk_mq_queue_enter().
   (atomic counter: 2)

3. In the final stage of initialization, q->mq_usage_counter is being
   switched to percpu mode by percpu_ref_switch_to_percpu() in
   blk_mq_finish_init().  But if CONFIG_PREEMPT_VOLUNTARY is enabled,
   the process is rescheduled in the middle of switching when calling
   wait_event() in __percpu_ref_switch_to_percpu().
   (atomic counter: 2)

4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
   request queue.  q->mq_usage_counter is decreased and marked as
   DEAD.  Wait until all requests have finished.
   (atomic counter: 1)

5. The process rescheduled in step 3 is resumed and finishes
   all remaining work in __percpu_ref_switch_to_percpu().
   A bias value is added to the atomic counter of q->mq_usage_counter.
   (atomic counter: PERCPU_COUNT_BIAS + 1)

6. A request issued in step 2 is finished and q->mq_usage_counter
   is decreased by blk_mq_queue_exit().  q->mq_usage_counter is DEAD,
   so the atomic counter is decreased and no release handler is called.
   (atomic counter: PERCPU_COUNT_BIAS)

7. CPU hotplug handling in step 4 will wait forever, as
   q->mq_usage_counter will never reach zero.

Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions could call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.

Fix those races by serializing these operations with a per-queue mutex.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       | 1 +
 block/blk-mq-sysfs.c   | 2 ++
 block/blk-mq.c         | 8 ++++++++
 include/linux/blkdev.h | 6 ++++++
 4 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d..1e8a41f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -689,6 +689,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	mutex_init(&q->mq_freeze_lock);
 
 	if (blkcg_init_queue(q))
 		goto fail_bdi;
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 189f5ae..ffc2443 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -420,7 +420,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 /* see blk_register_queue() */
 void blk_mq_finish_init(struct request_queue *q)
 {
+	mutex_lock(&q->mq_freeze_lock);
 	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+	mutex_unlock(&q->mq_freeze_lock);
 }
 
 int blk_mq_register_disk(struct gendisk *disk)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a5dbd06..c2bb513 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
+	mutex_lock(&q->mq_freeze_lock);
+
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->mq_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
+
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
@@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
+	mutex_lock(&q->mq_freeze_lock);
+
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->mq_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 19c2e94..6840b6e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -450,6 +450,12 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	/*
+	 * Protect concurrent access to mq_usage_counter by
+	 * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
+	 * percpu_ref_reinit().
+	 */
+	struct mutex		mq_freeze_lock;
 	struct percpu_ref	mq_usage_counter;
 	struct list_head	all_q_node;
 
-- 
1.9.1



* [PATCH v4 7/7] blk-mq: fix deadlock when reading cpu_list
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (5 preceding siblings ...)
  2015-09-26 17:09 ` [PATCH v4 6/7] blk-mq: fix freeze queue race Akinobu Mita
@ 2015-09-26 17:09 ` Akinobu Mita
  2015-09-29  6:58   ` Christoph Hellwig
  2015-09-29 17:31 ` [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Jens Axboe
  7 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-26 17:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Akinobu Mita, Jens Axboe, Ming Lei, Christoph Hellwig, Wanpeng Li

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
entries by blk_mq_sysfs_unregister().  Removing a sysfs entry blocks
until the active reference count of its kernfs_node drops to zero.

On the other hand, reading the blk_mq_hw_sysfs_cpu sysfs entry (e.g.
/sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
blk_mq_hw_sysfs_cpus_show().

If these happen at the same time, a deadlock can occur, because one
side waits for the active reference to reach zero while holding
all_q_mutex, and the other tries to acquire all_q_mutex while holding
the active reference.
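
The dependency cycle, spelled out (a sketch of the two paths):

	/* CPU hotplug path */
	mutex_lock(&all_q_mutex);
	blk_mq_sysfs_unregister(q);
		/* blocks until the kernfs active reference of the
		 * cpu_list attribute drops to zero */

	/* sysfs read path: blk_mq_hw_sysfs_cpus_show() */
	/* kernfs holds the active reference across the ->show() call */
	blk_mq_disable_hotplug();
		/* i.e. mutex_lock(&all_q_mutex) -> deadlock */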

The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
is to avoid reading an incomplete hctx->cpumask.  Since reading a sysfs
entry for blk-mq needs to acquire q->sysfs_lock, we can avoid the
deadlock and avoid reading an incomplete hctx->cpumask by holding
q->sysfs_lock while hctx->cpumask is being updated.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
---
 block/blk-mq-sysfs.c | 4 ----
 block/blk-mq.c       | 7 +++++++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index ffc2443..9cbbb2a 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -229,8 +229,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	unsigned int i, first = 1;
 	ssize_t ret = 0;
 
-	blk_mq_disable_hotplug();
-
 	for_each_cpu(i, hctx->cpumask) {
 		if (first)
 			ret += sprintf(ret + page, "%u", i);
@@ -240,8 +238,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 		first = 0;
 	}
 
-	blk_mq_enable_hotplug();
-
 	ret += sprintf(ret + page, "\n");
 	return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2bb513..d529f05 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1805,6 +1805,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
+	/*
+	 * Avoid others reading an incomplete hctx->cpumask through sysfs
+	 */
+	mutex_lock(&q->sysfs_lock);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
@@ -1824,6 +1829,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
 
+	mutex_unlock(&q->sysfs_lock);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_ctxmap *map = &hctx->ctx_map;
 
-- 
1.9.1



* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-26 17:09 ` [PATCH v4 6/7] blk-mq: fix freeze queue race Akinobu Mita
@ 2015-09-26 17:32   ` Tejun Heo
  2015-09-27 13:06     ` Akinobu Mita
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2015-09-26 17:32 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-kernel, Jens Axboe, Ming Lei, Christoph Hellwig

Hello,

On Sun, Sep 27, 2015 at 02:09:24AM +0900, Akinobu Mita wrote:
> @@ -420,7 +420,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
>  /* see blk_register_queue() */
>  void blk_mq_finish_init(struct request_queue *q)
>  {
> +	mutex_lock(&q->mq_freeze_lock);
>  	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> +	mutex_unlock(&q->mq_freeze_lock);

This looks weird to me.  What can it race against at this point?

> @@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>  {
>  	int freeze_depth;
>  
> +	mutex_lock(&q->mq_freeze_lock);
> +
>  	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);

It doesn't have to be an atomic anymore, right?

Thanks.

-- 
tejun


* Re: [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation
  2015-09-26 17:09 ` [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation Akinobu Mita
@ 2015-09-26 18:12   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-26 18:12 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-kernel, Keith Busch, Jens Axboe, Ming Lei, Christoph Hellwig

Looks good.  I actually just came up with exactly the same fix after
Keith told me to test hotplug for my nvme changes.

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-26 17:32   ` Tejun Heo
@ 2015-09-27 13:06     ` Akinobu Mita
  2015-09-28 14:48       ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-09-27 13:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Jens Axboe, Ming Lei, Christoph Hellwig

Hi Tejun,

2015-09-27 2:32 GMT+09:00 Tejun Heo <tj@kernel.org>:
> Hello,
>
> On Sun, Sep 27, 2015 at 02:09:24AM +0900, Akinobu Mita wrote:
>> @@ -420,7 +420,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
>>  /* see blk_register_queue() */
>>  void blk_mq_finish_init(struct request_queue *q)
>>  {
>> +     mutex_lock(&q->mq_freeze_lock);
>>       percpu_ref_switch_to_percpu(&q->mq_usage_counter);
>> +     mutex_unlock(&q->mq_freeze_lock);
>
> This looks weird to me.  What can it race against at this point?

The possible scenario is described in the commit log (steps 1 to 7).  In summary,
blk_mq_finish_init() and blk_mq_freeze_queue_start() can be executed
at the same time, so this is required to serialize the execution of
percpu_ref_switch_to_percpu() by blk_mq_finish_init() and
percpu_ref_kill() by blk_mq_freeze_queue_start().

>> @@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>>  {
>>       int freeze_depth;
>>
>> +     mutex_lock(&q->mq_freeze_lock);
>> +
>>       freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>
> It doesn't have to be an atomic anymore, right?

Yes, you are right.  I would like to do that in a separate patch in
order to keep each change simple.


* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-27 13:06     ` Akinobu Mita
@ 2015-09-28 14:48       ` Tejun Heo
  2015-09-29  6:55         ` Christoph Hellwig
  2015-09-29 15:01         ` Jens Axboe
  0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2015-09-28 14:48 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: LKML, Jens Axboe, Ming Lei, Christoph Hellwig

Hello,

On Sun, Sep 27, 2015 at 10:06:05PM +0900, Akinobu Mita wrote:
> >>  void blk_mq_finish_init(struct request_queue *q)
> >>  {
> >> +     mutex_lock(&q->mq_freeze_lock);
> >>       percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> >> +     mutex_unlock(&q->mq_freeze_lock);
> >
> > This looks weird to me.  What can it race against at this point?
> 
> The possible scenario is described in the commit log (steps 1 to 7).  In summary,
> blk_mq_finish_init() and blk_mq_freeze_queue_start() can be executed
> at the same time, so this is required to serialize the execution of
> percpu_ref_switch_to_percpu() by blk_mq_finish_init() and
> percpu_ref_kill() by blk_mq_freeze_queue_start().

Ah, you're right.  I was thinking that percpu_ref_switch_to_percpu()
being called after blk_mq_freeze_queue_start() would be buggy, and thus
that the above couldn't be enough, but it is safe as long as the calls
are properly synchronized.  Hmmm... maybe we should add synchronization
to those operations on the percpu_ref side.

Thanks.

-- 
tejun


* Re: [PATCH v4 2/7] blk-mq: fix sysfs registration/unregistration race
  2015-09-26 17:09 ` [PATCH v4 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
@ 2015-09-29  6:51   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-29  6:51 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-kernel, Jens Axboe, Ming Lei

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map
  2015-09-26 17:09 ` [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map Akinobu Mita
@ 2015-09-29  6:52   ` Christoph Hellwig
  2015-10-05 23:50     ` Akinobu Mita
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-29  6:52 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-kernel, Jens Axboe, Ming Lei, Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Can you also add a patch that renames blk_mq_free_queue to
blk_mq_cleanup_queue and adds a comment that we should not free any memory
here?  We had way too many bugs of this kind, unfortunately.


* Re: [PATCH v4 4/7] blk-mq: fix q->mq_usage_counter access race
  2015-09-26 17:09 ` [PATCH v4 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
@ 2015-09-29  6:53   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-29  6:53 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-kernel, Jens Axboe, Ming Lei

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 5/7] blk-mq: avoid inserting requests before establishing new mapping
  2015-09-26 17:09 ` [PATCH v4 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
@ 2015-09-29  6:54   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-29  6:54 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-kernel, Jens Axboe, Ming Lei

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-28 14:48       ` Tejun Heo
@ 2015-09-29  6:55         ` Christoph Hellwig
  2015-09-29 15:01         ` Jens Axboe
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-29  6:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Akinobu Mita, LKML, Jens Axboe, Ming Lei, Christoph Hellwig

On Mon, Sep 28, 2015 at 10:48:39AM -0400, Tejun Heo wrote:
> Ah, you're right.  I was thinking that percpu_ref_switch_to_percpu()
> being called after blk_mq_freeze_queue_start() would be buggy, and thus
> that the above couldn't be enough, but it is safe as long as the calls
> are properly synchronized.  Hmmm... maybe we should add synchronization
> to those operations on the percpu_ref side.

I think that would be very useful, as most users of the code will
have similar issues.


* Re: [PATCH v4 7/7] blk-mq: fix deadlock when reading cpu_list
  2015-09-26 17:09 ` [PATCH v4 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
@ 2015-09-29  6:58   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-09-29  6:58 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-kernel, Jens Axboe, Ming Lei, Wanpeng Li

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-28 14:48       ` Tejun Heo
  2015-09-29  6:55         ` Christoph Hellwig
@ 2015-09-29 15:01         ` Jens Axboe
  2015-09-29 15:03           ` Tejun Heo
  1 sibling, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2015-09-29 15:01 UTC (permalink / raw)
  To: Tejun Heo, Akinobu Mita; +Cc: LKML, Ming Lei, Christoph Hellwig

On 09/28/2015 08:48 AM, Tejun Heo wrote:
> Hello,
>
> On Sun, Sep 27, 2015 at 10:06:05PM +0900, Akinobu Mita wrote:
>>>>   void blk_mq_finish_init(struct request_queue *q)
>>>>   {
>>>> +     mutex_lock(&q->mq_freeze_lock);
>>>>        percpu_ref_switch_to_percpu(&q->mq_usage_counter);
>>>> +     mutex_unlock(&q->mq_freeze_lock);
>>>
>>> This looks weird to me.  What can it race against at this point?
>>
>> The possible scenario is described in the commit log (steps 1 to 7).  In summary,
>> blk_mq_finish_init() and blk_mq_freeze_queue_start() can be executed
>> at the same time, so this is required to serialize the execution of
>> percpu_ref_switch_to_percpu() by blk_mq_finish_init() and
>> percpu_ref_kill() by blk_mq_freeze_queue_start().
>
> Ah, you're right.  I was thinking that percpu_ref_switch_to_percpu()
> being called after blk_mq_freeze_queue_start() would be buggy, and thus
> that the above couldn't be enough, but it is safe as long as the calls
> are properly synchronized.  Hmmm... maybe we should add synchronization
> to those operations on the percpu_ref side.

I think that would be very useful, it seems sort of half-assed if the 
caller side has to provide serialization for that.

-- 
Jens Axboe



* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-29 15:01         ` Jens Axboe
@ 2015-09-29 15:03           ` Tejun Heo
  2015-09-29 21:50             ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2015-09-29 15:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Akinobu Mita, LKML, Ming Lei, Christoph Hellwig

Hello,

On Tue, Sep 29, 2015 at 09:01:31AM -0600, Jens Axboe wrote:
> I think that would be very useful, it seems sort of half-assed if the caller
> side has to provide serialization for that.

Yeah, the thing is that init/exit are usually caller-synchronized, but
percpu_ref's kill/reinit are more of mode-switching operations which
can be performed concurrently during operation, so I think the right
thing to do here is making it synchronize itself.  Will spin a patch.

Thanks.

-- 
tejun


* Re: [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling
  2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (6 preceding siblings ...)
  2015-09-26 17:09 ` [PATCH v4 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
@ 2015-09-29 17:31 ` Jens Axboe
  7 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2015-09-29 17:31 UTC (permalink / raw)
  To: Akinobu Mita, linux-kernel
  Cc: Ming Lei, Tejun Heo, Keith Busch, Christoph Hellwig, Wanpeng Li

On 09/26/2015 11:09 AM, Akinobu Mita wrote:
> This patchset addresses several race conditions on cpu hotplug handling
> for blk-mq.  All problems can be reproduced with the following script.
>
> while true; do
> 	echo 0 > /sys/devices/system/cpu/cpu1/online
> 	echo 1 > /sys/devices/system/cpu/cpu1/online
> done &
>
> while true; do
> 	modprobe -r null_blk
> 	modprobe null_blk queue_mode=2 irqmode=1
> 	sleep 0.1
> done
>
> * Changes from v3
> - Rebased to the latest kernel
> - Add Reviewed-by tags

I've reviewed the patches, looks good to me. I've skipped 6/7, as I 
think we should fix that as part of the percpu ref counting, and not 
handle it specifically in blk-mq. If we can't get that fix before 4.3 
wraps up, then we can revisit and put that in as well.

Thanks!

-- 
Jens Axboe



* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-29 15:03           ` Tejun Heo
@ 2015-09-29 21:50             ` Tejun Heo
  2015-09-30 10:32               ` Akinobu Mita
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2015-09-29 21:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Akinobu Mita, LKML, Ming Lei, Christoph Hellwig

Hello,

On Tue, Sep 29, 2015 at 11:03:46AM -0400, Tejun Heo wrote:
> On Tue, Sep 29, 2015 at 09:01:31AM -0600, Jens Axboe wrote:
> > I think that would be very useful, it seems sort of half-assed if the caller
> > side has to provide serialization for that.
> 
> Yeah, the thing is that init/exit are usually caller-synchronized, but
> percpu_ref's kill/reinit are more of mode-switching operations which
> can be performed concurrently during operation, so I think the right
> thing to do here is making it synchronize itself.  Will spin a patch.

Patchset posted

 http://lkml.kernel.org/g/1443563240-29306-1-git-send-email-tj@kernel.org

Unfortunately, the required changes on the percpu_ref side ended up
somewhat invasive and don't seem fit for a -stable backport.  I think
it'd be a good idea to apply Akinobu's patch for now and then revert
it later when the percpu_ref changes hit mainline.

Thanks.

-- 
tejun


* Re: [PATCH v4 6/7] blk-mq: fix freeze queue race
  2015-09-29 21:50             ` Tejun Heo
@ 2015-09-30 10:32               ` Akinobu Mita
  0 siblings, 0 replies; 25+ messages in thread
From: Akinobu Mita @ 2015-09-30 10:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, LKML, Ming Lei, Christoph Hellwig

2015-09-30 6:50 GMT+09:00 Tejun Heo <tj@kernel.org>:
> Hello,
>
> On Tue, Sep 29, 2015 at 11:03:46AM -0400, Tejun Heo wrote:
>> On Tue, Sep 29, 2015 at 09:01:31AM -0600, Jens Axboe wrote:
>> > I think that would be very useful, it seems sort of half-assed if the caller
>> > side has to provide serialization for that.
>>
>> Yeah, the thing is that init/exit are usually caller-synchronized, but
>> percpu_ref's kill/reinit are more of mode-switching operations which
>> can be performed concurrently during operation, so I think the right
>> thing to do here is making it synchronize itself.  Will spin a patch.
>
> Patchset posted
>
>  http://lkml.kernel.org/g/1443563240-29306-1-git-send-email-tj@kernel.org

Thanks.  So far it's working well without any problems.

Note that we still need part of patch 6/7.
Quoting from that patch's description:

        When unfreezing queue, there is a small window between decrementing
        q->mq_freeze_depth to zero and percpu_ref_reinit() call with
        q->mq_usage_counter.  If the other calls blk_mq_freeze_queue_start()
        in the window, q->mq_freeze_depth is increased from zero to one and
        percpu_ref_kill() is called with q->mq_usage_counter which is already
        killed.  percpu refcount should be re-initialized before killed again.

So with your percpu_ref patchset we no longer need to protect
percpu_ref_switch_to_percpu() in blk_mq_finish_init(), but we still
need to serialize blk_mq_freeze_queue_start() and blk_mq_unfreeze_queue().
(As you suggested earlier in this thread, q->mq_freeze_depth doesn't
have to be atomic_t anymore.)


* Re: [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map
  2015-09-29  6:52   ` Christoph Hellwig
@ 2015-10-05 23:50     ` Akinobu Mita
  2015-10-06  9:43       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Akinobu Mita @ 2015-10-05 23:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: LKML, Jens Axboe, Ming Lei

2015-09-29 15:52 GMT+09:00 Christoph Hellwig <hch@lst.de>:
> Looks good,
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> Can you also add a patch that renames blk_mq_free_queue to
> blk_mq_cleanup_queue and adds a comment that we should not free any memory
> here?  We had way too many bugs of this kind, unfortunately.

Renaming blk_mq_free_queue to blk_mq_cleanup_queue sounds good, because
it is called from blk_cleanup_queue().

How about adding a comment like the one below?

/*
 * The resources allocated by blk_mq_init_allocated_queue() are released
 * by blk_mq_cleanup_queue() and blk_mq_release().
 *
 * blk_mq_cleanup_queue() is called from blk_cleanup_queue(), so
 * the resources which may be used after blk_cleanup_queue() shouldn't
 * be released here.  Instead, those should be released by blk_mq_release()
 * which is called from blk_release_queue().
 *
 * For example, the loop and md drivers call del_gendisk() after
 * blk_cleanup_queue(), so the resources used when accessing the sysfs
 * entries for blk-mq shouldn't be released by blk_mq_cleanup_queue(), as
 * these sysfs entries can still be accessed until del_gendisk() is called.
 */


* Re: [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map
  2015-10-05 23:50     ` Akinobu Mita
@ 2015-10-06  9:43       ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-10-06  9:43 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Christoph Hellwig, LKML, Jens Axboe, Ming Lei

On Tue, Oct 06, 2015 at 08:50:47AM +0900, Akinobu Mita wrote:
> 2015-09-29 15:52 GMT+09:00 Christoph Hellwig <hch@lst.de>:
> > Looks good,
> >
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> >
> > Can you also add a patch that renames blk_mq_free_queue to
> > blk_mq_cleanup_queue and adds a comment that we should not free any memory
> > here?  We had way too many bugs of this kind, unfortunately.
> 
> Renaming blk_mq_free_queue to blk_mq_cleanup_queue sounds good, because
> it is called from blk_cleanup_queue().
> 
> How about adding a comment like the one below?

This looks great.  It's a lot more than what I thought, but it looks
correct and useful!


Thread overview: 25+ messages
2015-09-26 17:09 [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
2015-09-26 17:09 ` [PATCH v4 1/7] blk-mq: avoid setting hctx->tags->cpumask before allocation Akinobu Mita
2015-09-26 18:12   ` Christoph Hellwig
2015-09-26 17:09 ` [PATCH v4 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
2015-09-29  6:51   ` Christoph Hellwig
2015-09-26 17:09 ` [PATCH v4 3/7] blk-mq: Fix use-after-free of q->mq_map Akinobu Mita
2015-09-29  6:52   ` Christoph Hellwig
2015-10-05 23:50     ` Akinobu Mita
2015-10-06  9:43       ` Christoph Hellwig
2015-09-26 17:09 ` [PATCH v4 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
2015-09-29  6:53   ` Christoph Hellwig
2015-09-26 17:09 ` [PATCH v4 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
2015-09-29  6:54   ` Christoph Hellwig
2015-09-26 17:09 ` [PATCH v4 6/7] blk-mq: fix freeze queue race Akinobu Mita
2015-09-26 17:32   ` Tejun Heo
2015-09-27 13:06     ` Akinobu Mita
2015-09-28 14:48       ` Tejun Heo
2015-09-29  6:55         ` Christoph Hellwig
2015-09-29 15:01         ` Jens Axboe
2015-09-29 15:03           ` Tejun Heo
2015-09-29 21:50             ` Tejun Heo
2015-09-30 10:32               ` Akinobu Mita
2015-09-26 17:09 ` [PATCH v4 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
2015-09-29  6:58   ` Christoph Hellwig
2015-09-29 17:31 ` [PATCH v4 0/7] blk-mq: fix race conditions on cpu hotplug handling Jens Axboe
