* [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling
@ 2015-07-18 16:28 Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation Akinobu Mita
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei, Tejun Heo, Keith Busch

This patchset addresses several race conditions in the cpu hotplug handling
for blk-mq.  All of these problems can be reproduced with the following script.

while true; do
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
done &

while true; do
	modprobe -r null_blk
	modprobe null_blk queue_mode=2 irqmode=1
	sleep 0.1
done

* Changes from v2
- Add regression fix for hctx->tags->cpumask
- Remove BLK_MQ_F_SYSFS_UP per-hctx flag and use mq_sysfs_init_done
  per-queue flag instead with appropriate locking in order to keep
  track of 'mq' sysfs entry's life
- Add comments on non-trivial parts, as suggested by Ming

* Changes from v1
- Release q->mq_map in blk_mq_release()
- Fix deadlock when reading cpu_list
- Fix race between freeze and unfreeze

Akinobu Mita (7):
  blk-mq: avoid access hctx->tags->cpumask before allocation
  blk-mq: fix sysfs registration/unregistration race
  blk-mq: Fix use-after-free of q->mq_map
  blk-mq: fix q->mq_usage_counter access race
  blk-mq: avoid inserting requests before establishing new mapping
  blk-mq: fix freeze queue race
  blk-mq: fix deadlock when reading cpu_list

 block/blk-core.c       |   1 +
 block/blk-mq-cpumap.c  |   9 +++--
 block/blk-mq-sysfs.c   |  36 ++++++++++++------
 block/blk-mq.c         | 100 +++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h         |   3 +-
 include/linux/blk-mq.h |   1 -
 include/linux/blkdev.h |   8 ++++
 7 files changed, 116 insertions(+), 42 deletions(-)

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
-- 
1.9.1



* [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-19 10:24   ` Ming Lei
  2015-07-18 16:28 ` [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Keith Busch, Jens Axboe, Ming Lei

When an unmapped hw queue is remapped after the CPU topology has changed,
hctx->tags->cpumask is set before hctx->tags is allocated in
blk_mq_map_swqueue().

In order to fix this null pointer dereference, hctx->tags must be
allocated before hctx->tags->cpumask is configured.

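As a simplified sketch (not the exact code), the ordering that
blk_mq_map_swqueue() has to follow looks like this: the per-hw-queue
pass that (re)allocates the tags for a remapped hw queue must run
before any pass that touches hctx->tags->cpumask.

	/* per-hw-queue pass: tags for a remapped hw queue are (re)allocated */
	queue_for_each_hw_ctx(q, hctx, i) {
		if (!set->tags[i])
			set->tags[i] = blk_mq_init_rq_map(set, i);
		hctx->tags = set->tags[i];
		/* ... */
	}

	/* only now is hctx->tags known to be valid */
	queue_for_each_ctx(q, ctx, i) {
		if (!cpu_online(i))
			continue;

		hctx = q->mq_ops->map_queue(q, i);
		cpumask_set_cpu(i, hctx->tags->cpumask);
	}
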
Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..f29f766 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		hctx->next_cpu = cpumask_first(hctx->cpumask);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
+
+	queue_for_each_ctx(q, ctx, i) {
+		if (!cpu_online(i))
+			continue;
+
+		hctx = q->mq_ops->map_queue(q, i);
+		cpumask_set_cpu(i, hctx->tags->cpumask);
+	}
 }
 
 static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
-- 
1.9.1



* [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-19 11:47   ` Ming Lei
  2015-07-18 16:28 ` [PATCH v3 3/7] blk-mq: Fix use-after-free of q->mq_map Akinobu Mita
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei

There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

null_add_dev
    --> blk_mq_init_queue
        --> blk_mq_init_allocated_queue
            --> add to 'all_q_list' (*)
    --> add_disk
        --> blk_register_queue
            --> blk_mq_register_disk (++)

null_del_dev
    --> del_gendisk
        --> blk_unregister_queue
            --> blk_mq_unregister_disk (--)
    --> blk_cleanup_queue
        --> blk_mq_free_queue
            --> del from 'all_q_list' (*)

blk_mq_queue_reinit
    --> blk_mq_sysfs_unregister (-)
    --> blk_mq_sysfs_register (+)

While the request queue is on 'all_q_list' (*), blk_mq_queue_reinit()
can be called for the queue at any time by the CPU hotplug callback.
But blk_mq_sysfs_unregister (-) and blk_mq_sysfs_register (+) in
blk_mq_queue_reinit must not be called before blk_mq_register_disk (++)
has run or after blk_mq_unregister_disk (--) has finished, because
'/sys/block/*/mq/' does not exist at those points.

There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can be
used to track this sysfs state, but it only fixes this issue
partially.

In order to fix it completely, we need a per-queue flag with
appropriate locking instead of a per-hctx flag.  This patch introduces
q->mq_sysfs_init_done, which is properly protected with all_q_mutex.

Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex held.  Since hctx->nr_ctx is temporarily reset and then
updated in blk_mq_map_swqueue(), we should prevent
blk_mq_register_hctx() from seeing the temporary hctx->nr_ctx value
during CPU hotplug handling or while adding/deleting a gendisk.

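As a minimal sketch (the actual code is in the diff below), the new
per-queue flag gates the hotplug-side helpers like this:

	blk_mq_register_disk()      /* sets q->mq_sysfs_init_done on success */
	blk_mq_unregister_disk()    /* clears q->mq_sysfs_init_done */

	blk_mq_sysfs_register() / blk_mq_sysfs_unregister():
		if (!q->mq_sysfs_init_done)
			return;	/* '/sys/block/*/mq/' does not exist (yet or anymore) */
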
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq-sysfs.c   | 30 ++++++++++++++++++++++--------
 block/blk-mq.c         |  6 +++---
 include/linux/blk-mq.h |  1 -
 include/linux/blkdev.h |  2 ++
 4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b79685e..79096a6 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -332,7 +332,7 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_ctx *ctx;
 	int i;
 
-	if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
+	if (!hctx->nr_ctx)
 		return;
 
 	hctx_for_each_ctx(hctx, ctx, i)
@@ -347,7 +347,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_ctx *ctx;
 	int i, ret;
 
-	if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
+	if (!hctx->nr_ctx)
 		return 0;
 
 	ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
@@ -370,6 +370,8 @@ void blk_mq_unregister_disk(struct gendisk *disk)
 	struct blk_mq_ctx *ctx;
 	int i, j;
 
+	blk_mq_disable_hotplug();
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		blk_mq_unregister_hctx(hctx);
 
@@ -384,6 +386,9 @@ void blk_mq_unregister_disk(struct gendisk *disk)
 	kobject_put(&q->mq_kobj);
 
 	kobject_put(&disk_to_dev(disk)->kobj);
+
+	q->mq_sysfs_init_done = false;
+	blk_mq_enable_hotplug();
 }
 
 static void blk_mq_sysfs_init(struct request_queue *q)
@@ -414,27 +419,30 @@ int blk_mq_register_disk(struct gendisk *disk)
 	struct blk_mq_hw_ctx *hctx;
 	int ret, i;
 
+	blk_mq_disable_hotplug();
+
 	blk_mq_sysfs_init(q);
 
 	ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
 	if (ret < 0)
-		return ret;
+		goto out;
 
 	kobject_uevent(&q->mq_kobj, KOBJ_ADD);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->flags |= BLK_MQ_F_SYSFS_UP;
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
 			break;
 	}
 
-	if (ret) {
+	if (ret)
 		blk_mq_unregister_disk(disk);
-		return ret;
-	}
+	else
+		q->mq_sysfs_init_done = true;
+out:
+	blk_mq_enable_hotplug();
 
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_mq_register_disk);
 
@@ -443,6 +451,9 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	if (!q->mq_sysfs_init_done)
+		return;
+
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
 }
@@ -452,6 +463,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i, ret = 0;
 
+	if (!q->mq_sysfs_init_done)
+		return ret;
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f29f766..68921b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2045,13 +2045,13 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		goto err_hctxs;
 
 	mutex_lock(&all_q_mutex);
-	list_add_tail(&q->all_q_node, &all_q_list);
-	mutex_unlock(&all_q_mutex);
 
+	list_add_tail(&q->all_q_node, &all_q_list);
 	blk_mq_add_queue_tag_set(set, q);
-
 	blk_mq_map_swqueue(q);
 
+	mutex_unlock(&all_q_mutex);
+
 	return q;
 
 err_hctxs:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 37d1602..b80ba45 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -145,7 +145,6 @@ enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
-	BLK_MQ_F_SYSFS_UP	= 1 << 3,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c1..b02c90b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -462,6 +462,8 @@ struct request_queue {
 
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
+
+	bool			mq_sysfs_init_done;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
1.9.1



* [PATCH v3 3/7] blk-mq: Fix use-after-free of q->mq_map
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
q->mq_map with blk_mq_update_queue_map() for all request queues in
all_q_list.  On the other hand, q->mq_map is released before the queue
is deleted from all_q_list.

So if a CPU hotplug event occurs in that window, an invalid memory
access can happen.  Fix it by releasing q->mq_map in blk_mq_release()
so that it happens later than the removal from all_q_list.

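One interleaving that hits the window, as a simplified illustration:

    blk_cleanup_queue() path              CPU hotplug notifier
    ------------------------              ---------------------------------
    blk_mq_free_queue(q)
      kfree(q->mq_map)
                                          blk_mq_queue_reinit(q)
                                            blk_mq_update_queue_map(q->mq_map, ...)
                                              /* q->mq_map already freed */
      list_del_init(&q->all_q_node)
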
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 68921b7..9328405 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1935,6 +1935,9 @@ void blk_mq_release(struct request_queue *q)
 		kfree(hctx);
 	}
 
+	kfree(q->mq_map);
+	q->mq_map = NULL;
+
 	kfree(q->queue_hw_ctx);
 
 	/* ctx kobj stays in queue_ctx */
@@ -2080,11 +2083,6 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_free_hw_queues(q, set);
 
 	percpu_ref_exit(&q->mq_usage_counter);
-
-	kfree(q->mq_map);
-
-	q->mq_map = NULL;
-
 	mutex_lock(&all_q_mutex);
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
-- 
1.9.1



* [PATCH v3 4/7] blk-mq: fix q->mq_usage_counter access race
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (2 preceding siblings ...)
  2015-07-18 16:28 ` [PATCH v3 3/7] blk-mq: Fix use-after-free of q->mq_map Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
q->mq_usage_counter while freezing all request queues in all_q_list.
On the other hand, q->mq_usage_counter is deinitialized in
blk_mq_free_queue() before the queue is deleted from all_q_list.

So if a CPU hotplug event occurs in that window, percpu_ref_kill() is
called on q->mq_usage_counter, which has already been marked dead, and
it triggers a warning.  Fix it by deleting the queue from all_q_list
before destroying q->mq_usage_counter.

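The window looks roughly like this (simplified):

    queue teardown path                       CPU hotplug notifier
    -----------------------------------       ---------------------------------
    blk_mq_free_queue(q)
      percpu_ref_exit(&q->mq_usage_counter)
                                              freezes every queue on all_q_list,
                                              q is still on the list:
                                                percpu_ref_kill(&q->mq_usage_counter)
                                                  /* ref already marked dead
                                                     -> warning */
      list_del_init(&q->all_q_node)
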
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9328405..d5d93e0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2077,15 +2077,16 @@ void blk_mq_free_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
+	mutex_lock(&all_q_mutex);
+	list_del_init(&q->all_q_node);
+	mutex_unlock(&all_q_mutex);
+
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
 
 	percpu_ref_exit(&q->mq_usage_counter);
-	mutex_lock(&all_q_mutex);
-	list_del_init(&q->all_q_node);
-	mutex_unlock(&all_q_mutex);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-- 
1.9.1



* [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (3 preceding siblings ...)
  2015-07-18 16:28 ` [PATCH v3 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-19 12:12   ` Ming Lei
  2015-07-18 16:28 ` [PATCH v3 6/7] blk-mq: fix freeze queue race Akinobu Mita
  2015-07-18 16:28 ` [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
  6 siblings, 1 reply; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei

Notifier callbacks for the CPU_ONLINE action can run on a CPU other
than the one which was just onlined.  So it is possible for a process
running on the just-onlined CPU to insert a request and run the hw
queue before the new mapping is established by
blk_mq_queue_reinit_notify().

This can cause a problem when the CPU has just been onlined for the
first time since the request queue was initialized.  At this point,
ctx->index_hw for the CPU, which is the index into hctx->ctxs[] for
this ctx, is still zero before blk_mq_queue_reinit_notify() is called
by the notifier callbacks for the CPU_ONLINE action.

For example, there is a single hw queue (hctx) and two CPU queues
(ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined, a
request is inserted into ctx1->rq_list, and bit0 is set in the pending
bitmap because ctx1->index_hw is still zero.

And then while running the hw queue, flush_busy_ctxs() finds that
bit0 is set in the pending bitmap and tries to retrieve requests from
hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so
the request in ctx1->rq_list is ignored.

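A simplified sketch of the lookup that goes wrong (not the exact
flush_busy_ctxs() code); the pending bit is assumed to equal
ctx->index_hw, so a bit set before the new mapping is established
points at the wrong software queue:

	/* for each bit set in the pending bitmap of this hctx */
	for_each_set_bit(bit, ...) {
		struct blk_mq_ctx *ctx = hctx->ctxs[bit]; /* expects bit == ctx->index_hw */

		/*
		 * splice ctx->rq_list for dispatch; with the stale index
		 * this finds ctx0->rq_list instead of ctx1->rq_list
		 */
		list_splice_tail_init(&ctx->rq_list, &rq_list);
	}
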
Fix it by ensuring that the new mapping is established before the
onlined CPU starts running.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq-cpumap.c |  9 ++++----
 block/blk-mq.c        | 59 +++++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h        |  3 ++-
 3 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1e28ddb..8764c24 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
+int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+			    const struct cpumask *online_mask)
 {
 	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
 	cpumask_var_t cpus;
@@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
 
 	cpumask_clear(cpus);
 	nr_cpus = nr_uniq_cpus = 0;
-	for_each_online_cpu(i) {
+	for_each_cpu(i, online_mask) {
 		nr_cpus++;
 		first_sibling = get_first_sibling(i);
 		if (!cpumask_test_cpu(first_sibling, cpus))
@@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
 
 	queue = 0;
 	for_each_possible_cpu(i) {
-		if (!cpu_online(i)) {
+		if (!cpumask_test_cpu(i, online_mask)) {
 			map[i] = 0;
 			continue;
 		}
@@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
 	if (!map)
 		return NULL;
 
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
+	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
 		return map;
 
 	kfree(map);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d5d93e0..d861c70 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1799,7 +1799,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 	}
 }
 
-static void blk_mq_map_swqueue(struct request_queue *q)
+static void blk_mq_map_swqueue(struct request_queue *q,
+			       const struct cpumask *online_mask)
 {
 	unsigned int i;
 	struct blk_mq_hw_ctx *hctx;
@@ -1816,7 +1817,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	 */
 	queue_for_each_ctx(q, ctx, i) {
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
-		if (!cpu_online(i))
+		if (!cpumask_test_cpu(i, online_mask))
 			continue;
 
 		hctx = q->mq_ops->map_queue(q, i);
@@ -1862,7 +1863,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 
 	queue_for_each_ctx(q, ctx, i) {
-		if (!cpu_online(i))
+		if (!cpumask_test_cpu(i, online_mask))
 			continue;
 
 		hctx = q->mq_ops->map_queue(q, i);
@@ -2047,13 +2048,15 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (blk_mq_init_hw_queues(q, set))
 		goto err_hctxs;
 
+	get_online_cpus();
 	mutex_lock(&all_q_mutex);
 
 	list_add_tail(&q->all_q_node, &all_q_list);
 	blk_mq_add_queue_tag_set(set, q);
-	blk_mq_map_swqueue(q);
+	blk_mq_map_swqueue(q, cpu_online_mask);
 
 	mutex_unlock(&all_q_mutex);
+	put_online_cpus();
 
 	return q;
 
@@ -2090,13 +2093,14 @@ void blk_mq_free_queue(struct request_queue *q)
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q)
+static void blk_mq_queue_reinit(struct request_queue *q,
+				const struct cpumask *online_mask)
 {
 	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
 
 	blk_mq_sysfs_unregister(q);
 
-	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
+	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
 
 	/*
 	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
@@ -2104,7 +2108,7 @@ static void blk_mq_queue_reinit(struct request_queue *q)
 	 * involves free and re-allocate memory, worthy doing?)
 	 */
 
-	blk_mq_map_swqueue(q);
+	blk_mq_map_swqueue(q, online_mask);
 
 	blk_mq_sysfs_register(q);
 }
@@ -2113,16 +2117,43 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
 				      unsigned long action, void *hcpu)
 {
 	struct request_queue *q;
+	int cpu = (unsigned long)hcpu;
+	/*
+	 * New online cpumask which is going to be set in this hotplug event.
+	 * Declare this cpumasks as global as cpu-hotplug operation is invoked
+	 * one-by-one and dynamically allocating this could result in a failure.
+	 */
+	static struct cpumask online_new;
 
 	/*
-	 * Before new mappings are established, hotadded cpu might already
-	 * start handling requests. This doesn't break anything as we map
-	 * offline CPUs to first hardware queue. We will re-init the queue
-	 * below to get optimal settings.
+	 * Before hotadded cpu starts handling requests, new mappings must
+	 * be established.  Otherwise, these requests in hw queue might
+	 * never be dispatched.
+	 *
+	 * For example, there is a single hw queue (hctx) and two CPU queues
+	 * (ctx0 for CPU0, and ctx1 for CPU1).
+	 *
+	 * Now CPU1 is just onlined and a request is inserted into
+	 * ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
+	 * still zero.
+	 *
+	 * And then while running hw queue, flush_busy_ctxs() finds bit0 is
+	 * set in pending bitmap and tries to retrieve requests in
+	 * hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0,
+	 * so the request in ctx1->rq_list is ignored.
 	 */
-	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
-	    action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		cpumask_copy(&online_new, cpu_online_mask);
+		break;
+	case CPU_UP_PREPARE:
+		cpumask_copy(&online_new, cpu_online_mask);
+		cpumask_set_cpu(cpu, &online_new);
+		break;
+	default:
 		return NOTIFY_OK;
+	}
 
 	mutex_lock(&all_q_mutex);
 
@@ -2146,7 +2177,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
 	}
 
 	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_mq_queue_reinit(q);
+		blk_mq_queue_reinit(q, &online_new);
 
 	list_for_each_entry(q, &all_q_list, all_q_node)
 		blk_mq_unfreeze_queue(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6a48c4c..f4fea79 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
  * CPU -> queue mappings
  */
 extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
+extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
+				   const struct cpumask *online_mask);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 
 /*
-- 
1.9.1



* [PATCH v3 6/7] blk-mq: fix freeze queue race
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (4 preceding siblings ...)
  2015-07-18 16:28 ` [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-19 12:12   ` Ming Lei
  2015-07-18 16:28 ` [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
  6 siblings, 1 reply; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei, Tejun Heo

There are several race conditions while freezing a queue.

When unfreezing a queue, there is a small window between decrementing
q->mq_freeze_depth to zero and the percpu_ref_reinit() call on
q->mq_usage_counter.  If another context calls
blk_mq_freeze_queue_start() in that window, q->mq_freeze_depth is
increased from zero to one and percpu_ref_kill() is called on
q->mq_usage_counter, which has already been killed.  The percpu
refcount should be re-initialized before being killed again.

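One possible interleaving of that window (simplified):

    blk_mq_unfreeze_queue()                   blk_mq_freeze_queue_start()
    ----------------------------------        ----------------------------------
    atomic_dec_return(&q->mq_freeze_depth)
      returns 0
                                              atomic_inc_return(&q->mq_freeze_depth)
                                                returns 1
                                              percpu_ref_kill(&q->mq_usage_counter)
                                                /* still killed, not yet
                                                   re-initialized */
    percpu_ref_reinit(&q->mq_usage_counter)
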
Also, there is a race condition while switching to percpu mode.
percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time, as the following scenario is possible:

1. q->mq_usage_counter is initialized in atomic mode.
   (atomic counter: 1)

2. After the disk registration, a process like systemd-udev starts
   accessing the disk, and successfully increases the refcount
   by percpu_ref_tryget_live() in blk_mq_queue_enter().
   (atomic counter: 2)

3. In the final stage of initialization, q->mq_usage_counter is being
   switched to percpu mode by percpu_ref_switch_to_percpu() in
   blk_mq_finish_init().  But if CONFIG_PREEMPT_VOLUNTARY is enabled,
   the process is rescheduled in the middle of switching when calling
   wait_event() in __percpu_ref_switch_to_percpu().
   (atomic counter: 2)

4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
   the request queue.  q->mq_usage_counter is decreased and marked as
   DEAD.  It then waits until all requests have finished.
   (atomic counter: 1)

5. The process rescheduled in step 3 is resumed and finishes
   all remaining work in __percpu_ref_switch_to_percpu().
   A bias value is added to the atomic counter of q->mq_usage_counter.
   (atomic counter: PERCPU_COUNT_BIAS + 1)

6. The request issued in step 2 is finished and q->mq_usage_counter
   is decreased by blk_mq_queue_exit().  q->mq_usage_counter is DEAD,
   so the atomic counter is decreased and no release handler is called.
   (atomic counter: PERCPU_COUNT_BIAS)

7. The CPU hotplug handling from step 4 will wait forever as
   q->mq_usage_counter will never reach zero.

Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions could call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.

Fix those races by serializing these operations with a per-queue
mutex.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c       | 1 +
 block/blk-mq-sysfs.c   | 2 ++
 block/blk-mq.c         | 8 ++++++++
 include/linux/blkdev.h | 6 ++++++
 4 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c..544b237 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	mutex_init(&q->mq_freeze_lock);
 
 	if (blkcg_init_queue(q))
 		goto fail_bdi;
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 79096a6..f63b464 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -409,7 +409,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 /* see blk_register_queue() */
 void blk_mq_finish_init(struct request_queue *q)
 {
+	mutex_lock(&q->mq_freeze_lock);
 	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+	mutex_unlock(&q->mq_freeze_lock);
 }
 
 int blk_mq_register_disk(struct gendisk *disk)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d861c70..b931e38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
+	mutex_lock(&q->mq_freeze_lock);
+
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->mq_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
+
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
@@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
+	mutex_lock(&q->mq_freeze_lock);
+
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->mq_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b02c90b..b867c32 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -457,6 +457,12 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	/*
+	 * Protect concurrent access to mq_usage_counter by
+	 * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
+	 * percpu_ref_reinit().
+	 */
+	struct mutex		mq_freeze_lock;
 	struct percpu_ref	mq_usage_counter;
 	struct list_head	all_q_node;
 
-- 
1.9.1



* [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list
  2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
                   ` (5 preceding siblings ...)
  2015-07-18 16:28 ` [PATCH v3 6/7] blk-mq: fix freeze queue race Akinobu Mita
@ 2015-07-18 16:28 ` Akinobu Mita
  2015-07-19 12:30   ` Ming Lei
  2015-07-28  8:10   ` Wanpeng Li
  6 siblings, 2 replies; 17+ messages in thread
From: Akinobu Mita @ 2015-07-18 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Akinobu Mita, Jens Axboe, Ming Lei

CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
entries with blk_mq_sysfs_unregister().  Removing a sysfs entry blocks
until the active reference of its kernfs_node drops to zero.

On the other hand, reading the blk_mq_hw_sysfs_cpu sysfs entry (e.g.
/sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
blk_mq_hw_sysfs_cpus_show().

If these happen at the same time, a deadlock can occur, because one
side waits for the active reference to drop to zero while holding
all_q_mutex, and the other tries to acquire all_q_mutex while holding
the active reference.

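In other words (simplified):

    CPU hotplug path                          cpu_list read path
    ----------------------------------        ----------------------------------
    mutex_lock(&all_q_mutex)
                                              active reference of the cpu_list
                                              kernfs_node is taken
    blk_mq_sysfs_unregister(q)
      waits for the active reference
      to drop to zero
                                              blk_mq_hw_sysfs_cpus_show()
                                                tries to take all_q_mutex
                                                -> blocks forever
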
The reason all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show() is
to avoid reading an incomplete hctx->cpumask.  Since reading a blk-mq
sysfs entry already needs to acquire q->sysfs_lock, we can avoid the
deadlock and reading an incomplete hctx->cpumask by holding
q->sysfs_lock while hctx->cpumask is being updated.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq-sysfs.c | 4 ----
 block/blk-mq.c       | 7 +++++++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index f63b464..e0f71bf 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	unsigned int i, first = 1;
 	ssize_t ret = 0;
 
-	blk_mq_disable_hotplug();
-
 	for_each_cpu(i, hctx->cpumask) {
 		if (first)
 			ret += sprintf(ret + page, "%u", i);
@@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 		first = 0;
 	}
 
-	blk_mq_enable_hotplug();
-
 	ret += sprintf(ret + page, "\n");
 	return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b931e38..1a5e7d1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
+	/*
+	 * Avoid others reading incomplete hctx->cpumask through sysfs
+	 */
+	mutex_lock(&q->sysfs_lock);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
@@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
 
+	mutex_unlock(&q->sysfs_lock);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_ctxmap *map = &hctx->ctx_map;
 
-- 
1.9.1



* Re: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation
  2015-07-18 16:28 ` [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation Akinobu Mita
@ 2015-07-19 10:24   ` Ming Lei
  2015-07-21 13:58     ` Akinobu Mita
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2015-07-19 10:24 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Linux Kernel Mailing List, Keith Busch, Jens Axboe

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> When unmapped hw queue is remapped after CPU topology is changed,
> hctx->tags->cpumask is set before hctx->tags is allocated in
> blk_mq_map_swqueue().
>
> In order to fix this null pointer dereference, hctx->tags must be
> allocated before configuring hctx->tags->cpumask.

The root cause should be that the mapping between hctx and ctx
can change after the CPU topology changes, so hctx->tags can
change too, and hctx->tags->cpumask therefore has to be set after
hctx->tags is set up.

>
> Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")

I am wondering whether the above commit considers CPU hotplug;
nvme uses tags->cpumask to set the irq affinity hint only while
starting the queue.  It looks reasonable to introduce a
mapping_changed() callback for handling this kind of thing, but
that isn't related to this patch.

> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/blk-mq.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db..f29f766 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>
>                 hctx = q->mq_ops->map_queue(q, i);
>                 cpumask_set_cpu(i, hctx->cpumask);
> -               cpumask_set_cpu(i, hctx->tags->cpumask);
>                 ctx->index_hw = hctx->nr_ctx;
>                 hctx->ctxs[hctx->nr_ctx++] = ctx;
>         }
> @@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>                 hctx->next_cpu = cpumask_first(hctx->cpumask);
>                 hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>         }
> +
> +       queue_for_each_ctx(q, ctx, i) {
> +               if (!cpu_online(i))
> +                       continue;
> +
> +               hctx = q->mq_ops->map_queue(q, i);
> +               cpumask_set_cpu(i, hctx->tags->cpumask);

If tags->cpumask is always the same as hctx->cpumask, this
CPU iteration can be avoided.

> +       }
>  }
>
>  static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
> --
> 1.9.1
>



-- 
Ming Lei


* Re: [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race
  2015-07-18 16:28 ` [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
@ 2015-07-19 11:47   ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2015-07-19 11:47 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Linux Kernel Mailing List, Jens Axboe

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> There is a race between cpu hotplug handling and adding/deleting
> gendisk for blk-mq, where both are trying to register and unregister
> the same sysfs entries.
>
> null_add_dev
>     --> blk_mq_init_queue
>         --> blk_mq_init_allocated_queue
>             --> add to 'all_q_list' (*)
>     --> add_disk
>         --> blk_register_queue
>             --> blk_mq_register_disk (++)
>
> null_del_dev
>     --> del_gendisk
>         --> blk_unregister_queue
>             --> blk_mq_unregister_disk (--)
>     --> blk_cleanup_queue
>         --> blk_mq_free_queue
>             --> del from 'all_q_list' (*)
>
> blk_mq_queue_reinit
>     --> blk_mq_sysfs_unregister (-)
>     --> blk_mq_sysfs_register (+)
>
> While the request queue is added to 'all_q_list' (*),
> blk_mq_queue_reinit() can be called for the queue anytime by CPU
> hotplug callback.  But blk_mq_sysfs_unregister (-) and
> blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
> before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
> is finished.  Because '/sys/block/*/mq/' does not exist.
>
> There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
> be used to track these sysfs stuff, but it is only fixing this issue
> partially.
>
> In order to fix it completely, we just need per-queue flag instead of
> per-hctx flag with appropriate locking.  So this introduces
> q->mq_sysfs_init_done which is properly protected with all_q_mutex.
>
> Also, we need to ensure that blk_mq_map_swqueue() is called with
> all_q_mutex is held.  Since hctx->nr_ctx is reset temporarily and
> updated in blk_mq_map_swqueue(), so we should avoid
> blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
> in CPU hotplug handling or adding/deleting gendisk .
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

> ---
>  block/blk-mq-sysfs.c   | 30 ++++++++++++++++++++++--------
>  block/blk-mq.c         |  6 +++---
>  include/linux/blk-mq.h |  1 -
>  include/linux/blkdev.h |  2 ++
>  4 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index b79685e..79096a6 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -332,7 +332,7 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
>         struct blk_mq_ctx *ctx;
>         int i;
>
> -       if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
> +       if (!hctx->nr_ctx)
>                 return;
>
>         hctx_for_each_ctx(hctx, ctx, i)
> @@ -347,7 +347,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
>         struct blk_mq_ctx *ctx;
>         int i, ret;
>
> -       if (!hctx->nr_ctx || !(hctx->flags & BLK_MQ_F_SYSFS_UP))
> +       if (!hctx->nr_ctx)
>                 return 0;
>
>         ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
> @@ -370,6 +370,8 @@ void blk_mq_unregister_disk(struct gendisk *disk)
>         struct blk_mq_ctx *ctx;
>         int i, j;
>
> +       blk_mq_disable_hotplug();
> +
>         queue_for_each_hw_ctx(q, hctx, i) {
>                 blk_mq_unregister_hctx(hctx);
>
> @@ -384,6 +386,9 @@ void blk_mq_unregister_disk(struct gendisk *disk)
>         kobject_put(&q->mq_kobj);
>
>         kobject_put(&disk_to_dev(disk)->kobj);
> +
> +       q->mq_sysfs_init_done = false;
> +       blk_mq_enable_hotplug();
>  }
>
>  static void blk_mq_sysfs_init(struct request_queue *q)
> @@ -414,27 +419,30 @@ int blk_mq_register_disk(struct gendisk *disk)
>         struct blk_mq_hw_ctx *hctx;
>         int ret, i;
>
> +       blk_mq_disable_hotplug();
> +
>         blk_mq_sysfs_init(q);
>
>         ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
>         if (ret < 0)
> -               return ret;
> +               goto out;
>
>         kobject_uevent(&q->mq_kobj, KOBJ_ADD);
>
>         queue_for_each_hw_ctx(q, hctx, i) {
> -               hctx->flags |= BLK_MQ_F_SYSFS_UP;
>                 ret = blk_mq_register_hctx(hctx);
>                 if (ret)
>                         break;
>         }
>
> -       if (ret) {
> +       if (ret)
>                 blk_mq_unregister_disk(disk);
> -               return ret;
> -       }
> +       else
> +               q->mq_sysfs_init_done = true;
> +out:
> +       blk_mq_enable_hotplug();
>
> -       return 0;
> +       return ret;
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_register_disk);
>
> @@ -443,6 +451,9 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
>         struct blk_mq_hw_ctx *hctx;
>         int i;
>
> +       if (!q->mq_sysfs_init_done)
> +               return;
> +
>         queue_for_each_hw_ctx(q, hctx, i)
>                 blk_mq_unregister_hctx(hctx);
>  }
> @@ -452,6 +463,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
>         struct blk_mq_hw_ctx *hctx;
>         int i, ret = 0;
>
> +       if (!q->mq_sysfs_init_done)
> +               return ret;
> +
>         queue_for_each_hw_ctx(q, hctx, i) {
>                 ret = blk_mq_register_hctx(hctx);
>                 if (ret)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f29f766..68921b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2045,13 +2045,13 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>                 goto err_hctxs;
>
>         mutex_lock(&all_q_mutex);
> -       list_add_tail(&q->all_q_node, &all_q_list);
> -       mutex_unlock(&all_q_mutex);
>
> +       list_add_tail(&q->all_q_node, &all_q_list);
>         blk_mq_add_queue_tag_set(set, q);
> -
>         blk_mq_map_swqueue(q);
>
> +       mutex_unlock(&all_q_mutex);
> +
>         return q;
>
>  err_hctxs:
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 37d1602..b80ba45 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -145,7 +145,6 @@ enum {
>         BLK_MQ_F_SHOULD_MERGE   = 1 << 0,
>         BLK_MQ_F_TAG_SHARED     = 1 << 1,
>         BLK_MQ_F_SG_MERGE       = 1 << 2,
> -       BLK_MQ_F_SYSFS_UP       = 1 << 3,
>         BLK_MQ_F_DEFER_ISSUE    = 1 << 4,
>         BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
>         BLK_MQ_F_ALLOC_POLICY_BITS = 1,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index d4068c1..b02c90b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -462,6 +462,8 @@ struct request_queue {
>
>         struct blk_mq_tag_set   *tag_set;
>         struct list_head        tag_set_list;
> +
> +       bool                    mq_sysfs_init_done;
>  };
>
>  #define QUEUE_FLAG_QUEUED      1       /* uses generic tag queueing */
> --
> 1.9.1
>



-- 
Ming Lei


* Re: [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping
  2015-07-18 16:28 ` [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
@ 2015-07-19 12:12   ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2015-07-19 12:12 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Linux Kernel Mailing List, Jens Axboe

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> Notifier callbacks for CPU_ONLINE action can be run on the other CPU
> than the CPU which was just onlined.  So it is possible for the
> process running on the just onlined CPU to insert request and run
> hw queue before establishing new mapping which is done by
> blk_mq_queue_reinit_notify().
>
> This can cause a problem when the CPU has just been onlined first time
> since the request queue was initialized.  At this time ctx->index_hw
> for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
> zero before blk_mq_queue_reinit_notify() is called by notifier
> callbacks for CPU_ONLINE action.
>
> For example, there is a single hw queue (hctx) and two CPU queues
> (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined and
> a request is inserted into ctx1->rq_list and set bit0 in pending
> bitmap as ctx1->index_hw is still zero.
>
> And then while running hw queue, flush_busy_ctxs() finds bit0 is set
> in pending bitmap and tries to retrieve requests in
> hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so the
> request in ctx1->rq_list is ignored.
>
> Fix it by ensuring that new mapping is established before onlined cpu
> starts running.
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>


> ---
>  block/blk-mq-cpumap.c |  9 ++++----
>  block/blk-mq.c        | 59 +++++++++++++++++++++++++++++++++++++++------------
>  block/blk-mq.h        |  3 ++-
>  3 files changed, 52 insertions(+), 19 deletions(-)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 1e28ddb..8764c24 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -31,7 +31,8 @@ static int get_first_sibling(unsigned int cpu)
>         return cpu;
>  }
>
> -int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
> +int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
> +                           const struct cpumask *online_mask)
>  {
>         unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
>         cpumask_var_t cpus;
> @@ -41,7 +42,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
>
>         cpumask_clear(cpus);
>         nr_cpus = nr_uniq_cpus = 0;
> -       for_each_online_cpu(i) {
> +       for_each_cpu(i, online_mask) {
>                 nr_cpus++;
>                 first_sibling = get_first_sibling(i);
>                 if (!cpumask_test_cpu(first_sibling, cpus))
> @@ -51,7 +52,7 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues)
>
>         queue = 0;
>         for_each_possible_cpu(i) {
> -               if (!cpu_online(i)) {
> +               if (!cpumask_test_cpu(i, online_mask)) {
>                         map[i] = 0;
>                         continue;
>                 }
> @@ -95,7 +96,7 @@ unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
>         if (!map)
>                 return NULL;
>
> -       if (!blk_mq_update_queue_map(map, set->nr_hw_queues))
> +       if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
>                 return map;
>
>         kfree(map);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d5d93e0..d861c70 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1799,7 +1799,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>         }
>  }
>
> -static void blk_mq_map_swqueue(struct request_queue *q)
> +static void blk_mq_map_swqueue(struct request_queue *q,
> +                              const struct cpumask *online_mask)
>  {
>         unsigned int i;
>         struct blk_mq_hw_ctx *hctx;
> @@ -1816,7 +1817,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>          */
>         queue_for_each_ctx(q, ctx, i) {
>                 /* If the cpu isn't online, the cpu is mapped to first hctx */
> -               if (!cpu_online(i))
> +               if (!cpumask_test_cpu(i, online_mask))
>                         continue;
>
>                 hctx = q->mq_ops->map_queue(q, i);
> @@ -1862,7 +1863,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>         }
>
>         queue_for_each_ctx(q, ctx, i) {
> -               if (!cpu_online(i))
> +               if (!cpumask_test_cpu(i, online_mask))
>                         continue;
>
>                 hctx = q->mq_ops->map_queue(q, i);
> @@ -2047,13 +2048,15 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>         if (blk_mq_init_hw_queues(q, set))
>                 goto err_hctxs;
>
> +       get_online_cpus();
>         mutex_lock(&all_q_mutex);
>
>         list_add_tail(&q->all_q_node, &all_q_list);
>         blk_mq_add_queue_tag_set(set, q);
> -       blk_mq_map_swqueue(q);
> +       blk_mq_map_swqueue(q, cpu_online_mask);
>
>         mutex_unlock(&all_q_mutex);
> +       put_online_cpus();
>
>         return q;
>
> @@ -2090,13 +2093,14 @@ void blk_mq_free_queue(struct request_queue *q)
>  }
>
>  /* Basically redo blk_mq_init_queue with queue frozen */
> -static void blk_mq_queue_reinit(struct request_queue *q)
> +static void blk_mq_queue_reinit(struct request_queue *q,
> +                               const struct cpumask *online_mask)
>  {
>         WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
>
>         blk_mq_sysfs_unregister(q);
>
> -       blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues);
> +       blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
>
>         /*
>          * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
> @@ -2104,7 +2108,7 @@ static void blk_mq_queue_reinit(struct request_queue *q)
>          * involves free and re-allocate memory, worthy doing?)
>          */
>
> -       blk_mq_map_swqueue(q);
> +       blk_mq_map_swqueue(q, online_mask);
>
>         blk_mq_sysfs_register(q);
>  }
> @@ -2113,16 +2117,43 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>                                       unsigned long action, void *hcpu)
>  {
>         struct request_queue *q;
> +       int cpu = (unsigned long)hcpu;
> +       /*
> +        * New online cpumask which is going to be set in this hotplug event.
> +        * Declare this cpumasks as global as cpu-hotplug operation is invoked
> +        * one-by-one and dynamically allocating this could result in a failure.
> +        */
> +       static struct cpumask online_new;
>
>         /*
> -        * Before new mappings are established, hotadded cpu might already
> -        * start handling requests. This doesn't break anything as we map
> -        * offline CPUs to first hardware queue. We will re-init the queue
> -        * below to get optimal settings.
> +        * Before hotadded cpu starts handling requests, new mappings must
> +        * be established.  Otherwise, these requests in hw queue might
> +        * never be dispatched.
> +        *
> +        * For example, there is a single hw queue (hctx) and two CPU queues
> +        * (ctx0 for CPU0, and ctx1 for CPU1).
> +        *
> +        * Now CPU1 is just onlined and a request is inserted into
> +        * ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
> +        * still zero.
> +        *
> +        * And then while running hw queue, flush_busy_ctxs() finds bit0 is
> +        * set in pending bitmap and tries to retrieve requests in
> +        * hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0,
> +        * so the request in ctx1->rq_list is ignored.
>          */
> -       if (action != CPU_DEAD && action != CPU_DEAD_FROZEN &&
> -           action != CPU_ONLINE && action != CPU_ONLINE_FROZEN)
> +       switch (action & ~CPU_TASKS_FROZEN) {
> +       case CPU_DEAD:
> +       case CPU_UP_CANCELED:
> +               cpumask_copy(&online_new, cpu_online_mask);
> +               break;
> +       case CPU_UP_PREPARE:
> +               cpumask_copy(&online_new, cpu_online_mask);
> +               cpumask_set_cpu(cpu, &online_new);
> +               break;
> +       default:
>                 return NOTIFY_OK;
> +       }
>
>         mutex_lock(&all_q_mutex);
>
> @@ -2146,7 +2177,7 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
>         }
>
>         list_for_each_entry(q, &all_q_list, all_q_node)
> -               blk_mq_queue_reinit(q);
> +               blk_mq_queue_reinit(q, &online_new);
>
>         list_for_each_entry(q, &all_q_list, all_q_node)
>                 blk_mq_unfreeze_queue(q);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 6a48c4c..f4fea79 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -51,7 +51,8 @@ void blk_mq_disable_hotplug(void);
>   * CPU -> queue mappings
>   */
>  extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
> -extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues);
> +extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
> +                                  const struct cpumask *online_mask);
>  extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
>
>  /*
> --
> 1.9.1
>



-- 
Ming Lei


* Re: [PATCH v3 6/7] blk-mq: fix freeze queue race
  2015-07-18 16:28 ` [PATCH v3 6/7] blk-mq: fix freeze queue race Akinobu Mita
@ 2015-07-19 12:12   ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2015-07-19 12:12 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Linux Kernel Mailing List, Jens Axboe, Tejun Heo

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> There are several race conditions while freezing queue.
>
> When unfreezing queue, there is a small window between decrementing
> q->mq_freeze_depth to zero and percpu_ref_reinit() call with
> q->mq_usage_counter.  If the other calls blk_mq_freeze_queue_start()
> in the window, q->mq_freeze_depth is increased from zero to one and
> percpu_ref_kill() is called with q->mq_usage_counter which is already
> killed.  percpu refcount should be re-initialized before killed again.
>
> Also, there is a race condition while switching to percpu mode.
> percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
> executed at the same time as the following scenario is possible:
>
> 1. q->mq_usage_counter is initialized in atomic mode.
>    (atomic counter: 1)
>
> 2. After the disk registration, a process like systemd-udev starts
>    accessing the disk, and successfully increases the refcount
>    by percpu_ref_tryget_live() in blk_mq_queue_enter().
>    (atomic counter: 2)
>
> 3. In the final stage of initialization, q->mq_usage_counter is being
>    switched to percpu mode by percpu_ref_switch_to_percpu() in
>    blk_mq_finish_init().  But if CONFIG_PREEMPT_VOLUNTARY is enabled,
>    the process is rescheduled in the middle of switching when calling
>    wait_event() in __percpu_ref_switch_to_percpu().
>    (atomic counter: 2)
>
> 4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
>    request queue.  q->mq_usage_counter is decreased and marked as
>    DEAD.  Wait until all requests have finished.
>    (atomic counter: 1)
>
> 5. The process rescheduled in the step 3. is resumed and finishes
>    all remaining work in __percpu_ref_switch_to_percpu().
>    A bias value is added to atomic counter of q->mq_usage_counter.
>    (atomic counter: PERCPU_COUNT_BIAS + 1)
>
> 6. The request issued in step 2 is finished and q->mq_usage_counter
>    is decreased by blk_mq_queue_exit().  q->mq_usage_counter is DEAD,
>    so atomic counter is decreased and no release handler is called.
>    (atomic counter: PERCPU_COUNT_BIAS)
>
> 7. CPU hotplug handling in the step 4. will wait forever as
>    q->mq_usage_counter will never be zero.
>
> Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
> at the same time.  Because both functions could call
> __percpu_ref_switch_to_percpu() which adds the bias value and
> initialize percpu counter.
>
> Fix those races by serializing with per-queue mutex.
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

> ---
>  block/blk-core.c       | 1 +
>  block/blk-mq-sysfs.c   | 2 ++
>  block/blk-mq.c         | 8 ++++++++
>  include/linux/blkdev.h | 6 ++++++
>  4 files changed, 17 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 627ed0c..544b237 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>         __set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
>
>         init_waitqueue_head(&q->mq_freeze_wq);
> +       mutex_init(&q->mq_freeze_lock);
>
>         if (blkcg_init_queue(q))
>                 goto fail_bdi;
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 79096a6..f63b464 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -409,7 +409,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
>  /* see blk_register_queue() */
>  void blk_mq_finish_init(struct request_queue *q)
>  {
> +       mutex_lock(&q->mq_freeze_lock);
>         percpu_ref_switch_to_percpu(&q->mq_usage_counter);
> +       mutex_unlock(&q->mq_freeze_lock);
>  }
>
>  int blk_mq_register_disk(struct gendisk *disk)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d861c70..b931e38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>  {
>         int freeze_depth;
>
> +       mutex_lock(&q->mq_freeze_lock);
> +
>         freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>         if (freeze_depth == 1) {
>                 percpu_ref_kill(&q->mq_usage_counter);
>                 blk_mq_run_hw_queues(q, false);
>         }
> +
> +       mutex_unlock(&q->mq_freeze_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
>
> @@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
>  {
>         int freeze_depth;
>
> +       mutex_lock(&q->mq_freeze_lock);
> +
>         freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
>         WARN_ON_ONCE(freeze_depth < 0);
>         if (!freeze_depth) {
>                 percpu_ref_reinit(&q->mq_usage_counter);
>                 wake_up_all(&q->mq_freeze_wq);
>         }
> +
> +       mutex_unlock(&q->mq_freeze_lock);
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b02c90b..b867c32 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -457,6 +457,12 @@ struct request_queue {
>  #endif
>         struct rcu_head         rcu_head;
>         wait_queue_head_t       mq_freeze_wq;
> +       /*
> +        * Protect concurrent access to mq_usage_counter by
> +        * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
> +        * percpu_ref_reinit().
> +        */
> +       struct mutex            mq_freeze_lock;
>         struct percpu_ref       mq_usage_counter;
>         struct list_head        all_q_node;
>
> --
> 1.9.1
>



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 17+ messages in thread
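A condensed view of the three call sites that the new q->mq_freeze_lock serializes, assembled from the hunks quoted above.  This is a sketch for illustration only, not the literal kernel source: the surrounding blk-mq code is omitted, and the comments simply tie each path back to the scenario in the changelog.

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/percpu-refcount.h>

/*
 * Freeze: the first freezer kills the percpu ref, i.e. switches it to
 * atomic mode and marks it DEAD, so new requests are held off
 * (step 4 of the scenario).
 */
void blk_mq_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (atomic_inc_return(&q->mq_freeze_depth) == 1) {
		percpu_ref_kill(&q->mq_usage_counter);
		blk_mq_run_hw_queues(q, false);
	}
	mutex_unlock(&q->mq_freeze_lock);
}

/*
 * Unfreeze: the last unfreezer re-initializes the ref, which adds the
 * percpu bias back and returns the ref to percpu operation.
 */
void blk_mq_unfreeze_queue(struct request_queue *q)
{
	int freeze_depth;

	mutex_lock(&q->mq_freeze_lock);
	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
	WARN_ON_ONCE(freeze_depth < 0);
	if (!freeze_depth) {
		percpu_ref_reinit(&q->mq_usage_counter);
		wake_up_all(&q->mq_freeze_wq);
	}
	mutex_unlock(&q->mq_freeze_lock);
}

/*
 * End of queue initialization: switches the ref from atomic to percpu
 * mode, adding the bias and possibly sleeping in wait_event()
 * (step 3 of the scenario).
 */
void blk_mq_finish_init(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
	mutex_unlock(&q->mq_freeze_lock);
}

With all three paths behind the same per-queue mutex, the mode switch of step 3 can no longer interleave with the percpu_ref_kill() of step 4, so the stray PERCPU_COUNT_BIAS that made the freeze wait forever cannot be left behind in the counter.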

* Re: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list
  2015-07-18 16:28 ` [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
@ 2015-07-19 12:30   ` Ming Lei
  2015-07-28  8:10   ` Wanpeng Li
  1 sibling, 0 replies; 17+ messages in thread
From: Ming Lei @ 2015-07-19 12:30 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Linux Kernel Mailing List, Jens Axboe

On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
> all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
> entries by blk_mq_sysfs_unregister().  Removing a sysfs entry needs
> to block until the active reference of the kernfs_node drops to zero.
>
> On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
> /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
> blk_mq_hw_sysfs_cpus_show().
>
> If these happen at the same time, a deadlock can occur, because one
> side waits for the active reference to reach zero while holding
> all_q_mutex, and the other tries to acquire all_q_mutex while holding
> the active reference.
>
> The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
> is to avoid reading an incomplete hctx->cpumask.  Since reading a
> blk-mq sysfs entry already acquires q->sysfs_lock, we can avoid both
> the deadlock and reading an incomplete hctx->cpumask by holding
> q->sysfs_lock while hctx->cpumask is being updated.
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

> ---
>  block/blk-mq-sysfs.c | 4 ----
>  block/blk-mq.c       | 7 +++++++
>  2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index f63b464..e0f71bf 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>         unsigned int i, first = 1;
>         ssize_t ret = 0;
>
> -       blk_mq_disable_hotplug();
> -
>         for_each_cpu(i, hctx->cpumask) {
>                 if (first)
>                         ret += sprintf(ret + page, "%u", i);
> @@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>                 first = 0;
>         }
>
> -       blk_mq_enable_hotplug();
> -
>         ret += sprintf(ret + page, "\n");
>         return ret;
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b931e38..1a5e7d1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>         struct blk_mq_ctx *ctx;
>         struct blk_mq_tag_set *set = q->tag_set;
>
> +       /*
> +        * Avoid others reading incomplete hctx->cpumask through sysfs
> +        */
> +       mutex_lock(&q->sysfs_lock);
> +
>         queue_for_each_hw_ctx(q, hctx, i) {
>                 cpumask_clear(hctx->cpumask);
>                 hctx->nr_ctx = 0;
> @@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>                 hctx->ctxs[hctx->nr_ctx++] = ctx;
>         }
>
> +       mutex_unlock(&q->sysfs_lock);
> +
>         queue_for_each_hw_ctx(q, hctx, i) {
>                 struct blk_mq_ctxmap *map = &hctx->ctx_map;
>
> --
> 1.9.1
>



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 17+ messages in thread
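The lock relationship behind the deadlock, and the exclusion that replaces it, can be sketched as below.  The diagram paraphrases the changelog rather than a literal lockdep trace, and the function is only a condensed reading of blk_mq_map_swqueue() after the hunk above (the extra parameter it grows elsewhere in this series and the per-hctx fixups are left out).

/*
 * Before the patch, the two paths can form a cycle:
 *
 *   CPU hotplug notifier                     reader of .../mq/0/cpu_list
 *   --------------------                     ---------------------------
 *   blk_mq_queue_reinit_notify()             kernfs takes the node's active ref
 *     mutex_lock(&all_q_mutex)               blk_mq_hw_sysfs_cpus_show()
 *     blk_mq_sysfs_unregister()                blk_mq_disable_hotplug()
 *       waits for the active ref to hit 0        mutex_lock(&all_q_mutex)  <- blocks
 *
 * After the patch, the reader no longer takes all_q_mutex at all.  The
 * remapping side instead holds q->sysfs_lock, which the blk-mq sysfs
 * show/store handlers already acquire, so a reader can never observe a
 * half-rebuilt hctx->cpumask and no lock cycle remains.
 */
static void blk_mq_map_swqueue(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int i;

	mutex_lock(&q->sysfs_lock);

	queue_for_each_hw_ctx(q, hctx, i) {
		cpumask_clear(hctx->cpumask);
		hctx->nr_ctx = 0;
	}

	/* ... rebuild the ctx -> hctx mapping and set hctx->cpumask bits ... */

	mutex_unlock(&q->sysfs_lock);

	/* the remaining per-hctx fixups run without q->sysfs_lock */
}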

* Re: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation
  2015-07-19 10:24   ` Ming Lei
@ 2015-07-21 13:58     ` Akinobu Mita
  2015-07-27  9:49       ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Akinobu Mita @ 2015-07-21 13:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Linux Kernel Mailing List, Keith Busch, Jens Axboe

On Sun, 2015-07-19 at 18:24 +0800, Ming Lei wrote:
> On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> > When unmapped hw queue is remapped after CPU topology is changed,
> > hctx->tags->cpumask is set before hctx->tags is allocated in
> > blk_mq_map_swqueue().
> >
> > In order to fix this null pointer dereference, hctx->tags must be
> > allocated before configuring hctx->tags->cpumask.
> 
> The root cause is that the mapping between hctx and ctx can change
> after the CPU topology is changed, and then hctx->tags can change too,
> so hctx->tags->cpumask has to be set after hctx->tags is set up.
> 
> >
> > Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
> 
> I am wondering if the above commit considers CPU hotplug; nvme uses
> tags->cpumask to set the irq affinity hint only when starting the
> queue.  It looks reasonable to introduce a mapping_changed() callback
> for handling this kind of thing, but that isn't related to this patch.
> 
> > Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> > Cc: Keith Busch <keith.busch@intel.com>
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Ming Lei <tom.leiming@gmail.com>
> > ---
> >  block/blk-mq.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 7d842db..f29f766 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> >
> >                 hctx = q->mq_ops->map_queue(q, i);
> >                 cpumask_set_cpu(i, hctx->cpumask);
> > -               cpumask_set_cpu(i, hctx->tags->cpumask);
> >                 ctx->index_hw = hctx->nr_ctx;
> >                 hctx->ctxs[hctx->nr_ctx++] = ctx;
> >         }
> > @@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> >                 hctx->next_cpu = cpumask_first(hctx->cpumask);
> >                 hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> >         }
> > +
> > +       queue_for_each_ctx(q, ctx, i) {
> > +               if (!cpu_online(i))
> > +                       continue;
> > +
> > +               hctx = q->mq_ops->map_queue(q, i);
> > +               cpumask_set_cpu(i, hctx->tags->cpumask);
> 
> If tags->cpumask is always the same as hctx->cpumask, this
> CPU iterator can be avoided.

How about this patch?
Or should we use cpumask_or() instead of cpumask_copy()?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db..56f814a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1846,7 +1845,10 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		if (!set->tags[i])
 			set->tags[i] = blk_mq_init_rq_map(set, i);
 		hctx->tags = set->tags[i];
-		WARN_ON(!hctx->tags);
+		if (hctx->tags)
+			cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
+		else
+			WARN_ON(1);
 
 		/*
 		 * Set the map size to the number of mapped software queues.



^ permalink raw reply related	[flat|nested] 17+ messages in thread
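In context, the alternative above would leave the tag-mapping loop of blk_mq_map_swqueue() looking roughly like the sketch below; this only illustrates the proposed direction, with the unrelated parts of the loop elided.

	queue_for_each_hw_ctx(q, hctx, i) {
		/* ... handling of hctxs with no mapped ctx elided ... */

		if (!set->tags[i])
			set->tags[i] = blk_mq_init_rq_map(set, i);
		hctx->tags = set->tags[i];

		/* hctx->tags is allocated by now, so its cpumask can be set safely */
		if (hctx->tags)
			cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
		else
			WARN_ON(1);

		/* ... map size, hctx->next_cpu, next_cpu_batch, etc. as before ... */
	}

As for the question raised above: cpumask_copy() would make tags->cpumask track exactly the CPUs currently mapped to that hctx on every remap, while cpumask_or() would only ever accumulate CPUs across remaps and never drop ones that have gone away.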

* Re: [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation
  2015-07-21 13:58     ` Akinobu Mita
@ 2015-07-27  9:49       ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2015-07-27  9:49 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: Linux Kernel Mailing List, Keith Busch, Jens Axboe

On Tue, Jul 21, 2015 at 9:58 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
> 2015-07-19 (日) の 18:24 +0800 に Ming Lei さんは書きました:
>> On Sun, Jul 19, 2015 at 12:28 AM, Akinobu Mita <akinobu.mita@gmail.com> wrote:
>> > When unmapped hw queue is remapped after CPU topology is changed,
>> > hctx->tags->cpumask is set before hctx->tags is allocated in
>> > blk_mq_map_swqueue().
>> >
>> > In order to fix this null pointer dereference, hctx->tags must be
>> > allocated before configuring hctx->tags->cpumask.
>>
>> The root cause is that the mapping between hctx and ctx can change
>> after the CPU topology is changed, and then hctx->tags can change too,
>> so hctx->tags->cpumask has to be set after hctx->tags is set up.
>>
>> >
>> > Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
>>
>> I am wondering if the above commit considers CPU hotplug; nvme uses
>> tags->cpumask to set the irq affinity hint only when starting the
>> queue.  It looks reasonable to introduce a mapping_changed() callback
>> for handling this kind of thing, but that isn't related to this patch.
>>
>> > Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
>> > Cc: Keith Busch <keith.busch@intel.com>
>> > Cc: Jens Axboe <axboe@kernel.dk>
>> > Cc: Ming Lei <tom.leiming@gmail.com>
>> > ---
>> >  block/blk-mq.c | 9 ++++++++-
>> >  1 file changed, 8 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>> > index 7d842db..f29f766 100644
>> > --- a/block/blk-mq.c
>> > +++ b/block/blk-mq.c
>> > @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>> >
>> >                 hctx = q->mq_ops->map_queue(q, i);
>> >                 cpumask_set_cpu(i, hctx->cpumask);
>> > -               cpumask_set_cpu(i, hctx->tags->cpumask);
>> >                 ctx->index_hw = hctx->nr_ctx;
>> >                 hctx->ctxs[hctx->nr_ctx++] = ctx;
>> >         }
>> > @@ -1861,6 +1860,14 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>> >                 hctx->next_cpu = cpumask_first(hctx->cpumask);
>> >                 hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>> >         }
>> > +
>> > +       queue_for_each_ctx(q, ctx, i) {
>> > +               if (!cpu_online(i))
>> > +                       continue;
>> > +
>> > +               hctx = q->mq_ops->map_queue(q, i);
>> > +               cpumask_set_cpu(i, hctx->tags->cpumask);
>>
>> If tags->cpumask is always the same as hctx->cpumask, this
>> CPU iterator can be avoided.
>
> How about this patch?
> Or should we use cpumask_or() instead of cpumask_copy()?

I guess tags->cpumask needs to be fixed in the future, so it's better to
just take the current patch:

[PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation

Thanks,

>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db..56f814a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1821,7 +1821,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>
>                 hctx = q->mq_ops->map_queue(q, i);
>                 cpumask_set_cpu(i, hctx->cpumask);
> -               cpumask_set_cpu(i, hctx->tags->cpumask);
>                 ctx->index_hw = hctx->nr_ctx;
>                 hctx->ctxs[hctx->nr_ctx++] = ctx;
>         }
> @@ -1846,7 +1845,10 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>                 if (!set->tags[i])
>                         set->tags[i] = blk_mq_init_rq_map(set, i);
>                 hctx->tags = set->tags[i];
> -               WARN_ON(!hctx->tags);
> +               if (hctx->tags)
> +                       cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
> +               else
> +                       WARN_ON(1);
>
>                 /*
>                  * Set the map size to the number of mapped software queues.
>
>



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list
  2015-07-18 16:28 ` [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
  2015-07-19 12:30   ` Ming Lei
@ 2015-07-28  8:10   ` Wanpeng Li
  2015-07-28 23:37     ` Akinobu Mita
  1 sibling, 1 reply; 17+ messages in thread
From: Wanpeng Li @ 2015-07-28  8:10 UTC (permalink / raw)
  To: Akinobu Mita, linux-kernel; +Cc: Jens Axboe, Ming Lei



On 7/19/15 12:28 AM, Akinobu Mita wrote:
> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
> all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
> entries by blk_mq_sysfs_unregister().  Removing a sysfs entry needs
> to block until the active reference of the kernfs_node drops to zero.
>
> On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
> /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
> blk_mq_hw_sysfs_cpus_show().
>
> If these happen at the same time, a deadlock can occur, because one
> side waits for the active reference to reach zero while holding
> all_q_mutex, and the other tries to acquire all_q_mutex while holding
> the active reference.
>
> The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
> is to avoid reading an incomplete hctx->cpumask.  Since reading a
> blk-mq sysfs entry already acquires q->sysfs_lock, we can avoid both
> the deadlock and reading an incomplete hctx->cpumask by holding
> q->sysfs_lock while hctx->cpumask is being updated.

After your patch, dumping the ctx attrs will be mutually exclusive with
mapping software queues to hardware queues, which may confuse people.

Regards,
Wanpeng Li

>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>
> ---
>   block/blk-mq-sysfs.c | 4 ----
>   block/blk-mq.c       | 7 +++++++
>   2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index f63b464..e0f71bf 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>   	unsigned int i, first = 1;
>   	ssize_t ret = 0;
>   
> -	blk_mq_disable_hotplug();
> -
>   	for_each_cpu(i, hctx->cpumask) {
>   		if (first)
>   			ret += sprintf(ret + page, "%u", i);
> @@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>   		first = 0;
>   	}
>   
> -	blk_mq_enable_hotplug();
> -
>   	ret += sprintf(ret + page, "\n");
>   	return ret;
>   }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b931e38..1a5e7d1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>   	struct blk_mq_ctx *ctx;
>   	struct blk_mq_tag_set *set = q->tag_set;
>   
> +	/*
> +	 * Avoid others reading incomplete hctx->cpumask through sysfs
> +	 */
> +	mutex_lock(&q->sysfs_lock);
> +
>   	queue_for_each_hw_ctx(q, hctx, i) {
>   		cpumask_clear(hctx->cpumask);
>   		hctx->nr_ctx = 0;
> @@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>   		hctx->ctxs[hctx->nr_ctx++] = ctx;
>   	}
>   
> +	mutex_unlock(&q->sysfs_lock);
> +
>   	queue_for_each_hw_ctx(q, hctx, i) {
>   		struct blk_mq_ctxmap *map = &hctx->ctx_map;
>   


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list
  2015-07-28  8:10   ` Wanpeng Li
@ 2015-07-28 23:37     ` Akinobu Mita
  0 siblings, 0 replies; 17+ messages in thread
From: Akinobu Mita @ 2015-07-28 23:37 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: LKML, Jens Axboe, Ming Lei

2015-07-28 17:10 GMT+09:00 Wanpeng Li <wanpeng.li@hotmail.com>:
>
>
> On 7/19/15 12:28 AM, Akinobu Mita wrote:
>>
>> CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
>> all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
>> entries by blk_mq_sysfs_unregister().  Removing a sysfs entry needs
>> to block until the active reference of the kernfs_node drops to zero.
>>
>> On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
>> /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
>> blk_mq_hw_sysfs_cpus_show().
>>
>> If these happen at the same time, a deadlock can occur, because one
>> side waits for the active reference to reach zero while holding
>> all_q_mutex, and the other tries to acquire all_q_mutex while holding
>> the active reference.
>>
>> The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
>> is to avoid reading an incomplete hctx->cpumask.  Since reading a
>> blk-mq sysfs entry already acquires q->sysfs_lock, we can avoid both
>> the deadlock and reading an incomplete hctx->cpumask by holding
>> q->sysfs_lock while hctx->cpumask is being updated.
>
>
> After your patch, dumping the ctx attrs will be mutually exclusive with
> mapping software queues to hardware queues, which may confuse people.

Do you suggest that we shouldn't use q->sysfs_lock in
blk_mq_map_swqueue() in this patch?

>>
>> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Ming Lei <tom.leiming@gmail.com>
>>
>> ---
>>   block/blk-mq-sysfs.c | 4 ----
>>   block/blk-mq.c       | 7 +++++++
>>   2 files changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
>> index f63b464..e0f71bf 100644
>> --- a/block/blk-mq-sysfs.c
>> +++ b/block/blk-mq-sysfs.c
>> @@ -218,8 +218,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct
>> blk_mq_hw_ctx *hctx, char *page)
>>         unsigned int i, first = 1;
>>         ssize_t ret = 0;
>>   -     blk_mq_disable_hotplug();
>> -
>>         for_each_cpu(i, hctx->cpumask) {
>>                 if (first)
>>                         ret += sprintf(ret + page, "%u", i);
>> @@ -229,8 +227,6 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct
>> blk_mq_hw_ctx *hctx, char *page)
>>                 first = 0;
>>         }
>>   -     blk_mq_enable_hotplug();
>> -
>>         ret += sprintf(ret + page, "\n");
>>         return ret;
>>   }
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b931e38..1a5e7d1 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1815,6 +1815,11 @@ static void blk_mq_map_swqueue(struct request_queue
>> *q,
>>         struct blk_mq_ctx *ctx;
>>         struct blk_mq_tag_set *set = q->tag_set;
>>   +     /*
>> +        * Avoid others reading incomplete hctx->cpumask through sysfs
>> +        */
>> +       mutex_lock(&q->sysfs_lock);
>> +
>>         queue_for_each_hw_ctx(q, hctx, i) {
>>                 cpumask_clear(hctx->cpumask);
>>                 hctx->nr_ctx = 0;
>> @@ -1834,6 +1839,8 @@ static void blk_mq_map_swqueue(struct request_queue
>> *q,
>>                 hctx->ctxs[hctx->nr_ctx++] = ctx;
>>         }
>>   +     mutex_unlock(&q->sysfs_lock);
>> +
>>         queue_for_each_hw_ctx(q, hctx, i) {
>>                 struct blk_mq_ctxmap *map = &hctx->ctx_map;
>>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages
2015-07-18 16:28 [PATCH v3 0/7] blk-mq: fix race conditions on cpu hotplug handling Akinobu Mita
2015-07-18 16:28 ` [PATCH v3 1/7] blk-mq: avoid access hctx->tags->cpumask before allocation Akinobu Mita
2015-07-19 10:24   ` Ming Lei
2015-07-21 13:58     ` Akinobu Mita
2015-07-27  9:49       ` Ming Lei
2015-07-18 16:28 ` [PATCH v3 2/7] blk-mq: fix sysfs registration/unregistration race Akinobu Mita
2015-07-19 11:47   ` Ming Lei
2015-07-18 16:28 ` [PATCH v3 3/7] blk-mq: Fix use after of free q->mq_map Akinobu Mita
2015-07-18 16:28 ` [PATCH v3 4/7] blk-mq: fix q->mq_usage_counter access race Akinobu Mita
2015-07-18 16:28 ` [PATCH v3 5/7] blk-mq: avoid inserting requests before establishing new mapping Akinobu Mita
2015-07-19 12:12   ` Ming Lei
2015-07-18 16:28 ` [PATCH v3 6/7] blk-mq: fix freeze queue race Akinobu Mita
2015-07-19 12:12   ` Ming Lei
2015-07-18 16:28 ` [PATCH v3 7/7] blk-mq: fix deadlock when reading cpu_list Akinobu Mita
2015-07-19 12:30   ` Ming Lei
2015-07-28  8:10   ` Wanpeng Li
2015-07-28 23:37     ` Akinobu Mita
