* [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects
@ 2019-08-21  9:15 Ming Lei
  2019-08-21  9:15 ` [PATCH V2 1/6] block: Remove blk_mq_register_dev() Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer, Bart Van Assche

Hi,

The first 4 patches clean up the current uses of q->sysfs_lock.

The 5th patch adds a helper for checking whether a queue is registered.

The last patch splits .sysfs_lock into two locks: one is only for
syncing .store/.show from sysfs, the other is for protecting kobject
registering/unregistering. Meanwhile, avoid acquiring .sysfs_lock when
removing the mq & iosched kobjects, so that the reported deadlock can
be fixed.
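
For reference, the end state is two mutexes in struct request_queue
(fields as added by the last patch; the comments here are only
annotation):

	struct mutex		sysfs_lock;	/* syncs sysfs .show/.store */
	struct mutex		sysfs_dir_lock;	/* protects kobject add/remove */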

V2:
	- remove several uses of .sysfs_lock
	- remove blk_mq_register_dev()
	- add a helper for checking whether a queue is registered
	- split .sysfs_lock into two locks

Bart Van Assche (1):
  block: Remove blk_mq_register_dev()

Ming Lei (5):
  block: don't hold q->sysfs_lock in elevator_init_mq
  blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
  blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs()
  block: add helper for checking if queue is registered
  block: split .sysfs_lock into two locks

 block/blk-core.c       |  1 +
 block/blk-mq-sysfs.c   | 23 ++++------------
 block/blk-mq.c         | 10 -------
 block/blk-sysfs.c      | 50 +++++++++++++++++++++-------------
 block/blk-wbt.c        |  2 +-
 block/blk.h            |  2 +-
 block/elevator.c       | 62 +++++++++++++++++++++++++++++++-----------
 include/linux/blk-mq.h |  1 -
 include/linux/blkdev.h |  2 ++
 9 files changed, 88 insertions(+), 65 deletions(-)

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>


-- 
2.20.1


* [PATCH V2 1/6] block: Remove blk_mq_register_dev()
  2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
@ 2019-08-21  9:15 ` Ming Lei
  2019-08-21  9:15 ` [PATCH V2 2/6] block: don't hold q->sysfs_lock in elevator_init_mq Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Bart Van Assche, Christoph Hellwig, Ming Lei,
	Hannes Reinecke

From: Bart Van Assche <bvanassche@acm.org>

This function has no callers. Hence remove it.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-sysfs.c   | 11 -----------
 include/linux/blk-mq.h |  1 -
 2 files changed, 12 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index e0b97c22726c..31bbf10d8149 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -326,17 +326,6 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
 	return ret;
 }
 
-int blk_mq_register_dev(struct device *dev, struct request_queue *q)
-{
-	int ret;
-
-	mutex_lock(&q->sysfs_lock);
-	ret = __blk_mq_register_dev(dev, q);
-	mutex_unlock(&q->sysfs_lock);
-
-	return ret;
-}
-
 void blk_mq_sysfs_unregister(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 21cebe901ac0..62a3bb715899 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -253,7 +253,6 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 						const struct blk_mq_ops *ops,
 						unsigned int queue_depth,
 						unsigned int set_flags);
-int blk_mq_register_dev(struct device *, struct request_queue *);
 void blk_mq_unregister_dev(struct device *, struct request_queue *);
 
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set);
-- 
2.20.1


* [PATCH V2 2/6] block: don't hold q->sysfs_lock in elevator_init_mq
  2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
  2019-08-21  9:15 ` [PATCH V2 1/6] block: Remove blk_mq_register_dev() Ming Lei
@ 2019-08-21  9:15 ` Ming Lei
  2019-08-21 15:51   ` Bart Van Assche
  2019-08-21  9:15 ` [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer, Bart Van Assche

The original comment says:

	q->sysfs_lock must be held to provide mutual exclusion between
	elevator_switch() and here.

Which is simply wrong: elevator_init_mq() is only called from
blk_mq_init_allocated_queue(), which is always called before the request
queue is registered via blk_register_queue(), for both dm-rq and normal
rq-based drivers. However, the queue's kobject is only exposed and added
to sysfs in blk_register_queue(). So there is no such race between
elevator_switch() and elevator_init_mq().

So avoid holding q->sysfs_lock in elevator_init_mq().
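
A sketch of the call ordering this relies on (condensed; error handling
and driver-specific setup omitted):

	q = blk_mq_init_allocated_queue(set, q); /* calls elevator_init_mq() */
	/* ... driver sets up the gendisk ... */
	device_add_disk(disk);	/* calls blk_register_queue(), which is
				 * where q->kobj is exposed to userspace */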

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 2f17d66d0e61..37b918dc4676 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -608,22 +608,22 @@ int elevator_init_mq(struct request_queue *q)
 		return 0;
 
 	/*
-	 * q->sysfs_lock must be held to provide mutual exclusion between
-	 * elevator_switch() and here.
+	 * We are only called from blk_mq_init_allocated_queue(); at that
+	 * time the request queue isn't registered yet, so the queue
+	 * kobject isn't exposed to userspace. No need to worry about a
+	 * race with elevator_switch(), and no need to hold q->sysfs_lock.
 	 */
-	mutex_lock(&q->sysfs_lock);
 	if (unlikely(q->elevator))
-		goto out_unlock;
+		goto out;
 
 	e = elevator_get(q, "mq-deadline", false);
 	if (!e)
-		goto out_unlock;
+		goto out;
 
 	err = blk_mq_init_sched(q, e);
 	if (err)
 		elevator_put(e);
-out_unlock:
-	mutex_unlock(&q->sysfs_lock);
+out:
 	return err;
 }
 
-- 
2.20.1


* [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
  2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
  2019-08-21  9:15 ` [PATCH V2 1/6] block: Remove blk_mq_register_dev() Ming Lei
  2019-08-21  9:15 ` [PATCH V2 2/6] block: don't hold q->sysfs_lock in elevator_init_mq Ming Lei
@ 2019-08-21  9:15 ` Ming Lei
  2019-08-21 15:53   ` Bart Van Assche
  2019-08-21  9:15 ` [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs() Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer, Bart Van Assche

blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
isn't exposed to userspace yet. For the latter caller, sysfs/debugfs
is unregistered before updating nr_hw_queues, as sketched below.

On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
request queue is freed") moved freeing hctx into the queue's release
handler, so there is no race with the queue release path either.

So don't hold q->sysfs_lock in blk_mq_map_swqueue().
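
For the blk_mq_update_nr_hw_queues() case, the relevant ordering is
roughly (a condensed sketch of __blk_mq_update_nr_hw_queues()):

	blk_mq_debugfs_unregister_hctxs(q);
	blk_mq_sysfs_unregister(q);	/* hctx sysfs dirs are gone */
	blk_mq_realloc_hw_ctxs(set, q);
	blk_mq_map_swqueue(q);		/* no sysfs reader can observe an
					 * incomplete hctx->cpumask here */
	blk_mq_sysfs_register(q);
	blk_mq_debugfs_register_hctxs(q);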

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6968de9d7402..b0ee0cac737f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2456,11 +2456,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	/*
-	 * Avoid others reading imcomplete hctx->cpumask through sysfs
-	 */
-	mutex_lock(&q->sysfs_lock);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
@@ -2521,8 +2516,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 					HCTX_TYPE_DEFAULT, i);
 	}
 
-	mutex_unlock(&q->sysfs_lock);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		/*
 		 * If no software queues are mapped to this hardware queue,
-- 
2.20.1


* [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs()
  2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
                   ` (2 preceding siblings ...)
  2019-08-21  9:15 ` [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Ming Lei
@ 2019-08-21  9:15 ` Ming Lei
  2019-08-21 15:56   ` Bart Van Assche
  2019-08-21  9:15 ` [PATCH V2 5/6] block: add helper for checking if queue is registered Ming Lei
  2019-08-21  9:15 ` [PATCH V2 6/6] block: split .sysfs_lock into two locks Ming Lei
  5 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer, Bart Van Assche

blk_mq_realloc_hw_ctxs() is called from blk_mq_init_allocated_queue()
and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
isn't exposed to userspace yet. For the latter caller, sysfs/debugfs
is unregistered before updating nr_hw_queues.

On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
request queue is freed") moved freeing hctx into the queue's release
handler, so there is no race with the queue release path either.

So don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs().
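
Note that concurrent nr_hw_queues updates are already serialized at the
tag set level; blk_mq_update_nr_hw_queues() is essentially:

	void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
	{
		mutex_lock(&set->tag_list_lock);
		__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
		mutex_unlock(&set->tag_list_lock);
	}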

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b0ee0cac737f..d4c8692aca1f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2768,8 +2768,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	int i, j, end;
 	struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
 
-	/* protect against switching io scheduler  */
-	mutex_lock(&q->sysfs_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int node;
 		struct blk_mq_hw_ctx *hctx;
@@ -2820,7 +2818,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			hctxs[j] = NULL;
 		}
 	}
-	mutex_unlock(&q->sysfs_lock);
 }
 
 /*
-- 
2.20.1


* [PATCH V2 5/6] block: add helper for checking if queue is registered
  2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
                   ` (3 preceding siblings ...)
  2019-08-21  9:15 ` [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs() Ming Lei
@ 2019-08-21  9:15 ` Ming Lei
  2019-08-21 15:57   ` Bart Van Assche
  2019-08-21  9:15 ` [PATCH V2 6/6] block: split .sysfs_lock into two locks Ming Lei
  5 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer, Bart Van Assche

There are 4 users that check whether a queue is registered, so add one
helper for this check.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c      | 4 ++--
 block/blk-wbt.c        | 2 +-
 block/elevator.c       | 2 +-
 include/linux/blkdev.h | 1 +
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 977c659dcd18..5b0b5224cfd4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -942,7 +942,7 @@ int blk_register_queue(struct gendisk *disk)
 	if (WARN_ON(!q))
 		return -ENXIO;
 
-	WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags),
+	WARN_ONCE(blk_queue_registered(q),
 		  "%s is registering an already registered queue\n",
 		  kobject_name(&dev->kobj));
 	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
@@ -1026,7 +1026,7 @@ void blk_unregister_queue(struct gendisk *disk)
 		return;
 
 	/* Return early if disk->queue was never registered. */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return;
 
 	/*
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 313f45a37e9d..c4d3089e47f7 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -656,7 +656,7 @@ void wbt_enable_default(struct request_queue *q)
 		return;
 
 	/* Queue not registered? Maybe shutting down... */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return;
 
 	if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ))
diff --git a/block/elevator.c b/block/elevator.c
index 37b918dc4676..7449a5836b52 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -660,7 +660,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	struct elevator_type *e;
 
 	/* Make sure queue is not in the middle of being removed */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return -ENOENT;
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 167bf879f072..6041755984f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -647,6 +647,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
 #define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
+#define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.20.1


* [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
                   ` (4 preceding siblings ...)
  2019-08-21  9:15 ` [PATCH V2 5/6] block: add helper for checking if queue is registered Ming Lei
@ 2019-08-21  9:15 ` Ming Lei
  2019-08-21 16:18   ` Bart Van Assche
  2019-08-23 16:46   ` Bart Van Assche
  5 siblings, 2 replies; 20+ messages in thread
From: Ming Lei @ 2019-08-21  9:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer, Bart Van Assche

Split .sysfs_lock into two locks: one keeps the name .sysfs_lock and
covers synchronizing sysfs .store operations; the other is named
.sysfs_dir_lock and covers kobject registering/unregistering and the
related status changes.

sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under those kobjects. For switching the
scheduler via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED together with .sysfs_lock to avoid the race, so
we no longer need to hold .sysfs_lock while removing/adding kobjects.
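
As an illustration of the pattern (a sketch only; queue_store_sketch()
is a made-up name, not part of this patch):

	static ssize_t queue_store_sketch(struct request_queue *q,
					  const char *page, size_t count)
	{
		mutex_lock(&q->sysfs_lock);
		if (!blk_queue_registered(q)) {
			/* queue has been or is being unregistered */
			mutex_unlock(&q->sysfs_lock);
			return -ENOENT;
		}
		/* ... apply the change, e.g. switch the elevator ... */
		mutex_unlock(&q->sysfs_lock);
		return count;
	}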

The kernfs built-in lock 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside block's .show/.store callbacks, q->sysfs_lock
is required.

However, when the mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock
'kn->count' is also required inside kobject_del(); see the lockdep
warning[1].

On the other hand, it isn't necessary to acquire q->sysfs_lock for
either blk_mq_unregister_dev() or elv_unregister_queue(), because
clearing the REGISTERED flag prevents any further stores to
'queue/scheduler'. Also, sysfs writes (stores) are exclusive, so there
is no need to hold the lock for elv_unregister_queue() when it is
called on the elevator-switch path.

[1]  lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           __mutex_lock+0x14a/0xa9b
           blk_mq_hw_sysfs_show+0x63/0xb6
           sysfs_kf_seq_show+0x11f/0x196
           seq_read+0x2cd/0x5f2
           vfs_read+0xc7/0x18c
           ksys_read+0xc4/0x13e
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
           check_prev_add+0x5d2/0xc45
           validate_chain+0xed3/0xf94
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           __kernfs_remove+0x237/0x40b
           kernfs_remove_by_name_ns+0x59/0x72
           remove_files+0x61/0x96
           sysfs_remove_group+0x81/0xa4
           sysfs_remove_groups+0x3b/0x44
           kobject_del+0x44/0x94
           blk_mq_unregister_dev+0x83/0xdd
           blk_unregister_queue+0xa0/0x10b
           del_gendisk+0x259/0x3fa
           null_del_dev+0x8b/0x1c3 [null_blk]
           null_exit+0x5c/0x95 [null_blk]
           __se_sys_delete_module+0x204/0x337
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&q->sysfs_lock);
                                   lock(kn->count#202);
                                   lock(&q->sysfs_lock);
      lock(kn->count#202);

     *** DEADLOCK ***

    2 locks held by rmmod/777:
     #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
     #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
     dump_stack+0x9a/0xe6
     check_noncircular+0x207/0x251
     ? print_circular_bug+0x32a/0x32a
     ? find_usage_backwards+0x84/0xb0
     check_prev_add+0x5d2/0xc45
     validate_chain+0xed3/0xf94
     ? check_prev_add+0xc45/0xc45
     ? mark_lock+0x11b/0x804
     ? check_usage_forwards+0x1ca/0x1ca
     __lock_acquire+0x95f/0xa2f
     lock_acquire+0x1b4/0x1e8
     ? kernfs_remove_by_name_ns+0x59/0x72
     __kernfs_remove+0x237/0x40b
     ? kernfs_remove_by_name_ns+0x59/0x72
     ? kernfs_next_descendant_post+0x7d/0x7d
     ? strlen+0x10/0x23
     ? strcmp+0x22/0x44
     kernfs_remove_by_name_ns+0x59/0x72
     remove_files+0x61/0x96
     sysfs_remove_group+0x81/0xa4
     sysfs_remove_groups+0x3b/0x44
     kobject_del+0x44/0x94
     blk_mq_unregister_dev+0x83/0xdd
     blk_unregister_queue+0xa0/0x10b
     del_gendisk+0x259/0x3fa
     ? disk_events_poll_msecs_store+0x12b/0x12b
     ? check_flags+0x1ea/0x204
     ? mark_held_locks+0x1f/0x7a
     null_del_dev+0x8b/0x1c3 [null_blk]
     null_exit+0x5c/0x95 [null_blk]
     __se_sys_delete_module+0x204/0x337
     ? free_module+0x39f/0x39f
     ? blkcg_maybe_throttle_current+0x8a/0x718
     ? rwlock_bug+0x62/0x62
     ? __blkcg_punt_bio_submit+0xd0/0xd0
     ? trace_hardirqs_on_thunk+0x1a/0x20
     ? mark_held_locks+0x1f/0x7a
     ? do_syscall_64+0x4c/0x295
     do_syscall_64+0xa7/0x295
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       |  1 +
 block/blk-mq-sysfs.c   | 12 +++++------
 block/blk-sysfs.c      | 46 ++++++++++++++++++++++++++----------------
 block/blk.h            |  2 +-
 block/elevator.c       | 46 ++++++++++++++++++++++++++++++++++--------
 include/linux/blkdev.h |  1 +
 6 files changed, 76 insertions(+), 32 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 919629ce4015..2792f7cf7bef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	mutex_init(&q->blk_trace_mutex);
 #endif
 	mutex_init(&q->sysfs_lock);
+	mutex_init(&q->sysfs_dir_lock);
 	spin_lock_init(&q->queue_lock);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 31bbf10d8149..a4cc40ddda86 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,7 +247,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	lockdep_assert_held(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_dir_lock);
 
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
@@ -297,7 +297,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
 	int ret, i;
 
 	WARN_ON_ONCE(!q->kobj.parent);
-	lockdep_assert_held(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_dir_lock);
 
 	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
 	if (ret < 0)
@@ -331,7 +331,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 	if (!q->mq_sysfs_init_done)
 		goto unlock;
 
@@ -339,7 +339,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
 		blk_mq_unregister_hctx(hctx);
 
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 }
 
 int blk_mq_sysfs_register(struct request_queue *q)
@@ -347,7 +347,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i, ret = 0;
 
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 	if (!q->mq_sysfs_init_done)
 		goto unlock;
 
@@ -358,7 +358,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
 	}
 
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 
 	return ret;
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5b0b5224cfd4..5941a0176f87 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -938,6 +938,7 @@ int blk_register_queue(struct gendisk *disk)
 	int ret;
 	struct device *dev = disk_to_dev(disk);
 	struct request_queue *q = disk->queue;
+	bool has_elevator = false;
 
 	if (WARN_ON(!q))
 		return -ENXIO;
@@ -945,7 +946,6 @@ int blk_register_queue(struct gendisk *disk)
 	WARN_ONCE(blk_queue_registered(q),
 		  "%s is registering an already registered queue\n",
 		  kobject_name(&dev->kobj));
-	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
 
 	/*
 	 * SCSI probing may synchronously create and destroy a lot of
@@ -966,7 +966,7 @@ int blk_register_queue(struct gendisk *disk)
 		return ret;
 
 	/* Prevent changes through sysfs until registration is completed. */
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 
 	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
 	if (ret < 0) {
@@ -987,26 +987,37 @@ int blk_register_queue(struct gendisk *disk)
 		blk_mq_debugfs_register(q);
 	}
 
-	kobject_uevent(&q->kobj, KOBJ_ADD);
-
-	wbt_enable_default(q);
-
-	blk_throtl_register_queue(q);
-
+	/*
+	 * The queue's kobject ADD uevent hasn't been sent out yet, and
+	 * the QUEUE_FLAG_REGISTERED flag isn't set either, so an
+	 * elevator switch can't happen at all.
+	 */
 	if (q->elevator) {
-		ret = elv_register_queue(q);
+		ret = elv_register_queue(q, false);
 		if (ret) {
-			mutex_unlock(&q->sysfs_lock);
-			kobject_uevent(&q->kobj, KOBJ_REMOVE);
+			mutex_unlock(&q->sysfs_dir_lock);
 			kobject_del(&q->kobj);
 			blk_trace_remove_sysfs(dev);
 			kobject_put(&dev->kobj);
 			return ret;
 		}
+		has_elevator = true;
 	}
+
+	mutex_lock(&q->sysfs_lock);
+	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
+	wbt_enable_default(q);
+	blk_throtl_register_queue(q);
+	mutex_unlock(&q->sysfs_lock);
+
+	/* Now everything is ready and send out KOBJ_ADD uevent */
+	kobject_uevent(&q->kobj, KOBJ_ADD);
+	if (has_elevator)
+		kobject_uevent(&q->elevator->kobj, KOBJ_ADD);
+
 	ret = 0;
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_register_queue);
@@ -1021,6 +1032,7 @@ EXPORT_SYMBOL_GPL(blk_register_queue);
 void blk_unregister_queue(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
+	bool has_elevator;
 
 	if (WARN_ON(!q))
 		return;
@@ -1035,25 +1047,25 @@ void blk_unregister_queue(struct gendisk *disk)
 	 * concurrent elv_iosched_store() calls.
 	 */
 	mutex_lock(&q->sysfs_lock);
-
 	blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
+	has_elevator = !!q->elevator;
+	mutex_unlock(&q->sysfs_lock);
 
+	mutex_lock(&q->sysfs_dir_lock);
 	/*
 	 * Remove the sysfs attributes before unregistering the queue data
 	 * structures that can be modified through sysfs.
 	 */
 	if (queue_is_mq(q))
 		blk_mq_unregister_dev(disk_to_dev(disk), q);
-	mutex_unlock(&q->sysfs_lock);
 
 	kobject_uevent(&q->kobj, KOBJ_REMOVE);
 	kobject_del(&q->kobj);
 	blk_trace_remove_sysfs(disk_to_dev(disk));
 
-	mutex_lock(&q->sysfs_lock);
-	if (q->elevator)
+	if (has_elevator)
 		elv_unregister_queue(q);
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 
 	kobject_put(&disk_to_dev(disk)->kobj);
 }
diff --git a/block/blk.h b/block/blk.h
index de6b2e146d6e..e4619fc5c99a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -188,7 +188,7 @@ int elevator_init_mq(struct request_queue *q);
 int elevator_switch_mq(struct request_queue *q,
 			      struct elevator_type *new_e);
 void __elevator_exit(struct request_queue *, struct elevator_queue *);
-int elv_register_queue(struct request_queue *q);
+int elv_register_queue(struct request_queue *q, bool uevent);
 void elv_unregister_queue(struct request_queue *q);
 
 static inline void elevator_exit(struct request_queue *q,
diff --git a/block/elevator.c b/block/elevator.c
index 7449a5836b52..68040c45ce13 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -470,13 +470,16 @@ static struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-int elv_register_queue(struct request_queue *q)
+/*
+ * elv_register_queue() is called from either blk_register_queue() or
+ * elevator_switch(); an elevator switch is prevented from happening
+ * in both paths, so it is safe not to hold q->sysfs_lock here.
+ */
+int elv_register_queue(struct request_queue *q, bool uevent)
 {
 	struct elevator_queue *e = q->elevator;
 	int error;
 
-	lockdep_assert_held(&q->sysfs_lock);
-
 	error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched");
 	if (!error) {
 		struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -487,24 +490,34 @@ int elv_register_queue(struct request_queue *q)
 				attr++;
 			}
 		}
-		kobject_uevent(&e->kobj, KOBJ_ADD);
+		if (uevent)
+			kobject_uevent(&e->kobj, KOBJ_ADD);
+
+		mutex_lock(&q->sysfs_lock);
 		e->registered = 1;
+		mutex_unlock(&q->sysfs_lock);
 	}
 	return error;
 }
 
+/*
+ * elv_unregister_queue() is called from either blk_unregister_queue() or
+ * elevator_switch(); an elevator switch is prevented from happening
+ * in both paths, so it is safe not to hold q->sysfs_lock here.
+ */
 void elv_unregister_queue(struct request_queue *q)
 {
-	lockdep_assert_held(&q->sysfs_lock);
-
 	if (q) {
 		struct elevator_queue *e = q->elevator;
 
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
+
+		mutex_lock(&q->sysfs_lock);
 		e->registered = 0;
 		/* Re-enable throttling in case elevator disabled it */
 		wbt_enable_default(q);
+		mutex_unlock(&q->sysfs_lock);
 	}
 }
 
@@ -567,10 +580,23 @@ int elevator_switch_mq(struct request_queue *q,
 	lockdep_assert_held(&q->sysfs_lock);
 
 	if (q->elevator) {
-		if (q->elevator->registered)
+		if (q->elevator->registered) {
+			mutex_unlock(&q->sysfs_lock);
+
 			elv_unregister_queue(q);
+
+			mutex_lock(&q->sysfs_lock);
+		}
 		ioc_clear_queue(q);
 		elevator_exit(q, q->elevator);
+
+		/*
+		 * sysfs_lock may have been dropped, so re-check whether
+		 * the queue has been unregistered. If so, don't switch
+		 * to the new elevator any more.
+		 */
+		if (!blk_queue_registered(q))
+			return 0;
 	}
 
 	ret = blk_mq_init_sched(q, new_e);
@@ -578,7 +604,11 @@ int elevator_switch_mq(struct request_queue *q,
 		goto out;
 
 	if (new_e) {
-		ret = elv_register_queue(q);
+		mutex_unlock(&q->sysfs_lock);
+
+		ret = elv_register_queue(q, true);
+
+		mutex_lock(&q->sysfs_lock);
 		if (ret) {
 			elevator_exit(q, q->elevator);
 			goto out;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6041755984f4..e271c3a176fa 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -539,6 +539,7 @@ struct request_queue {
 	struct delayed_work	requeue_work;
 
 	struct mutex		sysfs_lock;
+	struct mutex		sysfs_dir_lock;
 
 	/*
 	 * for reusing dead hctx instance in case of updating
-- 
2.20.1


* Re: [PATCH V2 2/6] block: don't hold q->sysfs_lock in elevator_init_mq
  2019-08-21  9:15 ` [PATCH V2 2/6] block: don't hold q->sysfs_lock in elevator_init_mq Ming Lei
@ 2019-08-21 15:51   ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2019-08-21 15:51 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Hannes Reinecke, Greg KH, Mike Snitzer

On 8/21/19 2:15 AM, Ming Lei wrote:
> The original comment says:
> 
> 	q->sysfs_lock must be held to provide mutual exclusion between
> 	elevator_switch() and here.
> 
> Which is simply wrong. elevator_init_mq() is only called from
> blk_mq_init_allocated_queue, which is always called before the request
> queue is registered via blk_register_queue(), for dm-rq or normal rq
> based driver. However, queue's kobject is just exposed added to sysfs
                                             ^^^^^^^^^^^^
                                             only?
> in blk_register_queue(). So there isn't such race between elevator_switch()
> and elevator_init_mq().
> 
> So avoid to hold q->sysfs_lock in elevator_init_mq().
[ ... ]
>   	/*
> -	 * q->sysfs_lock must be held to provide mutual exclusion between
> -	 * elevator_switch() and here.
> +	 * We are called from blk_mq_init_allocated_queue() only, at that
> +	 * time the request queue isn't registered yet, so the queue
> +	 * kobject isn't exposed to userspace. No need to worry about race
> +	 * with elevator_switch(), and no need to hold q->sysfs_lock.
>   	 */

How about replacing this comment with the following:

WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags));

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

* Re: [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
  2019-08-21  9:15 ` [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Ming Lei
@ 2019-08-21 15:53   ` Bart Van Assche
  2019-08-26  2:11     ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2019-08-21 15:53 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Hannes Reinecke, Greg KH, Mike Snitzer

On 8/21/19 2:15 AM, Ming Lei wrote:
> blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
> and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
> isn't exposed to userspace yet. For the latter caller, sysfs/debugfs
> is un-registered before updating nr_hw_queues.
> 
> On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
> request queue is freed") moves freeing hctx into queue's release
> handler, so there won't be race with queue release path too.
> 
> So don't hold q->sysfs_lock in blk_mq_map_swqueue().
> 
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 7 -------
>   1 file changed, 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6968de9d7402..b0ee0cac737f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2456,11 +2456,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>   	struct blk_mq_ctx *ctx;
>   	struct blk_mq_tag_set *set = q->tag_set;
>   
> -	/*
> -	 * Avoid others reading imcomplete hctx->cpumask through sysfs
> -	 */
> -	mutex_lock(&q->sysfs_lock);
> -
>   	queue_for_each_hw_ctx(q, hctx, i) {
>   		cpumask_clear(hctx->cpumask);
>   		hctx->nr_ctx = 0;
> @@ -2521,8 +2516,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>   					HCTX_TYPE_DEFAULT, i);
>   	}
>   
> -	mutex_unlock(&q->sysfs_lock);
> -
>   	queue_for_each_hw_ctx(q, hctx, i) {
>   		/*
>   		 * If no software queues are mapped to this hardware queue,
> 

How about adding WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, 
&q->queue_flags)) ?

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

* Re: [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs()
  2019-08-21  9:15 ` [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs() Ming Lei
@ 2019-08-21 15:56   ` Bart Van Assche
  2019-08-26  2:25     ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2019-08-21 15:56 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Hannes Reinecke, Greg KH, Mike Snitzer

On 8/21/19 2:15 AM, Ming Lei wrote:
> blk_mq_realloc_hw_ctxs() is called from blk_mq_init_allocated_queue()
> and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
> isn't exposed to userspace yet. For the latter caller, sysfs/debugfs
> is un-registered before updating nr_hw_queues.
> 
> On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
> request queue is freed") moves freeing hctx into queue's release
> handler, so there won't be race with queue release path too.
> 
> So don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs().

How about mentioning that the locking at the start of 
blk_mq_update_nr_hw_queues() serializes all blk_mq_realloc_hw_ctxs() 
calls that happen after a queue has been registered in sysfs?

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

* Re: [PATCH V2 5/6] block: add helper for checking if queue is registered
  2019-08-21  9:15 ` [PATCH V2 5/6] block: add helper for checking if queue is registered Ming Lei
@ 2019-08-21 15:57   ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2019-08-21 15:57 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Hannes Reinecke, Greg KH, Mike Snitzer

On 8/21/19 2:15 AM, Ming Lei wrote:
> There are 4 users which check if queue is registered, so add one helper
> to check it.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-21  9:15 ` [PATCH V2 6/6] block: split .sysfs_lock into two locks Ming Lei
@ 2019-08-21 16:18   ` Bart Van Assche
  2019-08-22  1:28     ` Ming Lei
  2019-08-23 16:46   ` Bart Van Assche
  1 sibling, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2019-08-21 16:18 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Hannes Reinecke, Greg KH, Mike Snitzer

On 8/21/19 2:15 AM, Ming Lei wrote:
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 31bbf10d8149..a4cc40ddda86 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -247,7 +247,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
>   	struct blk_mq_hw_ctx *hctx;
>   	int i;
>   
> -	lockdep_assert_held(&q->sysfs_lock);
> +	lockdep_assert_held(&q->sysfs_dir_lock);
>   
>   	queue_for_each_hw_ctx(q, hctx, i)
>   		blk_mq_unregister_hctx(hctx);
> @@ -297,7 +297,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
>   	int ret, i;
>   
>   	WARN_ON_ONCE(!q->kobj.parent);
> -	lockdep_assert_held(&q->sysfs_lock);
> +	lockdep_assert_held(&q->sysfs_dir_lock);
>   
>   	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
>   	if (ret < 0)

blk_mq_unregister_dev() and __blk_mq_register_dev() are only used by
blk_register_queue() and blk_unregister_queue(). It is the
responsibility of the callers of these functions to serialize request
queue registration and unregistration. Is it really necessary to hold a
mutex around the blk_mq_unregister_dev() and __blk_mq_register_dev()
calls? Or in other words, can it ever happen that multiple threads
invoke one or both functions concurrently?

> @@ -331,7 +331,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
>   	struct blk_mq_hw_ctx *hctx;
>   	int i;
>   
> -	mutex_lock(&q->sysfs_lock);
> +	mutex_lock(&q->sysfs_dir_lock);
>   	if (!q->mq_sysfs_init_done)
>   		goto unlock;
>   
> @@ -339,7 +339,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
>   		blk_mq_unregister_hctx(hctx);
>   
>   unlock:
> -	mutex_unlock(&q->sysfs_lock);
> +	mutex_unlock(&q->sysfs_dir_lock);
>   }
>   
>   int blk_mq_sysfs_register(struct request_queue *q)
> @@ -347,7 +347,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
>   	struct blk_mq_hw_ctx *hctx;
>   	int i, ret = 0;
>   
> -	mutex_lock(&q->sysfs_lock);
> +	mutex_lock(&q->sysfs_dir_lock);
>   	if (!q->mq_sysfs_init_done)
>   		goto unlock;
>   
> @@ -358,7 +358,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
>   	}
>   
>   unlock:
> -	mutex_unlock(&q->sysfs_lock);
> +	mutex_unlock(&q->sysfs_dir_lock);
>   
>   	return ret;
>   }

blk_mq_sysfs_unregister() and blk_mq_sysfs_register() are only used by 
__blk_mq_update_nr_hw_queues(). Calls to that function are serialized by 
the tag_list_lock mutex. Is it really necessary to use any locking 
inside these functions?

> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 5b0b5224cfd4..5941a0176f87 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -938,6 +938,7 @@ int blk_register_queue(struct gendisk *disk)
>   	int ret;
>   	struct device *dev = disk_to_dev(disk);
>   	struct request_queue *q = disk->queue;
> +	bool has_elevator = false;
>   
>   	if (WARN_ON(!q))
>   		return -ENXIO;
> @@ -945,7 +946,6 @@ int blk_register_queue(struct gendisk *disk)
>   	WARN_ONCE(blk_queue_registered(q),
>   		  "%s is registering an already registered queue\n",
>   		  kobject_name(&dev->kobj));
> -	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
>   
>   	/*
>   	 * SCSI probing may synchronously create and destroy a lot of
> @@ -966,7 +966,7 @@ int blk_register_queue(struct gendisk *disk)
>   		return ret;
>   
>   	/* Prevent changes through sysfs until registration is completed. */
> -	mutex_lock(&q->sysfs_lock);
> +	mutex_lock(&q->sysfs_dir_lock);
>   
>   	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
>   	if (ret < 0) {
> @@ -987,26 +987,37 @@ int blk_register_queue(struct gendisk *disk)
>   		blk_mq_debugfs_register(q);
>   	}
>   
> -	kobject_uevent(&q->kobj, KOBJ_ADD);
> -
> -	wbt_enable_default(q);
> -
> -	blk_throtl_register_queue(q);
> -
> +	/*
> +	 * The queue's kobject ADD uevent isn't sent out, also the
> +	 * flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
> +	 * switch won't happen at all.
> +	 */
>   	if (q->elevator) {
> -		ret = elv_register_queue(q);
> +		ret = elv_register_queue(q, false);
>   		if (ret) {
> -			mutex_unlock(&q->sysfs_lock);
> -			kobject_uevent(&q->kobj, KOBJ_REMOVE);
> +			mutex_unlock(&q->sysfs_dir_lock);
>   			kobject_del(&q->kobj);
>   			blk_trace_remove_sysfs(dev);
>   			kobject_put(&dev->kobj);
>   			return ret;
>   		}
> +		has_elevator = true;
>   	}
> +
> +	mutex_lock(&q->sysfs_lock);
> +	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
> +	wbt_enable_default(q);
> +	blk_throtl_register_queue(q);
> +	mutex_unlock(&q->sysfs_lock);
> +
> +	/* Now everything is ready and send out KOBJ_ADD uevent */
> +	kobject_uevent(&q->kobj, KOBJ_ADD);
> +	if (has_elevator)
> +		kobject_uevent(&q->elevator->kobj, KOBJ_ADD);
> +
>   	ret = 0;
>   unlock:
> -	mutex_unlock(&q->sysfs_lock);
> +	mutex_unlock(&q->sysfs_dir_lock);
>   	return ret;
>   }

My understanding is that the mutex_lock() / mutex_unlock() calls in this 
function are necessary today to prevent concurrent changes of the 
scheduler from this function and from sysfs. If the 
kobject_uevent(KOBJ_ADD) call is moved, does that mean that all 
mutex_lock() / mutex_unlock() calls can be left out from this function?

>   EXPORT_SYMBOL_GPL(blk_register_queue);
> @@ -1021,6 +1032,7 @@ EXPORT_SYMBOL_GPL(blk_register_queue);
>   void blk_unregister_queue(struct gendisk *disk)
>   {
>   	struct request_queue *q = disk->queue;
> +	bool has_elevator;
>   
>   	if (WARN_ON(!q))
>   		return;
> @@ -1035,25 +1047,25 @@ void blk_unregister_queue(struct gendisk *disk)
>   	 * concurrent elv_iosched_store() calls.
>   	 */
>   	mutex_lock(&q->sysfs_lock);
> -
>   	blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> +	has_elevator = !!q->elevator;
> +	mutex_unlock(&q->sysfs_lock);
>   
> +	mutex_lock(&q->sysfs_dir_lock);
>   	/*
>   	 * Remove the sysfs attributes before unregistering the queue data
>   	 * structures that can be modified through sysfs.
>   	 */
>   	if (queue_is_mq(q))
>   		blk_mq_unregister_dev(disk_to_dev(disk), q);
> -	mutex_unlock(&q->sysfs_lock);
>   
>   	kobject_uevent(&q->kobj, KOBJ_REMOVE);
>   	kobject_del(&q->kobj);
>   	blk_trace_remove_sysfs(disk_to_dev(disk));
>   
> -	mutex_lock(&q->sysfs_lock);
> -	if (q->elevator)
> +	if (has_elevator)
>   		elv_unregister_queue(q);
> -	mutex_unlock(&q->sysfs_lock);
> +	mutex_unlock(&q->sysfs_dir_lock);
>   
>   	kobject_put(&disk_to_dev(disk)->kobj);
>   }

If this function would call kobject_del(&q->kobj) before doing anything 
else, does that mean that all mutex_lock() / mutex_unlock() calls can be 
left out from this function?

Thanks,

Bart.

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-21 16:18   ` Bart Van Assche
@ 2019-08-22  1:28     ` Ming Lei
  2019-08-22 19:52       ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2019-08-22  1:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On Wed, Aug 21, 2019 at 09:18:08AM -0700, Bart Van Assche wrote:
> On 8/21/19 2:15 AM, Ming Lei wrote:
> > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > index 31bbf10d8149..a4cc40ddda86 100644
> > --- a/block/blk-mq-sysfs.c
> > +++ b/block/blk-mq-sysfs.c
> > @@ -247,7 +247,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
> >   	struct blk_mq_hw_ctx *hctx;
> >   	int i;
> > -	lockdep_assert_held(&q->sysfs_lock);
> > +	lockdep_assert_held(&q->sysfs_dir_lock);
> >   	queue_for_each_hw_ctx(q, hctx, i)
> >   		blk_mq_unregister_hctx(hctx);
> > @@ -297,7 +297,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
> >   	int ret, i;
> >   	WARN_ON_ONCE(!q->kobj.parent);
> > -	lockdep_assert_held(&q->sysfs_lock);
> > +	lockdep_assert_held(&q->sysfs_dir_lock);
> >   	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
> >   	if (ret < 0)
> 
> blk_mq_unregister_dev and __blk_mq_register_dev() are only used by
> blk_register_queue() and blk_unregister_queue(). It is the responsibility of
> the callers of these function to serialize request queue registration and
> unregistration. Is it really necessary to hold a mutex around the
> blk_mq_unregister_dev and __blk_mq_register_dev() calls? Or in other words,
> can it ever happen that multiple threads invoke one or both functions
> concurrently?

hctx kobjects can be removed and re-added via blk_mq_update_nr_hw_queues(),
which may be called at the same time as the queue is being registered or
unregistered.

Also, the change is simpler when a new lock replaces the old one.

> 
> > @@ -331,7 +331,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
> >   	struct blk_mq_hw_ctx *hctx;
> >   	int i;
> > -	mutex_lock(&q->sysfs_lock);
> > +	mutex_lock(&q->sysfs_dir_lock);
> >   	if (!q->mq_sysfs_init_done)
> >   		goto unlock;
> > @@ -339,7 +339,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
> >   		blk_mq_unregister_hctx(hctx);
> >   unlock:
> > -	mutex_unlock(&q->sysfs_lock);
> > +	mutex_unlock(&q->sysfs_dir_lock);
> >   }
> >   int blk_mq_sysfs_register(struct request_queue *q)
> > @@ -347,7 +347,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
> >   	struct blk_mq_hw_ctx *hctx;
> >   	int i, ret = 0;
> > -	mutex_lock(&q->sysfs_lock);
> > +	mutex_lock(&q->sysfs_dir_lock);
> >   	if (!q->mq_sysfs_init_done)
> >   		goto unlock;
> > @@ -358,7 +358,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
> >   	}
> >   unlock:
> > -	mutex_unlock(&q->sysfs_lock);
> > +	mutex_unlock(&q->sysfs_dir_lock);
> >   	return ret;
> >   }
> 
> blk_mq_sysfs_unregister() and blk_mq_sysfs_register() are only used by
> __blk_mq_update_nr_hw_queues(). Calls to that function are serialized by the
> tag_list_lock mutex. Is it really necessary to use any locking inside these
> functions?

hctx kobjects can be removed and re-added via blk_mq_update_nr_hw_queues(),
which may be called at the same time as the queue is being registered or
unregistered.

Also, the change is simpler when a new lock replaces the old one.

> 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 5b0b5224cfd4..5941a0176f87 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -938,6 +938,7 @@ int blk_register_queue(struct gendisk *disk)
> >   	int ret;
> >   	struct device *dev = disk_to_dev(disk);
> >   	struct request_queue *q = disk->queue;
> > +	bool has_elevator = false;
> >   	if (WARN_ON(!q))
> >   		return -ENXIO;
> > @@ -945,7 +946,6 @@ int blk_register_queue(struct gendisk *disk)
> >   	WARN_ONCE(blk_queue_registered(q),
> >   		  "%s is registering an already registered queue\n",
> >   		  kobject_name(&dev->kobj));
> > -	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
> >   	/*
> >   	 * SCSI probing may synchronously create and destroy a lot of
> > @@ -966,7 +966,7 @@ int blk_register_queue(struct gendisk *disk)
> >   		return ret;
> >   	/* Prevent changes through sysfs until registration is completed. */
> > -	mutex_lock(&q->sysfs_lock);
> > +	mutex_lock(&q->sysfs_dir_lock);
> >   	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
> >   	if (ret < 0) {
> > @@ -987,26 +987,37 @@ int blk_register_queue(struct gendisk *disk)
> >   		blk_mq_debugfs_register(q);
> >   	}
> > -	kobject_uevent(&q->kobj, KOBJ_ADD);
> > -
> > -	wbt_enable_default(q);
> > -
> > -	blk_throtl_register_queue(q);
> > -
> > +	/*
> > +	 * The queue's kobject ADD uevent isn't sent out, also the
> > +	 * flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
> > +	 * switch won't happen at all.
> > +	 */
> >   	if (q->elevator) {
> > -		ret = elv_register_queue(q);
> > +		ret = elv_register_queue(q, false);
> >   		if (ret) {
> > -			mutex_unlock(&q->sysfs_lock);
> > -			kobject_uevent(&q->kobj, KOBJ_REMOVE);
> > +			mutex_unlock(&q->sysfs_dir_lock);
> >   			kobject_del(&q->kobj);
> >   			blk_trace_remove_sysfs(dev);
> >   			kobject_put(&dev->kobj);
> >   			return ret;
> >   		}
> > +		has_elevator = true;
> >   	}
> > +
> > +	mutex_lock(&q->sysfs_lock);
> > +	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
> > +	wbt_enable_default(q);
> > +	blk_throtl_register_queue(q);
> > +	mutex_unlock(&q->sysfs_lock);
> > +
> > +	/* Now everything is ready and send out KOBJ_ADD uevent */
> > +	kobject_uevent(&q->kobj, KOBJ_ADD);
> > +	if (has_elevator)
> > +		kobject_uevent(&q->elevator->kobj, KOBJ_ADD);
> > +
> >   	ret = 0;
> >   unlock:
> > -	mutex_unlock(&q->sysfs_lock);
> > +	mutex_unlock(&q->sysfs_dir_lock);
> >   	return ret;
> >   }
> 
> My understanding is that the mutex_lock() / mutex_unlock() calls in this
> function are necessary today to prevent concurrent changes of the scheduler
> from this function and from sysfs. If the kobject_uevent(KOBJ_ADD) call is
> moved, does that mean that all mutex_lock() / mutex_unlock() calls can be
> left out from this function?


hctx kobjects can be removed and re-added via blk_mq_update_nr_hw_queues(),
which may be called at the same time as the queue is being registered or
unregistered.

Also, the change is simpler when a new lock replaces the old one.

> 
> >   EXPORT_SYMBOL_GPL(blk_register_queue);
> > @@ -1021,6 +1032,7 @@ EXPORT_SYMBOL_GPL(blk_register_queue);
> >   void blk_unregister_queue(struct gendisk *disk)
> >   {
> >   	struct request_queue *q = disk->queue;
> > +	bool has_elevator;
> >   	if (WARN_ON(!q))
> >   		return;
> > @@ -1035,25 +1047,25 @@ void blk_unregister_queue(struct gendisk *disk)
> >   	 * concurrent elv_iosched_store() calls.
> >   	 */
> >   	mutex_lock(&q->sysfs_lock);
> > -
> >   	blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> > +	has_elevator = !!q->elevator;
> > +	mutex_unlock(&q->sysfs_lock);
> > +	mutex_lock(&q->sysfs_dir_lock);
> >   	/*
> >   	 * Remove the sysfs attributes before unregistering the queue data
> >   	 * structures that can be modified through sysfs.
> >   	 */
> >   	if (queue_is_mq(q))
> >   		blk_mq_unregister_dev(disk_to_dev(disk), q);
> > -	mutex_unlock(&q->sysfs_lock);
> >   	kobject_uevent(&q->kobj, KOBJ_REMOVE);
> >   	kobject_del(&q->kobj);
> >   	blk_trace_remove_sysfs(disk_to_dev(disk));
> > -	mutex_lock(&q->sysfs_lock);
> > -	if (q->elevator)
> > +	if (has_elevator)
> >   		elv_unregister_queue(q);
> > -	mutex_unlock(&q->sysfs_lock);
> > +	mutex_unlock(&q->sysfs_dir_lock);
> >   	kobject_put(&disk_to_dev(disk)->kobj);
> >   }
> 
> If this function would call kobject_del(&q->kobj) before doing anything
> else, does that mean that all mutex_lock() / mutex_unlock() calls can be
> left out from this function?

As I mentioned above, we need to synchronize registering/unregistering the
queue against updating nr_hw_queues, so sysfs_dir_lock is needed.
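
A hypothetical interleaving of the kind sysfs_dir_lock guards against
(illustrative only, not an observed trace):

	CPU0: blk_unregister_queue()      CPU1: blk_mq_update_nr_hw_queues()
	  blk_mq_unregister_dev()
	    kobject_del(&hctx->kobj)
	                                    blk_mq_sysfs_register(q)
	                                      kobject_add(&hctx->kobj)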

Thanks, 
Ming

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-22  1:28     ` Ming Lei
@ 2019-08-22 19:52       ` Bart Van Assche
  2019-08-23  1:08         ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2019-08-22 19:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On 8/21/19 6:28 PM, Ming Lei wrote:
> On Wed, Aug 21, 2019 at 09:18:08AM -0700, Bart Van Assche wrote:
>> On 8/21/19 2:15 AM, Ming Lei wrote:
>>> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
>>> index 31bbf10d8149..a4cc40ddda86 100644
>>> --- a/block/blk-mq-sysfs.c
>>> +++ b/block/blk-mq-sysfs.c
>>> @@ -247,7 +247,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
>>>    	struct blk_mq_hw_ctx *hctx;
>>>    	int i;
>>> -	lockdep_assert_held(&q->sysfs_lock);
>>> +	lockdep_assert_held(&q->sysfs_dir_lock);
>>>    	queue_for_each_hw_ctx(q, hctx, i)
>>>    		blk_mq_unregister_hctx(hctx);
>>> @@ -297,7 +297,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
>>>    	int ret, i;
>>>    	WARN_ON_ONCE(!q->kobj.parent);
>>> -	lockdep_assert_held(&q->sysfs_lock);
>>> +	lockdep_assert_held(&q->sysfs_dir_lock);
>>>    	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
>>>    	if (ret < 0)
>>
>> blk_mq_unregister_dev and __blk_mq_register_dev() are only used by
>> blk_register_queue() and blk_unregister_queue(). It is the responsibility of
>> the callers of these function to serialize request queue registration and
>> unregistration. Is it really necessary to hold a mutex around the
>> blk_mq_unregister_dev and __blk_mq_register_dev() calls? Or in other words,
>> can it ever happen that multiple threads invoke one or both functions
>> concurrently?
> 
> hctx kobjects can be removed and re-added via blk_mq_update_nr_hw_queues()
> which may be called at the same time when queue is registering or
> un-registering.

Shouldn't blk_register_queue() and blk_unregister_queue() be serialized 
against blk_mq_update_nr_hw_queues()? Allowing these calls to proceed 
concurrently complicates the block layer and makes the block layer code 
harder to review than necessary. I don't think that it would help any 
block driver to allow these calls to proceed concurrently.

Thanks,

Bart.

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-22 19:52       ` Bart Van Assche
@ 2019-08-23  1:08         ` Ming Lei
  2019-08-23 16:36           ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2019-08-23  1:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On Thu, Aug 22, 2019 at 12:52:54PM -0700, Bart Van Assche wrote:
> On 8/21/19 6:28 PM, Ming Lei wrote:
> > On Wed, Aug 21, 2019 at 09:18:08AM -0700, Bart Van Assche wrote:
> > > On 8/21/19 2:15 AM, Ming Lei wrote:
> > > > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > > > index 31bbf10d8149..a4cc40ddda86 100644
> > > > --- a/block/blk-mq-sysfs.c
> > > > +++ b/block/blk-mq-sysfs.c
> > > > @@ -247,7 +247,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
> > > >    	struct blk_mq_hw_ctx *hctx;
> > > >    	int i;
> > > > -	lockdep_assert_held(&q->sysfs_lock);
> > > > +	lockdep_assert_held(&q->sysfs_dir_lock);
> > > >    	queue_for_each_hw_ctx(q, hctx, i)
> > > >    		blk_mq_unregister_hctx(hctx);
> > > > @@ -297,7 +297,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
> > > >    	int ret, i;
> > > >    	WARN_ON_ONCE(!q->kobj.parent);
> > > > -	lockdep_assert_held(&q->sysfs_lock);
> > > > +	lockdep_assert_held(&q->sysfs_dir_lock);
> > > >    	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
> > > >    	if (ret < 0)
> > > 
> > > blk_mq_unregister_dev() and __blk_mq_register_dev() are only used by
> > > blk_register_queue() and blk_unregister_queue(). It is the responsibility of
> > > the callers of these functions to serialize request queue registration and
> > > unregistration. Is it really necessary to hold a mutex around the
> > > blk_mq_unregister_dev() and __blk_mq_register_dev() calls? Or, in other
> > > words, can it ever happen that multiple threads invoke one or both functions
> > > concurrently?
> > 
> > hctx kobjects can be removed and re-added via blk_mq_update_nr_hw_queues(),
> > which may be called at the same time as the queue is being registered or
> > unregistered.
> 
> Shouldn't blk_register_queue() and blk_unregister_queue() be serialized
> against blk_mq_update_nr_hw_queues()? Allowing these calls to proceed

That is easier said than done. We depend on the callers to synchronize
blk_register_queue() and blk_unregister_queue(), and several locks are
already involved in blk_mq_update_nr_hw_queues().

Today that synchronization is done via .sysfs_lock, and so far no issues
have been seen in this area. This patch just converts .sysfs_lock into
.sysfs_dir_lock for the same purpose.

If you have a simple, workable patch to serialize blk_register_queue() and
blk_unregister_queue() against blk_mq_update_nr_hw_queues(), I am happy to
review it. Otherwise please consider doing it in the future; it shouldn't be
a blocker for fixing this deadlock, should it?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread
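
For reference, the division of labor after the split: attribute .show/.store
calls keep serializing on .sysfs_lock, while the kobject add/remove paths
(like the unregister sketch earlier in this thread) move to .sysfs_dir_lock.
A condensed sketch of the store-side dispatcher from blk-sysfs.c, which the
series leaves on .sysfs_lock (lifetime/error checks elided):

	static ssize_t
	queue_attr_store(struct kobject *kobj, struct attribute *attr,
			 const char *page, size_t length)
	{
		struct queue_sysfs_entry *entry = to_queue(attr);
		struct request_queue *q =
			container_of(kobj, struct request_queue, kobj);
		ssize_t res;

		if (!entry->store)
			return -EIO;

		/* sysfs_lock still covers every .store/.show callback */
		mutex_lock(&q->sysfs_lock);
		res = entry->store(q, page, length);
		mutex_unlock(&q->sysfs_lock);
		return res;
	}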

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-23  1:08         ` Ming Lei
@ 2019-08-23 16:36           ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2019-08-23 16:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On 8/22/19 6:08 PM, Ming Lei wrote:
> On Thu, Aug 22, 2019 at 12:52:54PM -0700, Bart Van Assche wrote:
>> Shouldn't blk_register_queue() and blk_unregister_queue() be serialized
>> against blk_mq_update_nr_hw_queues()? Allowing these calls to proceed
> 
> That is easier said than done. We depend on the callers to synchronize
> blk_register_queue() and blk_unregister_queue(), and several locks are
> already involved in blk_mq_update_nr_hw_queues().
> 
> Today that synchronization is done via .sysfs_lock, and so far no issues
> have been seen in this area. This patch just converts .sysfs_lock into
> .sysfs_dir_lock for the same purpose.
> 
> If you have a simple, workable patch to serialize blk_register_queue() and
> blk_unregister_queue() against blk_mq_update_nr_hw_queues(), I am happy to
> review it. Otherwise please consider doing it in the future; it shouldn't be
> a blocker for fixing this deadlock, should it?

Since what I requested would result in I/O scheduler changes being 
serialized across request queues, let's keep this for a later time.

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-21  9:15 ` [PATCH V2 6/6] block: split .sysfs_lock into two locks Ming Lei
  2019-08-21 16:18   ` Bart Van Assche
@ 2019-08-23 16:46   ` Bart Van Assche
  2019-08-23 22:49     ` Ming Lei
  1 sibling, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2019-08-23 16:46 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Hannes Reinecke, Greg KH, Mike Snitzer

On 8/21/19 2:15 AM, Ming Lei wrote:
> @@ -966,7 +966,7 @@ int blk_register_queue(struct gendisk *disk)
>   		return ret;
>   
>   	/* Prevent changes through sysfs until registration is completed. */
> -	mutex_lock(&q->sysfs_lock);
> +	mutex_lock(&q->sysfs_dir_lock);
>   
>   	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
>   	if (ret < 0) {
> @@ -987,26 +987,37 @@ int blk_register_queue(struct gendisk *disk)
>   		blk_mq_debugfs_register(q);
>   	}
>   
> -	kobject_uevent(&q->kobj, KOBJ_ADD);
> -
> -	wbt_enable_default(q);
> -
> -	blk_throtl_register_queue(q);
> -
> +	/*
> +	 * The queue's kobject ADD uevent isn't sent out, also the
> +	 * flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
> +	 * switch won't happen at all.
> +	 */
>   	if (q->elevator) {
> -		ret = elv_register_queue(q);
> +		ret = elv_register_queue(q, false);
>   		if (ret) {

The above changes seem risky to me. In contrast with what the comment 
suggests, user space code is not required to wait for the KOBJ_ADD event 
before it starts using sysfs attributes. I think user space code *can* 
write into the request queue's I/O scheduler sysfs attribute after the 
kobject_add() call has finished and before kobject_uevent(&q->kobj, 
KOBJ_ADD) is called.

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V2 6/6] block: split .sysfs_lock into two locks
  2019-08-23 16:46   ` Bart Van Assche
@ 2019-08-23 22:49     ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2019-08-23 22:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On Fri, Aug 23, 2019 at 09:46:48AM -0700, Bart Van Assche wrote:
> On 8/21/19 2:15 AM, Ming Lei wrote:
> > @@ -966,7 +966,7 @@ int blk_register_queue(struct gendisk *disk)
> >   		return ret;
> >   	/* Prevent changes through sysfs until registration is completed. */
> > -	mutex_lock(&q->sysfs_lock);
> > +	mutex_lock(&q->sysfs_dir_lock);
> >   	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
> >   	if (ret < 0) {
> > @@ -987,26 +987,37 @@ int blk_register_queue(struct gendisk *disk)
> >   		blk_mq_debugfs_register(q);
> >   	}
> > -	kobject_uevent(&q->kobj, KOBJ_ADD);
> > -
> > -	wbt_enable_default(q);
> > -
> > -	blk_throtl_register_queue(q);
> > -
> > +	/*
> > +	 * The queue's kobject ADD uevent isn't sent out, also the
> > +	 * flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
> > +	 * switch won't happen at all.
> > +	 */
> >   	if (q->elevator) {
> > -		ret = elv_register_queue(q);
> > +		ret = elv_register_queue(q, false);
> >   		if (ret) {
> 
> The above changes seem risky to me. In contrast with what the comment
> suggests, user space code is not required to wait for the KOBJ_ADD event
> before it starts using sysfs attributes. I think user space code *can* write
> into the request queue's I/O scheduler sysfs attribute after the kobject_add()
> call has finished and before kobject_uevent(&q->kobj, KOBJ_ADD) is called.

Yeah, a crazy userspace may simply poll on the sysfs entries and start to
READ/WRITE before seeing the KOBJ_ADD event.

However, we have another protection via the queue flag QUEUE_FLAG_REGISTERED,
which is set only after everything is done. So if an early store from
userspace arrives, the elevator switch still can't happen, because the flag
is checked in __elevator_change().

thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread
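
The "another protection" Ming mentions is the registered check at the top of
the elevator-switch path. A condensed sketch of that guard, approximating
the shape after this series is applied -- the name lookup and switch details
are compressed here and may differ from the exact tree:

	static int __elevator_change(struct request_queue *q, const char *name)
	{
		struct elevator_type *e;

		/*
		 * Reject early stores: QUEUE_FLAG_REGISTERED is set only
		 * once blk_register_queue() has completely finished, so a
		 * write that sneaks in between kobject_add() and the
		 * KOBJ_ADD uevent lands here and fails.
		 */
		if (!blk_queue_registered(q))
			return -ENOENT;

		e = elevator_get(q, name, true);
		if (!e)
			return -EINVAL;

		return elevator_switch(q, e);
	}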

* Re: [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
  2019-08-21 15:53   ` Bart Van Assche
@ 2019-08-26  2:11     ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2019-08-26  2:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On Wed, Aug 21, 2019 at 08:53:52AM -0700, Bart Van Assche wrote:
> On 8/21/19 2:15 AM, Ming Lei wrote:
> > blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
> > and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
> > isn't exposed to userspace yet. For the latter caller, sysfs/debugfs
> > is unregistered before nr_hw_queues is updated.
> > 
> > On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
> > request queue is freed") moves freeing of the hctx into the queue's
> > release handler, so there is no race with the queue release path either.
> > 
> > So don't hold q->sysfs_lock in blk_mq_map_swqueue().
> > 
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Greg KH <gregkh@linuxfoundation.org>
> > Cc: Mike Snitzer <snitzer@redhat.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c | 7 -------
> >   1 file changed, 7 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 6968de9d7402..b0ee0cac737f 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2456,11 +2456,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> >   	struct blk_mq_ctx *ctx;
> >   	struct blk_mq_tag_set *set = q->tag_set;
> > -	/*
> > -	 * Avoid others reading imcomplete hctx->cpumask through sysfs
> > -	 */
> > -	mutex_lock(&q->sysfs_lock);
> > -
> >   	queue_for_each_hw_ctx(q, hctx, i) {
> >   		cpumask_clear(hctx->cpumask);
> >   		hctx->nr_ctx = 0;
> > @@ -2521,8 +2516,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> >   					HCTX_TYPE_DEFAULT, i);
> >   	}
> > -	mutex_unlock(&q->sysfs_lock);
> > -
> >   	queue_for_each_hw_ctx(q, hctx, i) {
> >   		/*
> >   		 * If no software queues are mapped to this hardware queue,
> > 
> 
> How about adding WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED,
> &q->queue_flags)) ?

q->kobj isn't unregistered before nr_hw_queues is updated; only hctx->kobj
is unregistered, so we can't add that WARN here.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread
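
For reference, the helper this WARN/flag discussion leans on is added by
patch 5/6 as a one-line test of the queue flag (in include/linux/blkdev.h):

	#define blk_queue_registered(q)	\
		test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)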

* Re: [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs()
  2019-08-21 15:56   ` Bart Van Assche
@ 2019-08-26  2:25     ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2019-08-26  2:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Greg KH, Mike Snitzer

On Wed, Aug 21, 2019 at 08:56:36AM -0700, Bart Van Assche wrote:
> On 8/21/19 2:15 AM, Ming Lei wrote:
> > blk_mq_realloc_hw_ctxs() is called from blk_mq_init_allocated_queue()
> > and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
> > isn't exposed to userspace yet. For the latter caller, sysfs/debugfs
> > is unregistered before nr_hw_queues is updated.
> > 
> > On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
> > request queue is freed") moves freeing of the hctx into the queue's
> > release handler, so there is no race with the queue release path either.
> > 
> > So don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs().
> 
> How about mentioning that the locking at the start of
> blk_mq_update_nr_hw_queues() serializes all blk_mq_realloc_hw_ctxs() calls
> that happen after a queue has been registered in sysfs?

This patch is actually wrong, because an elevator switch may still happen
while nr_hw_queues is being updated: only the hctx sysfs entries are
unregistered, and "queue/scheduler" remains visible to userspace.

So I will drop this patch in V3.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-08-26  2:25 UTC | newest]

Thread overview: 20+ messages
2019-08-21  9:15 [PATCH V2 0/6] block: don't acquire .sysfs_lock before removing mq & iosched kobjects Ming Lei
2019-08-21  9:15 ` [PATCH V2 1/6] block: Remove blk_mq_register_dev() Ming Lei
2019-08-21  9:15 ` [PATCH V2 2/6] block: don't hold q->sysfs_lock in elevator_init_mq Ming Lei
2019-08-21 15:51   ` Bart Van Assche
2019-08-21  9:15 ` [PATCH V2 3/6] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Ming Lei
2019-08-21 15:53   ` Bart Van Assche
2019-08-26  2:11     ` Ming Lei
2019-08-21  9:15 ` [PATCH V2 4/6] blk-mq: don't hold q->sysfs_lock in blk_mq_realloc_hw_ctxs() Ming Lei
2019-08-21 15:56   ` Bart Van Assche
2019-08-26  2:25     ` Ming Lei
2019-08-21  9:15 ` [PATCH V2 5/6] block: add helper for checking if queue is registered Ming Lei
2019-08-21 15:57   ` Bart Van Assche
2019-08-21  9:15 ` [PATCH V2 6/6] block: split .sysfs_lock into two locks Ming Lei
2019-08-21 16:18   ` Bart Van Assche
2019-08-22  1:28     ` Ming Lei
2019-08-22 19:52       ` Bart Van Assche
2019-08-23  1:08         ` Ming Lei
2019-08-23 16:36           ` Bart Van Assche
2019-08-23 16:46   ` Bart Van Assche
2019-08-23 22:49     ` Ming Lei
