stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [stable-4.19 0/8] Some misc fixes
@ 2021-02-03 13:20 Jack Wang
  2021-02-03 13:20 ` [stable-4.19 1/8] block: fix NULL pointer dereference in register_disk Jack Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable

Hi Greg, hi Sasha,

Please consider to include following fixes in to stable tree.
The 6 patches from Ming was fixing a deadlock, they are included around kernel
5.3/4. Would be good to included in 4.19, we hit it during testing, with the fix
we no longer hit the deadlock.

The md fixes is for a panic regarding handle flush.

the last one is simple NULL pointer deref fix.

Thanks!
Jack Wang @ IONOS Cloud.

Ming Lei (6):
  block: don't hold q->sysfs_lock in elevator_init_mq
  blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
  block: add helper for checking if queue is registered
  block: split .sysfs_lock into two locks
  block: fix race between switching elevator and removing queues
  block: don't release queue's sysfs lock during switching elevator

Xiao Ni (1):
  md: Set prev_flush_start and flush_bio in an atomic way

zhengbin (1):
  block: fix NULL pointer dereference in register_disk

 block/blk-core.c       |  1 +
 block/blk-mq-sysfs.c   | 12 +++++------
 block/blk-mq.c         |  7 ------
 block/blk-sysfs.c      | 49 +++++++++++++++++++++++++++---------------
 block/blk-wbt.c        |  2 +-
 block/blk.h            |  2 +-
 block/elevator.c       | 44 +++++++++++++++++++++----------------
 block/genhd.c          | 10 +++++----
 drivers/md/md.c        |  2 ++
 include/linux/blkdev.h |  2 ++
 10 files changed, 76 insertions(+), 55 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [stable-4.19 1/8] block: fix NULL pointer dereference in register_disk
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 2/8] block: don't hold q->sysfs_lock in elevator_init_mq Jack Wang
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable; +Cc: zhengbin, Jens Axboe

From: zhengbin <zhengbin13@huawei.com>

If __device_add_disk-->bdi_register_owner-->bdi_register-->
bdi_register_va-->device_create_vargs fails, bdi->dev is still
NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
This patch fixes that.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 4d7c1d3fd7c7eda7dea351f071945e843a46c145)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/genhd.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 2b536ab570ac..6965dde96373 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -652,10 +652,12 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 		kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
 	disk_part_iter_exit(&piter);
 
-	err = sysfs_create_link(&ddev->kobj,
-				&disk->queue->backing_dev_info->dev->kobj,
-				"bdi");
-	WARN_ON(err);
+	if (disk->queue->backing_dev_info->dev) {
+		err = sysfs_create_link(&ddev->kobj,
+			  &disk->queue->backing_dev_info->dev->kobj,
+			  "bdi");
+		WARN_ON(err);
+	}
 }
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 2/8] block: don't hold q->sysfs_lock in elevator_init_mq
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
  2021-02-03 13:20 ` [stable-4.19 1/8] block: fix NULL pointer dereference in register_disk Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 3/8] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Jack Wang
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable
  Cc: Ming Lei, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Bart Van Assche, Damien Le Moal, Jens Axboe

From: Ming Lei <ming.lei@redhat.com>

The original comment says:

	q->sysfs_lock must be held to provide mutual exclusion between
	elevator_switch() and here.

Which is simply wrong. elevator_init_mq() is only called from
blk_mq_init_allocated_queue, which is always called before the request
queue is registered via blk_register_queue(), for dm-rq or normal rq
based driver. However, queue's kobject is only exposed and added to sysfs
in blk_register_queue(). So there isn't such race between elevator_switch()
and elevator_init_mq().

So avoid to hold q->sysfs_lock in elevator_init_mq().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit c48dac137a62a5d6fa1ef3fa445cbd9c43655a76)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/elevator.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index fae58b2f906f..ddbcd36616a8 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -980,23 +980,19 @@ int elevator_init_mq(struct request_queue *q)
 	if (q->nr_hw_queues != 1)
 		return 0;
 
-	/*
-	 * q->sysfs_lock must be held to provide mutual exclusion between
-	 * elevator_switch() and here.
-	 */
-	mutex_lock(&q->sysfs_lock);
+	WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags));
+
 	if (unlikely(q->elevator))
-		goto out_unlock;
+		goto out;
 
 	e = elevator_get(q, "mq-deadline", false);
 	if (!e)
-		goto out_unlock;
+		goto out;
 
 	err = blk_mq_init_sched(q, e);
 	if (err)
 		elevator_put(e);
-out_unlock:
-	mutex_unlock(&q->sysfs_lock);
+out:
 	return err;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 3/8] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
  2021-02-03 13:20 ` [stable-4.19 1/8] block: fix NULL pointer dereference in register_disk Jack Wang
  2021-02-03 13:20 ` [stable-4.19 2/8] block: don't hold q->sysfs_lock in elevator_init_mq Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 4/8] block: add helper for checking if queue is registered Jack Wang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable
  Cc: Ming Lei, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Bart Van Assche, Jens Axboe

From: Ming Lei <ming.lei@redhat.com>

blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
isn't exposed to userspace yet. For the latter caller, hctx sysfs entries
and debugfs are un-registered before updating nr_hw_queues.

On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
request queue is freed") moves freeing hctx into queue's release
handler, so there won't be race with queue release path too.

So don't hold q->sysfs_lock in blk_mq_map_swqueue().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit c6ba933358f0d7a6a042b894dba20cc70396a6d3)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/blk-mq.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0df43515ff94..195526b93895 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2324,11 +2324,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	/*
-	 * Avoid others reading imcomplete hctx->cpumask through sysfs
-	 */
-	mutex_lock(&q->sysfs_lock);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
@@ -2362,8 +2357,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
 
-	mutex_unlock(&q->sysfs_lock);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		/*
 		 * If no software queues are mapped to this hardware queue,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 4/8] block: add helper for checking if queue is registered
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
                   ` (2 preceding siblings ...)
  2021-02-03 13:20 ` [stable-4.19 3/8] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 5/8] block: split .sysfs_lock into two locks Jack Wang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable
  Cc: Ming Lei, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Bart Van Assche, Jens Axboe

From: Ming Lei <ming.lei@redhat.com>

There are 4 users which check if queue is registered, so add one helper
to check it.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 58c898ba370e68d39470cd0d932b524682c1f9be)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/blk-sysfs.c      | 4 ++--
 block/blk-wbt.c        | 2 +-
 block/elevator.c       | 2 +-
 include/linux/blkdev.h | 1 +
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 8286640d4d66..0a7636d24563 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -896,7 +896,7 @@ int blk_register_queue(struct gendisk *disk)
 	if (WARN_ON(!q))
 		return -ENXIO;
 
-	WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags),
+	WARN_ONCE(blk_queue_registered(q),
 		  "%s is registering an already registered queue\n",
 		  kobject_name(&dev->kobj));
 	queue_flag_set_unlocked(QUEUE_FLAG_REGISTERED, q);
@@ -973,7 +973,7 @@ void blk_unregister_queue(struct gendisk *disk)
 		return;
 
 	/* Return early if disk->queue was never registered. */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return;
 
 	/*
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f1de8ba483a9..50f2abfa1a60 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -708,7 +708,7 @@ void wbt_enable_default(struct request_queue *q)
 		return;
 
 	/* Queue not registered? Maybe shutting down... */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return;
 
 	if ((q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ)) ||
diff --git a/block/elevator.c b/block/elevator.c
index ddbcd36616a8..9bffe4558929 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1083,7 +1083,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	struct elevator_type *e;
 
 	/* Make sure queue is not in the middle of being removed */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return -ENOENT;
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 745b2d0dcf78..3a2b34c2c82b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -743,6 +743,7 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
 #define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
+#define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 5/8] block: split .sysfs_lock into two locks
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
                   ` (3 preceding siblings ...)
  2021-02-03 13:20 ` [stable-4.19 4/8] block: add helper for checking if queue is registered Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 6/8] block: fix race between switching elevator and removing queues Jack Wang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable
  Cc: Ming Lei, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Bart Van Assche, Jens Axboe

From: Ming Lei <ming.lei@redhat.com>

The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.

However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This way causes AB-BA lock because the kernfs built-in lock of
'kn-count' is required inside kobject_del() too, see the lockdep warning[1].

On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing REGISTERED flag prevents storing to 'queue/scheduler'
from being happened. Also sysfs write(store) is exclusive, so no
necessary to hold the lock for elv_unregister_queue() when it is
called in switching elevator path.

So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.

sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid to hold .sysfs_lock during removing/adding kobjects.

[1]  lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           __mutex_lock+0x14a/0xa9b
           blk_mq_hw_sysfs_show+0x63/0xb6
           sysfs_kf_seq_show+0x11f/0x196
           seq_read+0x2cd/0x5f2
           vfs_read+0xc7/0x18c
           ksys_read+0xc4/0x13e
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
           check_prev_add+0x5d2/0xc45
           validate_chain+0xed3/0xf94
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           __kernfs_remove+0x237/0x40b
           kernfs_remove_by_name_ns+0x59/0x72
           remove_files+0x61/0x96
           sysfs_remove_group+0x81/0xa4
           sysfs_remove_groups+0x3b/0x44
           kobject_del+0x44/0x94
           blk_mq_unregister_dev+0x83/0xdd
           blk_unregister_queue+0xa0/0x10b
           del_gendisk+0x259/0x3fa
           null_del_dev+0x8b/0x1c3 [null_blk]
           null_exit+0x5c/0x95 [null_blk]
           __se_sys_delete_module+0x204/0x337
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&q->sysfs_lock);
                                   lock(kn->count#202);
                                   lock(&q->sysfs_lock);
      lock(kn->count#202);

     *** DEADLOCK ***

    2 locks held by rmmod/777:
     #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
     #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
     dump_stack+0x9a/0xe6
     check_noncircular+0x207/0x251
     ? print_circular_bug+0x32a/0x32a
     ? find_usage_backwards+0x84/0xb0
     check_prev_add+0x5d2/0xc45
     validate_chain+0xed3/0xf94
     ? check_prev_add+0xc45/0xc45
     ? mark_lock+0x11b/0x804
     ? check_usage_forwards+0x1ca/0x1ca
     __lock_acquire+0x95f/0xa2f
     lock_acquire+0x1b4/0x1e8
     ? kernfs_remove_by_name_ns+0x59/0x72
     __kernfs_remove+0x237/0x40b
     ? kernfs_remove_by_name_ns+0x59/0x72
     ? kernfs_next_descendant_post+0x7d/0x7d
     ? strlen+0x10/0x23
     ? strcmp+0x22/0x44
     kernfs_remove_by_name_ns+0x59/0x72
     remove_files+0x61/0x96
     sysfs_remove_group+0x81/0xa4
     sysfs_remove_groups+0x3b/0x44
     kobject_del+0x44/0x94
     blk_mq_unregister_dev+0x83/0xdd
     blk_unregister_queue+0xa0/0x10b
     del_gendisk+0x259/0x3fa
     ? disk_events_poll_msecs_store+0x12b/0x12b
     ? check_flags+0x1ea/0x204
     ? mark_held_locks+0x1f/0x7a
     null_del_dev+0x8b/0x1c3 [null_blk]
     null_exit+0x5c/0x95 [null_blk]
     __se_sys_delete_module+0x204/0x337
     ? free_module+0x39f/0x39f
     ? blkcg_maybe_throttle_current+0x8a/0x718
     ? rwlock_bug+0x62/0x62
     ? __blkcg_punt_bio_submit+0xd0/0xd0
     ? trace_hardirqs_on_thunk+0x1a/0x20
     ? mark_held_locks+0x1f/0x7a
     ? do_syscall_64+0x4c/0x295
     do_syscall_64+0xa7/0x295
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(jwang:cherry picked from commit cecf5d87ff2035127bb5a9ee054d0023a4a7cad3,
adjust ctx for 4,19)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/blk-core.c       |  1 +
 block/blk-mq-sysfs.c   | 12 ++++-----
 block/blk-sysfs.c      | 44 +++++++++++++++++++------------
 block/blk.h            |  2 +-
 block/elevator.c       | 59 +++++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h |  1 +
 6 files changed, 86 insertions(+), 33 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ce3710404544..80f3e729fdd4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1059,6 +1059,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
 	mutex_init(&q->blk_trace_mutex);
 #endif
 	mutex_init(&q->sysfs_lock);
+	mutex_init(&q->sysfs_dir_lock);
 	spin_lock_init(&q->__queue_lock);
 
 	if (!q->mq_ops)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 5006a0d00990..5e4b7ed1e897 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -264,7 +264,7 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	lockdep_assert_held(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_dir_lock);
 
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
@@ -312,7 +312,7 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
 	int ret, i;
 
 	WARN_ON_ONCE(!q->kobj.parent);
-	lockdep_assert_held(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_dir_lock);
 
 	ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
 	if (ret < 0)
@@ -358,7 +358,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 	if (!q->mq_sysfs_init_done)
 		goto unlock;
 
@@ -366,7 +366,7 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
 		blk_mq_unregister_hctx(hctx);
 
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 }
 
 int blk_mq_sysfs_register(struct request_queue *q)
@@ -374,7 +374,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i, ret = 0;
 
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 	if (!q->mq_sysfs_init_done)
 		goto unlock;
 
@@ -385,7 +385,7 @@ int blk_mq_sysfs_register(struct request_queue *q)
 	}
 
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 
 	return ret;
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0a7636d24563..b2208b69f04a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -892,6 +892,7 @@ int blk_register_queue(struct gendisk *disk)
 	int ret;
 	struct device *dev = disk_to_dev(disk);
 	struct request_queue *q = disk->queue;
+	bool has_elevator = false;
 
 	if (WARN_ON(!q))
 		return -ENXIO;
@@ -899,7 +900,6 @@ int blk_register_queue(struct gendisk *disk)
 	WARN_ONCE(blk_queue_registered(q),
 		  "%s is registering an already registered queue\n",
 		  kobject_name(&dev->kobj));
-	queue_flag_set_unlocked(QUEUE_FLAG_REGISTERED, q);
 
 	/*
 	 * SCSI probing may synchronously create and destroy a lot of
@@ -920,8 +920,7 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		return ret;
 
-	/* Prevent changes through sysfs until registration is completed. */
-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 
 	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
 	if (ret < 0) {
@@ -934,26 +933,36 @@ int blk_register_queue(struct gendisk *disk)
 		blk_mq_debugfs_register(q);
 	}
 
-	kobject_uevent(&q->kobj, KOBJ_ADD);
-
-	wbt_enable_default(q);
-
-	blk_throtl_register_queue(q);
-
+	/*
+	 * The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
+	 * switch won't happen at all.
+	 */
 	if (q->request_fn || (q->mq_ops && q->elevator)) {
-		ret = elv_register_queue(q);
+		ret = elv_register_queue(q, false);
 		if (ret) {
-			mutex_unlock(&q->sysfs_lock);
-			kobject_uevent(&q->kobj, KOBJ_REMOVE);
+			mutex_unlock(&q->sysfs_dir_lock);
 			kobject_del(&q->kobj);
 			blk_trace_remove_sysfs(dev);
 			kobject_put(&dev->kobj);
 			return ret;
 		}
+		has_elevator = true;
 	}
+
+	mutex_lock(&q->sysfs_lock);
+	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
+	wbt_enable_default(q);
+	blk_throtl_register_queue(q);
+
+	/* Now everything is ready and send out KOBJ_ADD uevent */
+	kobject_uevent(&q->kobj, KOBJ_ADD);
+	if (has_elevator)
+		kobject_uevent(&q->elevator->kobj, KOBJ_ADD);
+	mutex_unlock(&q->sysfs_lock);
+
 	ret = 0;
 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_register_queue);
@@ -968,6 +977,7 @@ EXPORT_SYMBOL_GPL(blk_register_queue);
 void blk_unregister_queue(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
+	bool has_elevator;
 
 	if (WARN_ON(!q))
 		return;
@@ -982,25 +992,27 @@ void blk_unregister_queue(struct gendisk *disk)
 	 * concurrent elv_iosched_store() calls.
 	 */
 	mutex_lock(&q->sysfs_lock);
-
 	blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
+	has_elevator = !!q->elevator;
+	mutex_unlock(&q->sysfs_lock);
 
+	mutex_lock(&q->sysfs_dir_lock);
 	/*
 	 * Remove the sysfs attributes before unregistering the queue data
 	 * structures that can be modified through sysfs.
 	 */
 	if (q->mq_ops)
 		blk_mq_unregister_dev(disk_to_dev(disk), q);
-	mutex_unlock(&q->sysfs_lock);
 
 	kobject_uevent(&q->kobj, KOBJ_REMOVE);
 	kobject_del(&q->kobj);
 	blk_trace_remove_sysfs(disk_to_dev(disk));
 
 	mutex_lock(&q->sysfs_lock);
-	if (q->request_fn || (q->mq_ops && q->elevator))
+	if (q->request_fn || has_elevator)
 		elv_unregister_queue(q);
 	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 
 	kobject_put(&disk_to_dev(disk)->kobj);
 }
diff --git a/block/blk.h b/block/blk.h
index 1a5b67b57e6b..ae87e2a5f2bd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -244,7 +244,7 @@ int elevator_init_mq(struct request_queue *q);
 int elevator_switch_mq(struct request_queue *q,
 			      struct elevator_type *new_e);
 void elevator_exit(struct request_queue *, struct elevator_queue *);
-int elv_register_queue(struct request_queue *q);
+int elv_register_queue(struct request_queue *q, bool uevent);
 void elv_unregister_queue(struct request_queue *q);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
diff --git a/block/elevator.c b/block/elevator.c
index 9bffe4558929..2ff0859e8b35 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -833,13 +833,16 @@ static struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-int elv_register_queue(struct request_queue *q)
+/*
+ * elv_register_queue is called from either blk_register_queue or
+ * elevator_switch, elevator switch is prevented from being happen
+ * in the two paths, so it is safe to not hold q->sysfs_lock.
+ */
+int elv_register_queue(struct request_queue *q, bool uevent)
 {
 	struct elevator_queue *e = q->elevator;
 	int error;
 
-	lockdep_assert_held(&q->sysfs_lock);
-
 	error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched");
 	if (!error) {
 		struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -850,26 +853,36 @@ int elv_register_queue(struct request_queue *q)
 				attr++;
 			}
 		}
-		kobject_uevent(&e->kobj, KOBJ_ADD);
+		if (uevent)
+			kobject_uevent(&e->kobj, KOBJ_ADD);
+
+		mutex_lock(&q->sysfs_lock);
 		e->registered = 1;
 		if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
 			e->type->ops.sq.elevator_registered_fn(q);
+		mutex_unlock(&q->sysfs_lock);
 	}
 	return error;
 }
 
+/*
+ * elv_unregister_queue is called from either blk_unregister_queue or
+ * elevator_switch, elevator switch is prevented from being happen
+ * in the two paths, so it is safe to not hold q->sysfs_lock.
+ */
 void elv_unregister_queue(struct request_queue *q)
 {
-	lockdep_assert_held(&q->sysfs_lock);
-
 	if (q) {
 		struct elevator_queue *e = q->elevator;
 
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
+
+		mutex_lock(&q->sysfs_lock);
 		e->registered = 0;
 		/* Re-enable throttling in case elevator disabled it */
 		wbt_enable_default(q);
+		mutex_unlock(&q->sysfs_lock);
 	}
 }
 
@@ -940,10 +953,32 @@ int elevator_switch_mq(struct request_queue *q,
 	lockdep_assert_held(&q->sysfs_lock);
 
 	if (q->elevator) {
-		if (q->elevator->registered)
+		if (q->elevator->registered) {
+			mutex_unlock(&q->sysfs_lock);
+
+			/*
+			 * Concurrent elevator switch can't happen becasue
+			 * sysfs write is always exclusively on same file.
+			 *
+			 * Also the elevator queue won't be freed after
+			 * sysfs_lock is released becasue kobject_del() in
+			 * blk_unregister_queue() waits for completion of
+			 * .store & .show on its attributes.
+			 */
 			elv_unregister_queue(q);
+
+			mutex_lock(&q->sysfs_lock);
+		}
 		ioc_clear_queue(q);
 		elevator_exit(q, q->elevator);
+
+		/*
+		 * sysfs_lock may be dropped, so re-check if queue is
+		 * unregistered. If yes, don't switch to new elevator
+		 * any more
+		 */
+		if (!blk_queue_registered(q))
+			return 0;
 	}
 
 	ret = blk_mq_init_sched(q, new_e);
@@ -951,7 +986,11 @@ int elevator_switch_mq(struct request_queue *q,
 		goto out;
 
 	if (new_e) {
-		ret = elv_register_queue(q);
+		mutex_unlock(&q->sysfs_lock);
+
+		ret = elv_register_queue(q, true);
+
+		mutex_lock(&q->sysfs_lock);
 		if (ret) {
 			elevator_exit(q, q->elevator);
 			goto out;
@@ -1047,7 +1086,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (err)
 		goto fail_init;
 
-	err = elv_register_queue(q);
+	err = elv_register_queue(q, true);
 	if (err)
 		goto fail_register;
 
@@ -1067,7 +1106,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/* switch failed, restore and re-register old elevator */
 	if (old) {
 		q->elevator = old;
-		elv_register_queue(q);
+		elv_register_queue(q, true);
 		blk_queue_bypass_end(q);
 	}
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3a2b34c2c82b..209ba8e7bd31 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -637,6 +637,7 @@ struct request_queue {
 	struct delayed_work	requeue_work;
 
 	struct mutex		sysfs_lock;
+	struct mutex		sysfs_dir_lock;
 
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 6/8] block: fix race between switching elevator and removing queues
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
                   ` (4 preceding siblings ...)
  2021-02-03 13:20 ` [stable-4.19 5/8] block: split .sysfs_lock into two locks Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 7/8] block: don't release queue's sysfs lock during switching elevator Jack Wang
  2021-02-03 13:20 ` [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way Jack Wang
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable
  Cc: Ming Lei, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Bart Van Assche, Jens Axboe

From: Ming Lei <ming.lei@redhat.com>

cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
release & actuire sysfs_lock again during switching elevator. So it
isn't enough to prevent switching elevator from happening by simply
clearing QUEUE_FLAG_REGISTERED with holding sysfs_lock, because
in-progress switch still can move on after re-acquiring the lock,
meantime the flag of QUEUE_FLAG_REGISTERED won't get checked.

Fixes this issue by checking 'q->elevator' directly & locklessly after
q->kobj is removed in blk_unregister_queue(), this way is safe because
q->elevator can't be changed at that time.

Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 0a67b5a926e63ff5492c3c675eab5900580d056d)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/blk-sysfs.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b2208b69f04a..899987152701 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -977,7 +977,6 @@ EXPORT_SYMBOL_GPL(blk_register_queue);
 void blk_unregister_queue(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
-	bool has_elevator;
 
 	if (WARN_ON(!q))
 		return;
@@ -993,7 +992,6 @@ void blk_unregister_queue(struct gendisk *disk)
 	 */
 	mutex_lock(&q->sysfs_lock);
 	blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
-	has_elevator = !!q->elevator;
 	mutex_unlock(&q->sysfs_lock);
 
 	mutex_lock(&q->sysfs_dir_lock);
@@ -1009,7 +1007,11 @@ void blk_unregister_queue(struct gendisk *disk)
 	blk_trace_remove_sysfs(disk_to_dev(disk));
 
 	mutex_lock(&q->sysfs_lock);
-	if (q->request_fn || has_elevator)
+	/*
+	 * q->kobj has been removed, so it is safe to check if elevator
+	 * exists without holding q->sysfs_lock.
+	 */
+	if (q->request_fn || q->elevator)
 		elv_unregister_queue(q);
 	mutex_unlock(&q->sysfs_lock);
 	mutex_unlock(&q->sysfs_dir_lock);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 7/8] block: don't release queue's sysfs lock during switching elevator
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
                   ` (5 preceding siblings ...)
  2021-02-03 13:20 ` [stable-4.19 6/8] block: fix race between switching elevator and removing queues Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-03 13:20 ` [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way Jack Wang
  7 siblings, 0 replies; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable
  Cc: Ming Lei, Christoph Hellwig, Hannes Reinecke, Mike Snitzer,
	Bart Van Assche, Jens Axboe

From: Ming Lei <ming.lei@redhat.com>

cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.

Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.

The biggest issue is that commit cecf5d87ff20 assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.

Fixes the issue by not releasing queue's sysfs lock during switching
elevator.

Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(jwang:cherry picked from commit b89f625e28d44552083f43752f62d8621ded0a04
adjust ctx for 4.19)
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 block/blk-sysfs.c |  3 ++-
 block/elevator.c  | 31 +------------------------------
 2 files changed, 3 insertions(+), 31 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 899987152701..07494deb1a26 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -933,6 +933,7 @@ int blk_register_queue(struct gendisk *disk)
 		blk_mq_debugfs_register(q);
 	}
 
+	mutex_lock(&q->sysfs_lock);
 	/*
 	 * The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator
 	 * switch won't happen at all.
@@ -940,6 +941,7 @@ int blk_register_queue(struct gendisk *disk)
 	if (q->request_fn || (q->mq_ops && q->elevator)) {
 		ret = elv_register_queue(q, false);
 		if (ret) {
+			mutex_unlock(&q->sysfs_lock);
 			mutex_unlock(&q->sysfs_dir_lock);
 			kobject_del(&q->kobj);
 			blk_trace_remove_sysfs(dev);
@@ -949,7 +951,6 @@ int blk_register_queue(struct gendisk *disk)
 		has_elevator = true;
 	}
 
-	mutex_lock(&q->sysfs_lock);
 	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
 	wbt_enable_default(q);
 	blk_throtl_register_queue(q);
diff --git a/block/elevator.c b/block/elevator.c
index 2ff0859e8b35..5b51bc5fad9f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -856,11 +856,9 @@ int elv_register_queue(struct request_queue *q, bool uevent)
 		if (uevent)
 			kobject_uevent(&e->kobj, KOBJ_ADD);
 
-		mutex_lock(&q->sysfs_lock);
 		e->registered = 1;
 		if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
 			e->type->ops.sq.elevator_registered_fn(q);
-		mutex_unlock(&q->sysfs_lock);
 	}
 	return error;
 }
@@ -878,11 +876,9 @@ void elv_unregister_queue(struct request_queue *q)
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
 
-		mutex_lock(&q->sysfs_lock);
 		e->registered = 0;
 		/* Re-enable throttling in case elevator disabled it */
 		wbt_enable_default(q);
-		mutex_unlock(&q->sysfs_lock);
 	}
 }
 
@@ -953,32 +949,11 @@ int elevator_switch_mq(struct request_queue *q,
 	lockdep_assert_held(&q->sysfs_lock);
 
 	if (q->elevator) {
-		if (q->elevator->registered) {
-			mutex_unlock(&q->sysfs_lock);
-
-			/*
-			 * Concurrent elevator switch can't happen becasue
-			 * sysfs write is always exclusively on same file.
-			 *
-			 * Also the elevator queue won't be freed after
-			 * sysfs_lock is released becasue kobject_del() in
-			 * blk_unregister_queue() waits for completion of
-			 * .store & .show on its attributes.
-			 */
+		if (q->elevator->registered)
 			elv_unregister_queue(q);
 
-			mutex_lock(&q->sysfs_lock);
-		}
 		ioc_clear_queue(q);
 		elevator_exit(q, q->elevator);
-
-		/*
-		 * sysfs_lock may be dropped, so re-check if queue is
-		 * unregistered. If yes, don't switch to new elevator
-		 * any more
-		 */
-		if (!blk_queue_registered(q))
-			return 0;
 	}
 
 	ret = blk_mq_init_sched(q, new_e);
@@ -986,11 +961,7 @@ int elevator_switch_mq(struct request_queue *q,
 		goto out;
 
 	if (new_e) {
-		mutex_unlock(&q->sysfs_lock);
-
 		ret = elv_register_queue(q, true);
-
-		mutex_lock(&q->sysfs_lock);
 		if (ret) {
 			elevator_exit(q, q->elevator);
 			goto out;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way
  2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
                   ` (6 preceding siblings ...)
  2021-02-03 13:20 ` [stable-4.19 7/8] block: don't release queue's sysfs lock during switching elevator Jack Wang
@ 2021-02-03 13:20 ` Jack Wang
  2021-02-04 15:12   ` Greg KH
  7 siblings, 1 reply; 11+ messages in thread
From: Jack Wang @ 2021-02-03 13:20 UTC (permalink / raw)
  To: gregkh, sashal, stable; +Cc: Xiao Ni, David Jeffery, Song Liu

From: Xiao Ni <xni@redhat.com>

One customer reports a crash problem which causes by flush request. It
triggers a warning before crash.

        /* new request after previous flush is completed */
        if (ktime_after(req_start, mddev->prev_flush_start)) {
                WARN_ON(mddev->flush_bio);
                mddev->flush_bio = bio;
                bio = NULL;
        }

The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
flush_bio in md_flush_request. But there is no lock protection in
md_submit_flush_data. It can set flush_bio to NULL first because of
compiler reordering write instructions.

For example, flush bio1 sets flush bio to NULL first in
md_submit_flush_data. An interrupt or vmware causing an extended stall
happen between updating flush_bio and prev_flush_start. Because flush_bio
is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
flush bio1 updates prev_flush_start after the interrupt or extended stall.

Then flush bio3 enters in md_flush_request. The start time req_start is
behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
again. INIT_WORK() will re-initialize the list pointers in the
work_struct, which then can result in a corrupted work list and the
work_struct queued a second time. With the work list corrupted, it can
lead in invalid work items being used and cause a crash in
process_one_work.

We need to make sure only one flush bio can be handled at one same time.
So add spin lock in md_submit_flush_data to protect prev_flush_start and
flush_bio in an atomic way.

Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
[jwang: backport dc5d17a3c39b06aef866afca19245a9cfb533a79 to 4.19]
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/md/md.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 80ca13594c18..09f0d8e70b70 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -474,8 +474,10 @@ static void md_submit_flush_data(struct work_struct *ws)
 	 * could wait for this and below md_handle_request could wait for those
 	 * bios because of suspend check
 	 */
+	spin_lock_irq(&mddev->lock);
 	mddev->last_flush = mddev->start_flush;
 	mddev->flush_bio = NULL;
+	spin_unlock_irq(&mddev->lock);
 	wake_up(&mddev->sb_wait);
 
 	if (bio->bi_iter.bi_size == 0) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way
  2021-02-03 13:20 ` [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way Jack Wang
@ 2021-02-04 15:12   ` Greg KH
  2021-02-05  6:57     ` Jinpu Wang
  0 siblings, 1 reply; 11+ messages in thread
From: Greg KH @ 2021-02-04 15:12 UTC (permalink / raw)
  To: Jack Wang; +Cc: sashal, stable, Xiao Ni, David Jeffery, Song Liu

On Wed, Feb 03, 2021 at 02:20:22PM +0100, Jack Wang wrote:
> From: Xiao Ni <xni@redhat.com>
> 
> One customer reports a crash problem which causes by flush request. It
> triggers a warning before crash.
> 
>         /* new request after previous flush is completed */
>         if (ktime_after(req_start, mddev->prev_flush_start)) {
>                 WARN_ON(mddev->flush_bio);
>                 mddev->flush_bio = bio;
>                 bio = NULL;
>         }
> 
> The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
> flush_bio in md_flush_request. But there is no lock protection in
> md_submit_flush_data. It can set flush_bio to NULL first because of
> compiler reordering write instructions.
> 
> For example, flush bio1 sets flush bio to NULL first in
> md_submit_flush_data. An interrupt or vmware causing an extended stall
> happen between updating flush_bio and prev_flush_start. Because flush_bio
> is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
> flush bio1 updates prev_flush_start after the interrupt or extended stall.
> 
> Then flush bio3 enters in md_flush_request. The start time req_start is
> behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
> finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
> again. INIT_WORK() will re-initialize the list pointers in the
> work_struct, which then can result in a corrupted work list and the
> work_struct queued a second time. With the work list corrupted, it can
> lead in invalid work items being used and cause a crash in
> process_one_work.
> 
> We need to make sure only one flush bio can be handled at one same time.
> So add spin lock in md_submit_flush_data to protect prev_flush_start and
> flush_bio in an atomic way.
> 
> Reviewed-by: David Jeffery <djeffery@redhat.com>
> Signed-off-by: Xiao Ni <xni@redhat.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> [jwang: backport dc5d17a3c39b06aef866afca19245a9cfb533a79 to 4.19]

I can not take patches backported to older kernels that "skip" kernel
releases.

For example, if I take this into 4.19.y, and then someone moves to 5.4
or 5.10, they will hit the same issue.

So please provide a backported series for all affected releases, back as
far as you want, but never skip releases.

I can't take this series, I'll drop it for now and wait for an updated
set of patches.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way
  2021-02-04 15:12   ` Greg KH
@ 2021-02-05  6:57     ` Jinpu Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Jinpu Wang @ 2021-02-05  6:57 UTC (permalink / raw)
  To: Greg KH; +Cc: Sasha Levin, stable, Xiao Ni, David Jeffery, Song Liu

On Thu, Feb 4, 2021 at 4:12 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Wed, Feb 03, 2021 at 02:20:22PM +0100, Jack Wang wrote:
> > From: Xiao Ni <xni@redhat.com>
> >
> > One customer reports a crash problem which causes by flush request. It
> > triggers a warning before crash.
> >
> >         /* new request after previous flush is completed */
> >         if (ktime_after(req_start, mddev->prev_flush_start)) {
> >                 WARN_ON(mddev->flush_bio);
> >                 mddev->flush_bio = bio;
> >                 bio = NULL;
> >         }
> >
> > The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
> > flush_bio in md_flush_request. But there is no lock protection in
> > md_submit_flush_data. It can set flush_bio to NULL first because of
> > compiler reordering write instructions.
> >
> > For example, flush bio1 sets flush bio to NULL first in
> > md_submit_flush_data. An interrupt or vmware causing an extended stall
> > happen between updating flush_bio and prev_flush_start. Because flush_bio
> > is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
> > flush bio1 updates prev_flush_start after the interrupt or extended stall.
> >
> > Then flush bio3 enters in md_flush_request. The start time req_start is
> > behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
> > finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
> > again. INIT_WORK() will re-initialize the list pointers in the
> > work_struct, which then can result in a corrupted work list and the
> > work_struct queued a second time. With the work list corrupted, it can
> > lead in invalid work items being used and cause a crash in
> > process_one_work.
> >
> > We need to make sure only one flush bio can be handled at one same time.
> > So add spin lock in md_submit_flush_data to protect prev_flush_start and
> > flush_bio in an atomic way.
> >
> > Reviewed-by: David Jeffery <djeffery@redhat.com>
> > Signed-off-by: Xiao Ni <xni@redhat.com>
> > Signed-off-by: Song Liu <songliubraving@fb.com>
> > [jwang: backport dc5d17a3c39b06aef866afca19245a9cfb533a79 to 4.19]
>
> I can not take patches backported to older kernels that "skip" kernel
> releases.
>
> For example, if I take this into 4.19.y, and then someone moves to 5.4
> or 5.10, they will hit the same issue.
>
> So please provide a backported series for all affected releases, back as
> far as you want, but never skip releases.
>
> I can't take this series, I'll drop it for now and wait for an updated
> set of patches.
>
> thanks,
>
> greg k-h
Hi Greg,

Thanks for reply, only this patch should be backported also to
5.4/5.10, this backport can be applied cleanly to stable/linux-5.4.y
and stable/linux-5.10.y,

I will send the backport for them later today!

Thanks!

J

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-02-05  6:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-03 13:20 [stable-4.19 0/8] Some misc fixes Jack Wang
2021-02-03 13:20 ` [stable-4.19 1/8] block: fix NULL pointer dereference in register_disk Jack Wang
2021-02-03 13:20 ` [stable-4.19 2/8] block: don't hold q->sysfs_lock in elevator_init_mq Jack Wang
2021-02-03 13:20 ` [stable-4.19 3/8] blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue Jack Wang
2021-02-03 13:20 ` [stable-4.19 4/8] block: add helper for checking if queue is registered Jack Wang
2021-02-03 13:20 ` [stable-4.19 5/8] block: split .sysfs_lock into two locks Jack Wang
2021-02-03 13:20 ` [stable-4.19 6/8] block: fix race between switching elevator and removing queues Jack Wang
2021-02-03 13:20 ` [stable-4.19 7/8] block: don't release queue's sysfs lock during switching elevator Jack Wang
2021-02-03 13:20 ` [stable-4.19 8/8] md: Set prev_flush_start and flush_bio in an atomic way Jack Wang
2021-02-04 15:12   ` Greg KH
2021-02-05  6:57     ` Jinpu Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).