[PATCH 0/6] blk-mq: fix race related with device deletion/reset/switching sched

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/6] blk-mq: fix race related with device deletion/reset/switching sched
@ 2017-12-12 11:02 ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

Hi,

The 1st patch fixes one kernel oops triggered by IOs vs. deleting SCSI
device, and this issue can be triggered in less than 1min on scsi_debug.

The other 5 patch fixes recent Yi Zhang's reports about his NVMe stress
tests, most of them are related with switching io sched, NVMe reset or
updating nr_hw_queues.

Thanks,
Ming


Ming Lei (6):
  blk-mq: quiesce queue before freeing queue
  blk-mq: support concurrent blk_mq_quiesce_queue()
  blk-mq: quiesce queue during switching io sched and updating
    nr_requests
  blk-mq: avoid to map CPU into stale hw queue
  blk-mq: fix race between updating nr_hw_queues and switching io sched
  nvme: remove .init_request callback

 block/blk-core.c         |  9 +++++++++
 block/blk-mq-pci.c       |  8 ++++++++
 block/blk-mq.c           | 52 ++++++++++++++++++++++++++++++++++++++++++------
 block/elevator.c         |  2 ++
 drivers/nvme/host/core.c |  4 ++--
 drivers/nvme/host/pci.c  | 40 ++++++++++++++++++-------------------
 include/linux/blk-mq.h   |  7 ++++++-
 include/linux/blkdev.h   |  2 ++
 8 files changed, 95 insertions(+), 29 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/6] blk-mq: fix race related with device deletion/reset/switching sched
@ 2017-12-12 11:02 ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


Hi,

The 1st patch fixes one kernel oops triggered by IOs vs. deleting SCSI
device, and this issue can be triggered in less than 1min on scsi_debug.

The other 5 patch fixes recent Yi Zhang's reports about his NVMe stress
tests, most of them are related with switching io sched, NVMe reset or
updating nr_hw_queues.

Thanks,
Ming


Ming Lei (6):
  blk-mq: quiesce queue before freeing queue
  blk-mq: support concurrent blk_mq_quiesce_queue()
  blk-mq: quiesce queue during switching io sched and updating
    nr_requests
  blk-mq: avoid to map CPU into stale hw queue
  blk-mq: fix race between updating nr_hw_queues and switching io sched
  nvme: remove .init_request callback

 block/blk-core.c         |  9 +++++++++
 block/blk-mq-pci.c       |  8 ++++++++
 block/blk-mq.c           | 52 ++++++++++++++++++++++++++++++++++++++++++------
 block/elevator.c         |  2 ++
 drivers/nvme/host/core.c |  4 ++--
 drivers/nvme/host/pci.c  | 40 ++++++++++++++++++-------------------
 include/linux/blk-mq.h   |  7 ++++++-
 include/linux/blkdev.h   |  2 ++
 8 files changed, 95 insertions(+), 29 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] blk-mq: quiesce queue before freeing queue
  2017-12-12 11:02 ` Ming Lei
@ 2017-12-12 11:02   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei, stable

After queue is frozen, dispatch still may happen, for example:

1) requests are submitted from several contexts
2) requests from all these contexts are inserted to queue, but may dispatch
to LLD in one of these paths, but other paths sill need to move on even all
these requests are completed(that means blk_mq_freeze_queue_wait() returns
at that time)
3) dispatch after queue freezing still moves on and causes use-after-free,
because request queue is freed

This patch quiesces queue after it is frozen, and makes sure all
in-progress dispatch are completed.

This patch fixes the following kernel crash when running heavy IOs vs.
deleting device:

[   36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[   36.720318] IP: kyber_has_work+0x14/0x40
[   36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
[   36.721584] Oops: 0000 [#1] PREEMPT SMP
[   36.722105] Dumping ftrace buffer:
[   36.722570]    (ftrace buffer empty)
[   36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
[   36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
[   36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
[   36.736688] RIP: 0010:kyber_has_work+0x14/0x40
[   36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
[   36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
[   36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
[   36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
[   36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
[   36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
[   36.744189] FS:  00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
[   36.746617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
[   36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   36.752796] Call Trace:
[   36.753992]  blk_mq_do_dispatch_sched+0x7f/0xe0
[   36.755110]  blk_mq_sched_dispatch_requests+0x119/0x190
[   36.756179]  __blk_mq_run_hw_queue+0x83/0x90
[   36.757144]  __blk_mq_delay_run_hw_queue+0xaf/0x110
[   36.758046]  blk_mq_run_hw_queue+0x24/0x70
[   36.758845]  blk_mq_flush_plug_list+0x1e7/0x270
[   36.759676]  blk_flush_plug_list+0xd6/0x240
[   36.760463]  blk_finish_plug+0x27/0x40
[   36.761195]  do_io_submit+0x19b/0x780
[   36.761921]  ? entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.762788]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.763639] RIP: 0033:0x7f9b9699f697
[   36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
[   36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
[   36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
[   36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
[   36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
[   36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
[   36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
[   36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
[   36.776012] CR2: 0000000000000008
[   36.776690] ---[ end trace 4045cbce364ff2a4 ]---
[   36.777527] Kernel panic - not syncing: Fatal exception
[   36.778526] Dumping ftrace buffer:
[   36.779313]    (ftrace buffer empty)
[   36.780081] Kernel Offset: disabled
[   36.780877] ---[ end Kernel panic - not syncing: Fatal exception

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..071a0fb29597 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -694,6 +694,15 @@ void blk_cleanup_queue(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
+	/*
+	 * make sure all in-progress dispatch are completed because
+	 * blk_freeze_queue() can only complete all requests, and
+	 * dispatch may still be in-progress since we dispatch requests
+	 * from more than one contexts
+	 */
+	if (q->mq_ops)
+		blk_mq_quiesce_queue(q);
+
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 1/6] blk-mq: quiesce queue before freeing queue
@ 2017-12-12 11:02   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


After queue is frozen, dispatch still may happen, for example:

1) requests are submitted from several contexts
2) requests from all these contexts are inserted to queue, but may dispatch
to LLD in one of these paths, but other paths sill need to move on even all
these requests are completed(that means blk_mq_freeze_queue_wait() returns
at that time)
3) dispatch after queue freezing still moves on and causes use-after-free,
because request queue is freed

This patch quiesces queue after it is frozen, and makes sure all
in-progress dispatch are completed.

This patch fixes the following kernel crash when running heavy IOs vs.
deleting device:

[   36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[   36.720318] IP: kyber_has_work+0x14/0x40
[   36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
[   36.721584] Oops: 0000 [#1] PREEMPT SMP
[   36.722105] Dumping ftrace buffer:
[   36.722570]    (ftrace buffer empty)
[   36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
[   36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
[   36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
[   36.736688] RIP: 0010:kyber_has_work+0x14/0x40
[   36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
[   36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
[   36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
[   36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
[   36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
[   36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
[   36.744189] FS:  00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
[   36.746617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
[   36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   36.752796] Call Trace:
[   36.753992]  blk_mq_do_dispatch_sched+0x7f/0xe0
[   36.755110]  blk_mq_sched_dispatch_requests+0x119/0x190
[   36.756179]  __blk_mq_run_hw_queue+0x83/0x90
[   36.757144]  __blk_mq_delay_run_hw_queue+0xaf/0x110
[   36.758046]  blk_mq_run_hw_queue+0x24/0x70
[   36.758845]  blk_mq_flush_plug_list+0x1e7/0x270
[   36.759676]  blk_flush_plug_list+0xd6/0x240
[   36.760463]  blk_finish_plug+0x27/0x40
[   36.761195]  do_io_submit+0x19b/0x780
[   36.761921]  ? entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.762788]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.763639] RIP: 0033:0x7f9b9699f697
[   36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
[   36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
[   36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
[   36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
[   36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
[   36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
[   36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
[   36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
[   36.776012] CR2: 0000000000000008
[   36.776690] ---[ end trace 4045cbce364ff2a4 ]---
[   36.777527] Kernel panic - not syncing: Fatal exception
[   36.778526] Dumping ftrace buffer:
[   36.779313]    (ftrace buffer empty)
[   36.780081] Kernel Offset: disabled
[   36.780877] ---[ end Kernel panic - not syncing: Fatal exception

Cc: stable at vger.kernel.org
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 block/blk-core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..071a0fb29597 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -694,6 +694,15 @@ void blk_cleanup_queue(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
+	/*
+	 * make sure all in-progress dispatch are completed because
+	 * blk_freeze_queue() can only complete all requests, and
+	 * dispatch may still be in-progress since we dispatch requests
+	 * from more than one contexts
+	 */
+	if (q->mq_ops)
+		blk_mq_quiesce_queue(q);
+
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] blk-mq: support concurrent blk_mq_quiesce_queue()
  2017-12-12 11:02 ` Ming Lei
@ 2017-12-12 11:02   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

Turns out that blk_mq_freeze_queue() isn't stronger[1] than
blk_mq_quiesce_queue() because dispatch may still be in-progress after
queue is frozen, and in several cases, such as switching io scheduler,
and updating hw queues, we still need to quiesce queue as a supplement
of freezing queue.

As we need to extend uses of blk_mq_quiesce_queue(), it is inevitable
for us to need support cucurrent quiesce, especailly we can't let
unquiesce happen when there is quiesce happenning from other contexts.

This patch introduces q->mq_quiesce_depth to deal concurrent quiesce,
and we only unquiesce queue when it is the last one from all contexts.

[1] https://marc.info/?l=linux-block&m=150993988115872&w=2

Suggested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c           | 21 ++++++++++++++++-----
 drivers/nvme/host/core.c |  4 ++--
 include/linux/blk-mq.h   |  7 ++++++-
 include/linux/blkdev.h   |  2 ++
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11097477eeab..5d69c8075339 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -196,7 +196,8 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	queue_flag_set(QUEUE_FLAG_QUIESCED, q);
+	if (!q->quiesce_depth++)
+		queue_flag_set(QUEUE_FLAG_QUIESCED, q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
@@ -232,22 +233,32 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 /*
  * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
  * @q: request queue.
+ * @force: force to unquiesce if set
  *
  * This function recovers queue into the state before quiescing
  * which is done by blk_mq_quiesce_queue.
+ *
+ * Note: @force should be passed as true only when it is done before
+ * cleanup queue, for other cases, please don't use this way.
  */
-void blk_mq_unquiesce_queue(struct request_queue *q)
+void __blk_mq_unquiesce_queue(struct request_queue *q, bool force)
 {
 	unsigned long flags;
+	int depth;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+	if (q->quiesce_depth > 0)
+		q->quiesce_depth--;
+	depth = q->quiesce_depth;
+	if (!depth || force)
+		queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	/* dispatch requests which are inserted during quiescing */
-	blk_mq_run_hw_queues(q, true);
+	if (!depth || force)
+		blk_mq_run_hw_queues(q, true);
 }
-EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
+EXPORT_SYMBOL_GPL(__blk_mq_unquiesce_queue);
 
 void blk_mq_wake_waiters(struct request_queue *q)
 {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..1ab1168cd46a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3371,7 +3371,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	if (ctrl->admin_q)
-		blk_mq_unquiesce_queue(ctrl->admin_q);
+		__blk_mq_unquiesce_queue(ctrl->admin_q, true);
 
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		/*
@@ -3384,7 +3384,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 		blk_set_queue_dying(ns->queue);
 
 		/* Forcibly unquiesce queues to avoid blocking dispatch */
-		blk_mq_unquiesce_queue(ns->queue);
+		__blk_mq_unquiesce_queue(ns->queue, true);
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c862e2..8a01822dc09e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -265,7 +265,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
-void blk_mq_unquiesce_queue(struct request_queue *q);
+void __blk_mq_unquiesce_queue(struct request_queue *q, bool force);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
@@ -286,6 +286,11 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+static inline void blk_mq_unquiesce_queue(struct request_queue *q)
+{
+	__blk_mq_unquiesce_queue(q, false);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..ee3731f417c0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,8 @@ struct request_queue {
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
 
+	int			quiesce_depth;
+
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
 	struct bsg_class_device bsg_dev;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] blk-mq: support concurrent blk_mq_quiesce_queue()
@ 2017-12-12 11:02   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


Turns out that blk_mq_freeze_queue() isn't stronger[1] than
blk_mq_quiesce_queue() because dispatch may still be in-progress after
queue is frozen, and in several cases, such as switching io scheduler,
and updating hw queues, we still need to quiesce queue as a supplement
of freezing queue.

As we need to extend uses of blk_mq_quiesce_queue(), it is inevitable
for us to need support cucurrent quiesce, especailly we can't let
unquiesce happen when there is quiesce happenning from other contexts.

This patch introduces q->mq_quiesce_depth to deal concurrent quiesce,
and we only unquiesce queue when it is the last one from all contexts.

[1] https://marc.info/?l=linux-block&m=150993988115872&w=2

Suggested-by: Bart Van Assche <bart.vanassche at wdc.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 block/blk-mq.c           | 21 ++++++++++++++++-----
 drivers/nvme/host/core.c |  4 ++--
 include/linux/blk-mq.h   |  7 ++++++-
 include/linux/blkdev.h   |  2 ++
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11097477eeab..5d69c8075339 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -196,7 +196,8 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	queue_flag_set(QUEUE_FLAG_QUIESCED, q);
+	if (!q->quiesce_depth++)
+		queue_flag_set(QUEUE_FLAG_QUIESCED, q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
@@ -232,22 +233,32 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 /*
  * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
  * @q: request queue.
+ * @force: force to unquiesce if set
  *
  * This function recovers queue into the state before quiescing
  * which is done by blk_mq_quiesce_queue.
+ *
+ * Note: @force should be passed as true only when it is done before
+ * cleanup queue, for other cases, please don't use this way.
  */
-void blk_mq_unquiesce_queue(struct request_queue *q)
+void __blk_mq_unquiesce_queue(struct request_queue *q, bool force)
 {
 	unsigned long flags;
+	int depth;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+	if (q->quiesce_depth > 0)
+		q->quiesce_depth--;
+	depth = q->quiesce_depth;
+	if (!depth || force)
+		queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	/* dispatch requests which are inserted during quiescing */
-	blk_mq_run_hw_queues(q, true);
+	if (!depth || force)
+		blk_mq_run_hw_queues(q, true);
 }
-EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
+EXPORT_SYMBOL_GPL(__blk_mq_unquiesce_queue);
 
 void blk_mq_wake_waiters(struct request_queue *q)
 {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..1ab1168cd46a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3371,7 +3371,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	if (ctrl->admin_q)
-		blk_mq_unquiesce_queue(ctrl->admin_q);
+		__blk_mq_unquiesce_queue(ctrl->admin_q, true);
 
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		/*
@@ -3384,7 +3384,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 		blk_set_queue_dying(ns->queue);
 
 		/* Forcibly unquiesce queues to avoid blocking dispatch */
-		blk_mq_unquiesce_queue(ns->queue);
+		__blk_mq_unquiesce_queue(ns->queue, true);
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c862e2..8a01822dc09e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -265,7 +265,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
-void blk_mq_unquiesce_queue(struct request_queue *q);
+void __blk_mq_unquiesce_queue(struct request_queue *q, bool force);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
@@ -286,6 +286,11 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+static inline void blk_mq_unquiesce_queue(struct request_queue *q)
+{
+	__blk_mq_unquiesce_queue(q, false);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..ee3731f417c0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,8 @@ struct request_queue {
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
 
+	int			quiesce_depth;
+
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
 	struct bsg_class_device bsg_dev;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] blk-mq: quiesce queue during switching io sched and updating nr_requests
  2017-12-12 11:02 ` Ming Lei
@ 2017-12-12 11:02   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

Dispatch may still be in-progress after queue is frozen, so we have to
quiesce queue before switching IO scheduler and updating nr_requests.

Also when switching io schedulers, blk_mq_run_hw_queue() may still be
called somewhere(such as from nvme_reset_work()), and io scheduler's
per-hctx data may not be setup yet, so cause oops even inside
blk_mq_hctx_has_pending(), such as it can be run just between:

        ret = e->ops.mq.init_sched(q, e);
AND
        ret = e->ops.mq.init_hctx(hctx, i)

inside blk_mq_init_sched().

This reverts commit 7a148c2fcff8330(block: don't call blk_mq_quiesce_queue()
after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending
won't be called if queue is quiesced.

Fixes: 7a148c2fcff83309(block: don't call blk_mq_quiesce_queue() after queue is frozen)
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   | 27 ++++++++++++++++++++++++++-
 block/elevator.c |  2 ++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5d69c8075339..85954a0b4394 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1296,7 +1296,30 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	if (blk_mq_hctx_has_pending(hctx)) {
+	int srcu_idx;
+	bool need_run;
+
+	/*
+	 * When queue is quiesced, we may be switching io scheduler, or
+	 * updating nr_hw_queues, or other things, and we can't run queue
+	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
+	 *
+	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
+	 * quiesced.
+	 */
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		need_run = !blk_queue_quiesced(hctx->queue) &&
+			blk_mq_hctx_has_pending(hctx);
+		rcu_read_unlock();
+	} else {
+		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		need_run = !blk_queue_quiesced(hctx->queue) &&
+			blk_mq_hctx_has_pending(hctx);
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+	}
+
+	if (need_run) {
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
 		return true;
 	}
@@ -2721,6 +2744,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		return -EINVAL;
 
 	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -2744,6 +2768,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	if (!ret)
 		q->nr_requests = nr;
 
+	blk_mq_unquiesce_queue(q);
 	blk_mq_unfreeze_queue(q);
 
 	return ret;
diff --git a/block/elevator.c b/block/elevator.c
index 7bda083d5968..138faeb08a7c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -968,6 +968,7 @@ static int elevator_switch_mq(struct request_queue *q,
 	int ret;
 
 	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
 		if (q->elevator->registered)
@@ -994,6 +995,7 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
+	blk_mq_unquiesce_queue(q);
 	blk_mq_unfreeze_queue(q);
 	return ret;
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] blk-mq: quiesce queue during switching io sched and updating nr_requests
@ 2017-12-12 11:02   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


Dispatch may still be in-progress after queue is frozen, so we have to
quiesce queue before switching IO scheduler and updating nr_requests.

Also when switching io schedulers, blk_mq_run_hw_queue() may still be
called somewhere(such as from nvme_reset_work()), and io scheduler's
per-hctx data may not be setup yet, so cause oops even inside
blk_mq_hctx_has_pending(), such as it can be run just between:

        ret = e->ops.mq.init_sched(q, e);
AND
        ret = e->ops.mq.init_hctx(hctx, i)

inside blk_mq_init_sched().

This reverts commit 7a148c2fcff8330(block: don't call blk_mq_quiesce_queue()
after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending
won't be called if queue is quiesced.

Fixes: 7a148c2fcff83309(block: don't call blk_mq_quiesce_queue() after queue is frozen)
Reported-by: Yi Zhang <yi.zhang at redhat.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 block/blk-mq.c   | 27 ++++++++++++++++++++++++++-
 block/elevator.c |  2 ++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5d69c8075339..85954a0b4394 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1296,7 +1296,30 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	if (blk_mq_hctx_has_pending(hctx)) {
+	int srcu_idx;
+	bool need_run;
+
+	/*
+	 * When queue is quiesced, we may be switching io scheduler, or
+	 * updating nr_hw_queues, or other things, and we can't run queue
+	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
+	 *
+	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
+	 * quiesced.
+	 */
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		need_run = !blk_queue_quiesced(hctx->queue) &&
+			blk_mq_hctx_has_pending(hctx);
+		rcu_read_unlock();
+	} else {
+		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		need_run = !blk_queue_quiesced(hctx->queue) &&
+			blk_mq_hctx_has_pending(hctx);
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+	}
+
+	if (need_run) {
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
 		return true;
 	}
@@ -2721,6 +2744,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		return -EINVAL;
 
 	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -2744,6 +2768,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	if (!ret)
 		q->nr_requests = nr;
 
+	blk_mq_unquiesce_queue(q);
 	blk_mq_unfreeze_queue(q);
 
 	return ret;
diff --git a/block/elevator.c b/block/elevator.c
index 7bda083d5968..138faeb08a7c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -968,6 +968,7 @@ static int elevator_switch_mq(struct request_queue *q,
 	int ret;
 
 	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
 		if (q->elevator->registered)
@@ -994,6 +995,7 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
+	blk_mq_unquiesce_queue(q);
 	blk_mq_unfreeze_queue(q);
 	return ret;
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue
  2017-12-12 11:02 ` Ming Lei
@ 2017-12-12 11:02   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
previous map isn't cleared yet, and may point to one stale hw queue
index.

This patch fixes the following issue by clearing the mapping table before
setting it up in blk_mq_pci_map_queues().

This patches fixes this following issue reported by Zhang Yi:

[  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
[  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
[  101.216346] PGD 0 P4D 0
[  101.219171] Oops: 0000 [#1] SMP
[  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
[  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
[  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
[  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
[  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
[  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
[  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
[  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
[  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
[  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
[  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
[  101.387714] Call Trace:
[  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
[  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
[  101.400848]  ? pick_next_task_fair+0x290/0x5f0
[  101.405807]  ? __switch_to+0x1f5/0x430
[  101.409988]  ? put_prev_entity+0x2f/0xd0
[  101.414365]  process_one_work+0x141/0x340
[  101.418836]  worker_thread+0x47/0x3e0
[  101.422921]  kthread+0xf5/0x130
[  101.426424]  ? rescuer_thread+0x380/0x380
[  101.430896]  ? kthread_associate_blkcg+0x90/0x90
[  101.436048]  ret_from_fork+0x1f/0x30
[  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
[  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
[  101.468205] CR2: 0000000000000098
[  101.471907] ---[ end trace 5fe710f98228a3ca ]---
[  101.482489] Kernel panic - not syncing: Fatal exception
[  101.488505] Kernel Offset: disabled
[  101.497752] ---[ end Kernel panic - not syncing: Fatal exception

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 76944e3271bf..b62212cb8060 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -33,6 +33,14 @@ int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	/*
+	 * When nr_hw_queue is resized, we have to reset the map table
+	 * for avoiding stale mapping since one CPU may not be mapped
+	 * to any hw queue.
+	 */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = 0;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = pci_irq_get_affinity(pdev, queue);
 		if (!mask)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue
@ 2017-12-12 11:02   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
previous map isn't cleared yet, and may point to one stale hw queue
index.

This patch fixes the following issue by clearing the mapping table before
setting it up in blk_mq_pci_map_queues().

This patches fixes this following issue reported by Zhang Yi:

[  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
[  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
[  101.216346] PGD 0 P4D 0
[  101.219171] Oops: 0000 [#1] SMP
[  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
[  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
[  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
[  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
[  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
[  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
[  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
[  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
[  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
[  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
[  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
[  101.387714] Call Trace:
[  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
[  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
[  101.400848]  ? pick_next_task_fair+0x290/0x5f0
[  101.405807]  ? __switch_to+0x1f5/0x430
[  101.409988]  ? put_prev_entity+0x2f/0xd0
[  101.414365]  process_one_work+0x141/0x340
[  101.418836]  worker_thread+0x47/0x3e0
[  101.422921]  kthread+0xf5/0x130
[  101.426424]  ? rescuer_thread+0x380/0x380
[  101.430896]  ? kthread_associate_blkcg+0x90/0x90
[  101.436048]  ret_from_fork+0x1f/0x30
[  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
[  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
[  101.468205] CR2: 0000000000000098
[  101.471907] ---[ end trace 5fe710f98228a3ca ]---
[  101.482489] Kernel panic - not syncing: Fatal exception
[  101.488505] Kernel Offset: disabled
[  101.497752] ---[ end Kernel panic - not syncing: Fatal exception

Reported-by: Yi Zhang <yi.zhang at redhat.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 block/blk-mq-pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 76944e3271bf..b62212cb8060 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -33,6 +33,14 @@ int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	/*
+	 * When nr_hw_queue is resized, we have to reset the map table
+	 * for avoiding stale mapping since one CPU may not be mapped
+	 * to any hw queue.
+	 */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = 0;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = pci_irq_get_affinity(pdev, queue);
 		if (!mask)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] blk-mq: fix race between updating nr_hw_queues and switching io sched
  2017-12-12 11:02 ` Ming Lei
@ 2017-12-12 11:02   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.

This patch fixes the race be holding q->sysfs_lock.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85954a0b4394..4e97c40e7a0f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2418,6 +2418,9 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
 
 	blk_mq_sysfs_unregister(q);
+
+	/* protect against switching io scheduler  */
+	mutex_lock(&q->sysfs_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int node;
 
@@ -2462,6 +2465,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		}
 	}
 	q->nr_hw_queues = i;
+	mutex_unlock(&q->sysfs_lock);
 	blk_mq_sysfs_register(q);
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] blk-mq: fix race between updating nr_hw_queues and switching io sched
@ 2017-12-12 11:02   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.

This patch fixes the race be holding q->sysfs_lock.

Reported-by: Yi Zhang <yi.zhang at redhat.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85954a0b4394..4e97c40e7a0f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2418,6 +2418,9 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
 
 	blk_mq_sysfs_unregister(q);
+
+	/* protect against switching io scheduler  */
+	mutex_lock(&q->sysfs_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int node;
 
@@ -2462,6 +2465,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		}
 	}
 	q->nr_hw_queues = i;
+	mutex_unlock(&q->sysfs_lock);
 	blk_mq_sysfs_register(q);
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] nvme: remove .init_request callback
  2017-12-12 11:02 ` Ming Lei
@ 2017-12-12 11:02   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

It may cause race by setting 'nvmeq' in nvme_init_request()
because .init_request is called inside switching io scheduler, which
may happen when the NVMe device is being resetted and its nvme queues
are being freed and created. We don't have any sync between the two
pathes.

This patch removes the .init_request callback and sets the nvmeq runtime,
and fixes the following bug:

[   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[   93.274146] invalid opcode: 0000 [#1] SMP
[   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[   93.446368] Call Trace:
[   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
[   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[   93.459214]  blk_mq_init_sched+0x7e/0x140
[   93.463687]  elevator_switch+0x5a/0x1f0
[   93.467966]  ? elevator_get.isra.17+0x52/0xc0
[   93.472826]  elv_iosched_store+0xde/0x150
[   93.477299]  queue_attr_store+0x4e/0x90
[   93.481580]  kernfs_fop_write+0xfa/0x180
[   93.485958]  __vfs_write+0x33/0x170
[   93.489851]  ? __inode_security_revalidate+0x4c/0x60
[   93.495390]  ? selinux_file_permission+0xda/0x130
[   93.500641]  ? _cond_resched+0x15/0x30
[   93.504815]  vfs_write+0xad/0x1a0
[   93.508512]  SyS_write+0x52/0xc0
[   93.512113]  do_syscall_64+0x61/0x1a0
[   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
[   93.521351] RIP: 0033:0x7f33ce96aab0
[   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[   93.602273] ---[ end trace 810dde3993e5f14e ]---

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..1e9a617e728b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -397,19 +397,6 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	return 0;
 }
 
-static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
-		unsigned int hctx_idx, unsigned int numa_node)
-{
-	struct nvme_dev *dev = set->driver_data;
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	int queue_idx = (set == &dev->tagset) ? hctx_idx + 1 : 0;
-	struct nvme_queue *nvmeq = dev->queues[queue_idx];
-
-	BUG_ON(!nvmeq);
-	iod->nvmeq = nvmeq;
-	return 0;
-}
-
 static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
@@ -448,7 +435,8 @@ static void **nvme_pci_iod_list(struct request *req)
 	return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
 }
 
-static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
+static blk_status_t nvme_init_iod(struct request *rq,
+		struct nvme_dev *dev, int queue_idx)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(rq);
 	int nseg = blk_rq_nr_phys_segments(rq);
@@ -469,6 +457,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->npages = -1;
 	iod->nents = 0;
 	iod->length = size;
+	iod->nvmeq = dev->queues[queue_idx];
 
 	return BLK_STS_OK;
 }
@@ -862,8 +851,9 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 /*
  * NOTE: ns is NULL when called on the admin queue.
  */
-static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
-			 const struct blk_mq_queue_data *bd)
+static blk_status_t __nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd,
+			 int queue_idx)
 {
 	struct nvme_ns *ns = hctx->queue->queuedata;
 	struct nvme_queue *nvmeq = hctx->driver_data;
@@ -876,7 +866,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
-	ret = nvme_init_iod(req, dev);
+	ret = nvme_init_iod(req, dev, queue_idx);
 	if (ret)
 		goto out_free_cmd;
 
@@ -905,6 +895,18 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd)
+{
+	return __nvme_queue_rq(hctx, bd, hctx->queue_num + 1);
+}
+
+static blk_status_t nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	return __nvme_queue_rq(hctx, bd, 0);
+}
+
 static void nvme_pci_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1480,11 +1482,10 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 }
 
 static const struct blk_mq_ops nvme_mq_admin_ops = {
-	.queue_rq	= nvme_queue_rq,
+	.queue_rq	= nvme_admin_queue_rq,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_admin_init_hctx,
 	.exit_hctx      = nvme_admin_exit_hctx,
-	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
 };
 
@@ -1492,7 +1493,6 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_init_hctx,
-	.init_request	= nvme_init_request,
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] nvme: remove .init_request callback
@ 2017-12-12 11:02   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 11:02 UTC (permalink / raw)


It may cause race by setting 'nvmeq' in nvme_init_request()
because .init_request is called inside switching io scheduler, which
may happen when the NVMe device is being resetted and its nvme queues
are being freed and created. We don't have any sync between the two
pathes.

This patch removes the .init_request callback and sets the nvmeq runtime,
and fixes the following bug:

[   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[   93.274146] invalid opcode: 0000 [#1] SMP
[   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[   93.446368] Call Trace:
[   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
[   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[   93.459214]  blk_mq_init_sched+0x7e/0x140
[   93.463687]  elevator_switch+0x5a/0x1f0
[   93.467966]  ? elevator_get.isra.17+0x52/0xc0
[   93.472826]  elv_iosched_store+0xde/0x150
[   93.477299]  queue_attr_store+0x4e/0x90
[   93.481580]  kernfs_fop_write+0xfa/0x180
[   93.485958]  __vfs_write+0x33/0x170
[   93.489851]  ? __inode_security_revalidate+0x4c/0x60
[   93.495390]  ? selinux_file_permission+0xda/0x130
[   93.500641]  ? _cond_resched+0x15/0x30
[   93.504815]  vfs_write+0xad/0x1a0
[   93.508512]  SyS_write+0x52/0xc0
[   93.512113]  do_syscall_64+0x61/0x1a0
[   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
[   93.521351] RIP: 0033:0x7f33ce96aab0
[   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[   93.602273] ---[ end trace 810dde3993e5f14e ]---

Reported-by: Yi Zhang <yi.zhang at redhat.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 drivers/nvme/host/pci.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..1e9a617e728b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -397,19 +397,6 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	return 0;
 }
 
-static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
-		unsigned int hctx_idx, unsigned int numa_node)
-{
-	struct nvme_dev *dev = set->driver_data;
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	int queue_idx = (set == &dev->tagset) ? hctx_idx + 1 : 0;
-	struct nvme_queue *nvmeq = dev->queues[queue_idx];
-
-	BUG_ON(!nvmeq);
-	iod->nvmeq = nvmeq;
-	return 0;
-}
-
 static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
@@ -448,7 +435,8 @@ static void **nvme_pci_iod_list(struct request *req)
 	return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
 }
 
-static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
+static blk_status_t nvme_init_iod(struct request *rq,
+		struct nvme_dev *dev, int queue_idx)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(rq);
 	int nseg = blk_rq_nr_phys_segments(rq);
@@ -469,6 +457,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->npages = -1;
 	iod->nents = 0;
 	iod->length = size;
+	iod->nvmeq = dev->queues[queue_idx];
 
 	return BLK_STS_OK;
 }
@@ -862,8 +851,9 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 /*
  * NOTE: ns is NULL when called on the admin queue.
  */
-static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
-			 const struct blk_mq_queue_data *bd)
+static blk_status_t __nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd,
+			 int queue_idx)
 {
 	struct nvme_ns *ns = hctx->queue->queuedata;
 	struct nvme_queue *nvmeq = hctx->driver_data;
@@ -876,7 +866,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
-	ret = nvme_init_iod(req, dev);
+	ret = nvme_init_iod(req, dev, queue_idx);
 	if (ret)
 		goto out_free_cmd;
 
@@ -905,6 +895,18 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd)
+{
+	return __nvme_queue_rq(hctx, bd, hctx->queue_num + 1);
+}
+
+static blk_status_t nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	return __nvme_queue_rq(hctx, bd, 0);
+}
+
 static void nvme_pci_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1480,11 +1482,10 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 }
 
 static const struct blk_mq_ops nvme_mq_admin_ops = {
-	.queue_rq	= nvme_queue_rq,
+	.queue_rq	= nvme_admin_queue_rq,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_admin_init_hctx,
 	.exit_hctx      = nvme_admin_exit_hctx,
-	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
 };
 
@@ -1492,7 +1493,6 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_init_hctx,
-	.init_request	= nvme_init_request,
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue
  2017-12-12 11:02   ` Ming Lei
@ 2017-12-12 14:12     ` Christoph Hellwig
  -1 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2017-12-12 14:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn

On Tue, Dec 12, 2017 at 07:02:30PM +0800, Ming Lei wrote:
> blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
> previous map isn't cleared yet, and may point to one stale hw queue
> index.
> 
> This patch fixes the following issue by clearing the mapping table before
> setting it up in blk_mq_pci_map_queues().

This needs to be done in the caller intead of the transport specific
queue map helpers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue
@ 2017-12-12 14:12     ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2017-12-12 14:12 UTC (permalink / raw)


On Tue, Dec 12, 2017@07:02:30PM +0800, Ming Lei wrote:
> blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
> previous map isn't cleared yet, and may point to one stale hw queue
> index.
> 
> This patch fixes the following issue by clearing the mapping table before
> setting it up in blk_mq_pci_map_queues().

This needs to be done in the caller intead of the transport specific
queue map helpers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] nvme: remove .init_request callback
  2017-12-12 11:02   ` Ming Lei
@ 2017-12-12 14:13     ` Christoph Hellwig
  -1 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2017-12-12 14:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn

On Tue, Dec 12, 2017 at 07:02:32PM +0800, Ming Lei wrote:
> It may cause race by setting 'nvmeq' in nvme_init_request()
> because .init_request is called inside switching io scheduler, which
> may happen when the NVMe device is being resetted and its nvme queues
> are being freed and created. We don't have any sync between the two
> pathes.
> 
> This patch removes the .init_request callback and sets the nvmeq runtime,
> and fixes the following bug:

If ->init_request doesn't work for NVMe it won't work for any other
driver either, so we need to remove it entirely if we can't fix it.

I'd much rather try to fix it first, though.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 6/6] nvme: remove .init_request callback
@ 2017-12-12 14:13     ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2017-12-12 14:13 UTC (permalink / raw)


On Tue, Dec 12, 2017@07:02:32PM +0800, Ming Lei wrote:
> It may cause race by setting 'nvmeq' in nvme_init_request()
> because .init_request is called inside switching io scheduler, which
> may happen when the NVMe device is being resetted and its nvme queues
> are being freed and created. We don't have any sync between the two
> pathes.
> 
> This patch removes the .init_request callback and sets the nvmeq runtime,
> and fixes the following bug:

If ->init_request doesn't work for NVMe it won't work for any other
driver either, so we need to remove it entirely if we can't fix it.

I'd much rather try to fix it first, though.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] nvme: remove .init_request callback
  2017-12-12 14:13     ` Christoph Hellwig
@ 2017-12-12 15:10       ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 15:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Jens Axboe, linux-block, Bart Van Assche,
	Keith Busch, Sagi Grimberg, Yi Zhang, Johannes Thumshirn

On Tue, Dec 12, 2017 at 03:13:58PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 12, 2017 at 07:02:32PM +0800, Ming Lei wrote:
> > It may cause race by setting 'nvmeq' in nvme_init_request()
> > because .init_request is called inside switching io scheduler, which
> > may happen when the NVMe device is being resetted and its nvme queues
> > are being freed and created. We don't have any sync between the two
> > pathes.
> > 
> > This patch removes the .init_request callback and sets the nvmeq runtime,
> > and fixes the following bug:
> 
> If ->init_request doesn't work for NVMe it won't work for any other
> driver either, so we need to remove it entirely if we can't fix it.
> 
> I'd much rather try to fix it first, though.

Just checked all this uses, and it is a NVMe specific issue, because only
NVMe's .init_request() associates reference of one variable to the request,
and the variable itself may change during the request's lifetime.

And all the following .init_request() have this issue:

	nvme_fc_init_request()
	nvme_rdma_init_request()
	nvme_loop_init_request()
	

Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 6/6] nvme: remove .init_request callback
@ 2017-12-12 15:10       ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 15:10 UTC (permalink / raw)


On Tue, Dec 12, 2017@03:13:58PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 12, 2017@07:02:32PM +0800, Ming Lei wrote:
> > It may cause race by setting 'nvmeq' in nvme_init_request()
> > because .init_request is called inside switching io scheduler, which
> > may happen when the NVMe device is being resetted and its nvme queues
> > are being freed and created. We don't have any sync between the two
> > pathes.
> > 
> > This patch removes the .init_request callback and sets the nvmeq runtime,
> > and fixes the following bug:
> 
> If ->init_request doesn't work for NVMe it won't work for any other
> driver either, so we need to remove it entirely if we can't fix it.
> 
> I'd much rather try to fix it first, though.

Just checked all this uses, and it is a NVMe specific issue, because only
NVMe's .init_request() associates reference of one variable to the request,
and the variable itself may change during the request's lifetime.

And all the following .init_request() have this issue:

	nvme_fc_init_request()
	nvme_rdma_init_request()
	nvme_loop_init_request()
	

Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue
  2017-12-12 14:12     ` Christoph Hellwig
@ 2017-12-12 15:12       ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 15:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Jens Axboe, linux-block, Bart Van Assche,
	Keith Busch, Sagi Grimberg, Yi Zhang, Johannes Thumshirn

On Tue, Dec 12, 2017 at 03:12:43PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 12, 2017 at 07:02:30PM +0800, Ming Lei wrote:
> > blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
> > previous map isn't cleared yet, and may point to one stale hw queue
> > index.
> > 
> > This patch fixes the following issue by clearing the mapping table before
> > setting it up in blk_mq_pci_map_queues().
> 
> This needs to be done in the caller intead of the transport specific
> queue map helpers.

OK, there is such issue on other .map_queues() too, such as virtio, ...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue
@ 2017-12-12 15:12       ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-12 15:12 UTC (permalink / raw)


On Tue, Dec 12, 2017@03:12:43PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 12, 2017@07:02:30PM +0800, Ming Lei wrote:
> > blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
> > previous map isn't cleared yet, and may point to one stale hw queue
> > index.
> > 
> > This patch fixes the following issue by clearing the mapping table before
> > setting it up in blk_mq_pci_map_queues().
> 
> This needs to be done in the caller intead of the transport specific
> queue map helpers.

OK, there is such issue on other .map_queues() too, such as virtio, ...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] nvme: remove .init_request callback
  2017-12-12 15:10       ` Ming Lei
@ 2017-12-13  6:43         ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-13  6:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Jens Axboe, linux-block, Bart Van Assche,
	Keith Busch, Sagi Grimberg, Yi Zhang, Johannes Thumshirn

On Tue, Dec 12, 2017 at 11:10:32PM +0800, Ming Lei wrote:
> On Tue, Dec 12, 2017 at 03:13:58PM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 12, 2017 at 07:02:32PM +0800, Ming Lei wrote:
> > > It may cause race by setting 'nvmeq' in nvme_init_request()
> > > because .init_request is called inside switching io scheduler, which
> > > may happen when the NVMe device is being resetted and its nvme queues
> > > are being freed and created. We don't have any sync between the two
> > > pathes.
> > > 
> > > This patch removes the .init_request callback and sets the nvmeq runtime,
> > > and fixes the following bug:
> > 
> > If ->init_request doesn't work for NVMe it won't work for any other
> > driver either, so we need to remove it entirely if we can't fix it.
> > 
> > I'd much rather try to fix it first, though.
> 
> Just checked all this uses, and it is a NVMe specific issue, because only
> NVMe's .init_request() associates reference of one variable to the request,
> and the variable itself may change during the request's lifetime.
> 
> And all the following .init_request() have this issue:
> 
> 	nvme_fc_init_request()
> 	nvme_rdma_init_request()
> 	nvme_loop_init_request()

After taking a close look, there isn't such issue actually on the above
three cases because the lifetime of ctl->queues(and each element) are
same with 'nvme_ctr/nvme_rdma_ctrl/nvme_loop_ctrl'. Not like nvme-pci,
->queues(and each element) won't be reallocated.

So it is a nvme-pci specific issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 6/6] nvme: remove .init_request callback
@ 2017-12-13  6:43         ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2017-12-13  6:43 UTC (permalink / raw)


On Tue, Dec 12, 2017@11:10:32PM +0800, Ming Lei wrote:
> On Tue, Dec 12, 2017@03:13:58PM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 12, 2017@07:02:32PM +0800, Ming Lei wrote:
> > > It may cause race by setting 'nvmeq' in nvme_init_request()
> > > because .init_request is called inside switching io scheduler, which
> > > may happen when the NVMe device is being resetted and its nvme queues
> > > are being freed and created. We don't have any sync between the two
> > > pathes.
> > > 
> > > This patch removes the .init_request callback and sets the nvmeq runtime,
> > > and fixes the following bug:
> > 
> > If ->init_request doesn't work for NVMe it won't work for any other
> > driver either, so we need to remove it entirely if we can't fix it.
> > 
> > I'd much rather try to fix it first, though.
> 
> Just checked all this uses, and it is a NVMe specific issue, because only
> NVMe's .init_request() associates reference of one variable to the request,
> and the variable itself may change during the request's lifetime.
> 
> And all the following .init_request() have this issue:
> 
> 	nvme_fc_init_request()
> 	nvme_rdma_init_request()
> 	nvme_loop_init_request()

After taking a close look, there isn't such issue actually on the above
three cases because the lifetime of ctl->queues(and each element) are
same with 'nvme_ctr/nvme_rdma_ctrl/nvme_loop_ctrl'. Not like nvme-pci,
->queues(and each element) won't be reallocated.

So it is a nvme-pci specific issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-12-13  6:43 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-12 11:02 [PATCH 0/6] blk-mq: fix race related with device deletion/reset/switching sched Ming Lei
2017-12-12 11:02 ` Ming Lei
2017-12-12 11:02 ` [PATCH 1/6] blk-mq: quiesce queue before freeing queue Ming Lei
2017-12-12 11:02   ` Ming Lei
2017-12-12 11:02 ` [PATCH 2/6] blk-mq: support concurrent blk_mq_quiesce_queue() Ming Lei
2017-12-12 11:02   ` Ming Lei
2017-12-12 11:02 ` [PATCH 3/6] blk-mq: quiesce queue during switching io sched and updating nr_requests Ming Lei
2017-12-12 11:02   ` Ming Lei
2017-12-12 11:02 ` [PATCH 4/6] blk-mq: avoid to map CPU into stale hw queue Ming Lei
2017-12-12 11:02   ` Ming Lei
2017-12-12 14:12   ` Christoph Hellwig
2017-12-12 14:12     ` Christoph Hellwig
2017-12-12 15:12     ` Ming Lei
2017-12-12 15:12       ` Ming Lei
2017-12-12 11:02 ` [PATCH 5/6] blk-mq: fix race between updating nr_hw_queues and switching io sched Ming Lei
2017-12-12 11:02   ` Ming Lei
2017-12-12 11:02 ` [PATCH 6/6] nvme: remove .init_request callback Ming Lei
2017-12-12 11:02   ` Ming Lei
2017-12-12 14:13   ` Christoph Hellwig
2017-12-12 14:13     ` Christoph Hellwig
2017-12-12 15:10     ` Ming Lei
2017-12-12 15:10       ` Ming Lei
2017-12-13  6:43       ` Ming Lei
2017-12-13  6:43         ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.