* [PATCH V2 0/6] blk-mq: fix race related with device deletion/reset/switching sched
From: Ming Lei @ 2017-12-14  2:30 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

Hi,

The 1st patch fixes a kernel oops triggered by I/O racing with SCSI device
deletion; the issue can be reproduced easily on scsi_debug.

The other 5 patches fix Yi Zhang's recent reports about his NVMe stress
tests; most of them are related to switching the io scheduler, NVMe reset,
or updating nr_hw_queues.

V2:
	- address the stale queue mapping in blk_mq_update_queue_map()
	instead of in the PCI transport, since the issue exists on other
	transports too, as suggested by Christoph
	- avoid introducing nvme_admin_queue_rq(), since the nvme queue can
	be retrieved from hctx->driver_data, which is reliable


Thanks,
Ming

Ming Lei (6):
  blk-mq: quiesce queue before freeing queue
  blk-mq: support concurrent blk_mq_quiesce_queue()
  blk-mq: quiesce queue during switching io sched and updating
    nr_requests
  blk-mq: avoid to map CPU into stale hw queue
  blk-mq: fix race between updating nr_hw_queues and switching io sched
  nvme-pci: remove .init_request callback

 block/blk-core.c         |  9 ++++++
 block/blk-mq.c           | 74 ++++++++++++++++++++++++++++++++++++++++++------
 block/elevator.c         |  2 ++
 drivers/nvme/host/core.c |  4 +--
 drivers/nvme/host/pci.c  | 21 +++-----------
 include/linux/blk-mq.h   |  7 ++++-
 include/linux/blkdev.h   |  2 ++
 7 files changed, 91 insertions(+), 28 deletions(-)

-- 
2.9.5

* [PATCH V2 1/6] blk-mq: quiesce queue before freeing queue
From: Ming Lei @ 2017-12-14  2:30 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei, stable

After a queue is frozen, dispatch may still be in progress, for example:

1) requests are submitted from several contexts
2) requests from all these contexts are inserted to the queue, and may be
dispatched to the LLD in one of these paths, but the other paths still need
to move on even after all these requests are completed (that is, when
blk_mq_freeze_queue_wait() returns)
3) dispatch triggered after the queue freeze still moves on and causes a
use-after-free, because the request queue has been freed

This patch quiesces the queue after it is frozen, and makes sure all
in-progress dispatch is completed.
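
A rough timeline of the race (simplified; one dispatch context racing with
device deletion):

        CPU0 (dispatch)                      CPU1 (deleting device)
        ----------------------------------   ------------------------------
        blk_mq_run_hw_queue()
          blk_mq_sched_dispatch_requests()
            ...                              blk_freeze_queue()
            last request completes  ---->    blk_mq_freeze_queue_wait()
                                               returns
                                             blk_cleanup_queue()
                                               frees elevator/hctx data
            kyber_has_work(hctx)    <----    use-after-free

blk_mq_quiesce_queue() closes this window because it waits for an RCU/SRCU
grace period covering every in-flight dispatch section.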

This patch fixes the following kernel crash when running heavy IOs vs.
deleting device:

[   36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[   36.720318] IP: kyber_has_work+0x14/0x40
[   36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
[   36.721584] Oops: 0000 [#1] PREEMPT SMP
[   36.722105] Dumping ftrace buffer:
[   36.722570]    (ftrace buffer empty)
[   36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
[   36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
[   36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
[   36.736688] RIP: 0010:kyber_has_work+0x14/0x40
[   36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
[   36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
[   36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
[   36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
[   36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
[   36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
[   36.744189] FS:  00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
[   36.746617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
[   36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   36.752796] Call Trace:
[   36.753992]  blk_mq_do_dispatch_sched+0x7f/0xe0
[   36.755110]  blk_mq_sched_dispatch_requests+0x119/0x190
[   36.756179]  __blk_mq_run_hw_queue+0x83/0x90
[   36.757144]  __blk_mq_delay_run_hw_queue+0xaf/0x110
[   36.758046]  blk_mq_run_hw_queue+0x24/0x70
[   36.758845]  blk_mq_flush_plug_list+0x1e7/0x270
[   36.759676]  blk_flush_plug_list+0xd6/0x240
[   36.760463]  blk_finish_plug+0x27/0x40
[   36.761195]  do_io_submit+0x19b/0x780
[   36.761921]  ? entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.762788]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[   36.763639] RIP: 0033:0x7f9b9699f697
[   36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
[   36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
[   36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
[   36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
[   36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
[   36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
[   36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
[   36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
[   36.776012] CR2: 0000000000000008
[   36.776690] ---[ end trace 4045cbce364ff2a4 ]---
[   36.777527] Kernel panic - not syncing: Fatal exception
[   36.778526] Dumping ftrace buffer:
[   36.779313]    (ftrace buffer empty)
[   36.780081] Kernel Offset: disabled
[   36.780877] ---[ end Kernel panic - not syncing: Fatal exception

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..071a0fb29597 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -694,6 +694,15 @@ void blk_cleanup_queue(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
+	/*
+	 * make sure all in-progress dispatch are completed because
+	 * blk_freeze_queue() can only complete all requests, and
+	 * dispatch may still be in-progress since we dispatch requests
+	 * from more than one contexts
+	 */
+	if (q->mq_ops)
+		blk_mq_quiesce_queue(q);
+
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5

* [PATCH V2 2/6] blk-mq: support concurrent blk_mq_quiesce_queue()
From: Ming Lei @ 2017-12-14  2:30 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

It turns out that blk_mq_freeze_queue() isn't stronger[1] than
blk_mq_quiesce_queue(), because dispatch may still be in progress after
the queue is frozen; in several cases, such as switching the io scheduler
and updating hw queues, we still need to quiesce the queue as a supplement
to freezing it.

As we need to extend the uses of blk_mq_quiesce_queue(), supporting
concurrent quiesce becomes inevitable; in particular, we can't let an
unquiesce take effect while a quiesce from another context is still
pending.

This patch introduces q->quiesce_depth to handle concurrent quiesce, and
the queue is only unquiesced by the last unquiesce among all contexts.
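
With the depth counter, quiesce/unquiesce calls pair up across contexts; an
illustrative sketch of the intended behaviour:

        blk_mq_quiesce_queue(q);    /* depth 0 -> 1, QUEUE_FLAG_QUIESCED set */
        blk_mq_quiesce_queue(q);    /* depth 1 -> 2, flag already set */
        blk_mq_unquiesce_queue(q);  /* depth 2 -> 1, flag stays set */
        blk_mq_unquiesce_queue(q);  /* depth 1 -> 0, flag cleared, queues rerun */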

[1] https://marc.info/?l=linux-block&m=150993988115872&w=2

Suggested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c           | 21 ++++++++++++++++-----
 drivers/nvme/host/core.c |  4 ++--
 include/linux/blk-mq.h   |  7 ++++++-
 include/linux/blkdev.h   |  2 ++
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11097477eeab..5d69c8075339 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -196,7 +196,8 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	queue_flag_set(QUEUE_FLAG_QUIESCED, q);
+	if (!q->quiesce_depth++)
+		queue_flag_set(QUEUE_FLAG_QUIESCED, q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
@@ -232,22 +233,32 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 /*
  * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
  * @q: request queue.
+ * @force: force to unquiesce if set
  *
  * This function recovers queue into the state before quiescing
  * which is done by blk_mq_quiesce_queue.
+ *
+ * Note: @force should be passed as true only when it is done before
+ * cleanup queue, for other cases, please don't use this way.
  */
-void blk_mq_unquiesce_queue(struct request_queue *q)
+void __blk_mq_unquiesce_queue(struct request_queue *q, bool force)
 {
 	unsigned long flags;
+	int depth;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+	if (q->quiesce_depth > 0)
+		q->quiesce_depth--;
+	depth = q->quiesce_depth;
+	if (!depth || force)
+		queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	/* dispatch requests which are inserted during quiescing */
-	blk_mq_run_hw_queues(q, true);
+	if (!depth || force)
+		blk_mq_run_hw_queues(q, true);
 }
-EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
+EXPORT_SYMBOL_GPL(__blk_mq_unquiesce_queue);
 
 void blk_mq_wake_waiters(struct request_queue *q)
 {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..1ab1168cd46a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3371,7 +3371,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	if (ctrl->admin_q)
-		blk_mq_unquiesce_queue(ctrl->admin_q);
+		__blk_mq_unquiesce_queue(ctrl->admin_q, true);
 
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		/*
@@ -3384,7 +3384,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 		blk_set_queue_dying(ns->queue);
 
 		/* Forcibly unquiesce queues to avoid blocking dispatch */
-		blk_mq_unquiesce_queue(ns->queue);
+		__blk_mq_unquiesce_queue(ns->queue, true);
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c862e2..8a01822dc09e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -265,7 +265,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
-void blk_mq_unquiesce_queue(struct request_queue *q);
+void __blk_mq_unquiesce_queue(struct request_queue *q, bool force);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
@@ -286,6 +286,11 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+static inline void blk_mq_unquiesce_queue(struct request_queue *q)
+{
+	__blk_mq_unquiesce_queue(q, false);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..ee3731f417c0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,8 @@ struct request_queue {
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
 
+	int			quiesce_depth;
+
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
 	struct bsg_class_device bsg_dev;
-- 
2.9.5

* [PATCH V2 3/6] blk-mq: quiesce queue during switching io sched and updating nr_requests
From: Ming Lei @ 2017-12-14  2:31 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

Dispatch may still be in progress after the queue is frozen, so we have to
quiesce the queue before switching the IO scheduler and before updating
nr_requests.

Also, while switching io schedulers, blk_mq_run_hw_queue() may still be
called from somewhere (such as from nvme_reset_work()), and the io
scheduler's per-hctx data may not be set up yet, causing an oops even
inside blk_mq_hctx_has_pending(); for example, it can be run just between:

        ret = e->ops.mq.init_sched(q, e);
AND
        ret = e->ops.mq.init_hctx(hctx, i)

inside blk_mq_init_sched().

This basically reverts commit 7a148c2fcff83309 ("block: don't call
blk_mq_quiesce_queue() after queue is frozen"), and makes sure
blk_mq_hctx_has_pending() won't be called while the queue is quiesced.
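
For reference, a heavily simplified shape of blk_mq_init_sched() showing
the window:

        int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
        {
                ...
                ret = e->ops.mq.init_sched(q, e);
                ...
                /*
                 * A concurrent blk_mq_run_hw_queue() here can reach the
                 * scheduler's .has_work (e.g. kyber_has_work()) before
                 * hctx->sched_data has been set up.
                 */
                if (e->ops.mq.init_hctx) {
                        queue_for_each_hw_ctx(q, hctx, i) {
                                ret = e->ops.mq.init_hctx(hctx, i);
                                ...
                        }
                }
                ...
        }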

Fixes: 7a148c2fcff83309 ("block: don't call blk_mq_quiesce_queue() after queue is frozen")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   | 27 ++++++++++++++++++++++++++-
 block/elevator.c |  2 ++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5d69c8075339..85954a0b4394 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1296,7 +1296,30 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	if (blk_mq_hctx_has_pending(hctx)) {
+	int srcu_idx;
+	bool need_run;
+
+	/*
+	 * When queue is quiesced, we may be switching io scheduler, or
+	 * updating nr_hw_queues, or other things, and we can't run queue
+	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
+	 *
+	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
+	 * quiesced.
+	 */
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		need_run = !blk_queue_quiesced(hctx->queue) &&
+			blk_mq_hctx_has_pending(hctx);
+		rcu_read_unlock();
+	} else {
+		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+		need_run = !blk_queue_quiesced(hctx->queue) &&
+			blk_mq_hctx_has_pending(hctx);
+		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+	}
+
+	if (need_run) {
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
 		return true;
 	}
@@ -2721,6 +2744,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		return -EINVAL;
 
 	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -2744,6 +2768,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	if (!ret)
 		q->nr_requests = nr;
 
+	blk_mq_unquiesce_queue(q);
 	blk_mq_unfreeze_queue(q);
 
 	return ret;
diff --git a/block/elevator.c b/block/elevator.c
index 7bda083d5968..138faeb08a7c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -968,6 +968,7 @@ static int elevator_switch_mq(struct request_queue *q,
 	int ret;
 
 	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
 		if (q->elevator->registered)
@@ -994,6 +995,7 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
+	blk_mq_unquiesce_queue(q);
 	blk_mq_unfreeze_queue(q);
 	return ret;
 }
-- 
2.9.5

* [PATCH V2 4/6] blk-mq: avoid to map CPU into stale hw queue
From: Ming Lei @ 2017-12-14  2:31 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

blk_mq_pci_map_queues() may leave a CPU mapped to no hw queue at all, while
that CPU's previous mapping hasn't been cleared and may still point to a
stale hw queue index.

This patch fixes the issue by clearing the mapping table in
blk_mq_update_queue_map() before the transport's .map_queues sets it up.
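
An illustration with made-up numbers (8 possible CPUs, nr_hw_queues shrunk
from 4 to 2 across a reset):

        before remap (nr_hw_queues == 4):   mq_map[] = { 0 0 1 1 2 2 3 3 }
        remap to 2 queues, where CPUs 4-7
        land in no vector's affinity mask:  mq_map[] = { 0 0 1 1 2 2 3 3 }
                                                          (last four stale)
        with the pre-clear added here:      mq_map[] = { 0 0 1 1 0 0 0 0 }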

The following oops, reported by Yi Zhang, is fixed:

[  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
[  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
[  101.216346] PGD 0 P4D 0
[  101.219171] Oops: 0000 [#1] SMP
[  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
[  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
[  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
[  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
[  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
[  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
[  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
[  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
[  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
[  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
[  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
[  101.387714] Call Trace:
[  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
[  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
[  101.400848]  ? pick_next_task_fair+0x290/0x5f0
[  101.405807]  ? __switch_to+0x1f5/0x430
[  101.409988]  ? put_prev_entity+0x2f/0xd0
[  101.414365]  process_one_work+0x141/0x340
[  101.418836]  worker_thread+0x47/0x3e0
[  101.422921]  kthread+0xf5/0x130
[  101.426424]  ? rescuer_thread+0x380/0x380
[  101.430896]  ? kthread_associate_blkcg+0x90/0x90
[  101.436048]  ret_from_fork+0x1f/0x30
[  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
[  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
[  101.468205] CR2: 0000000000000098
[  101.471907] ---[ end trace 5fe710f98228a3ca ]---
[  101.482489] Kernel panic - not syncing: Fatal exception
[  101.488505] Kernel Offset: disabled
[  101.497752] ---[ end Kernel panic - not syncing: Fatal exception

Suggested-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85954a0b4394..6dae8f631253 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2633,9 +2633,27 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-	if (set->ops->map_queues)
+	if (set->ops->map_queues) {
+		int cpu;
+		/*
+		 * transport .map_queues is usually done in the following
+		 * way:
+		 *
+		 * for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		 * 	mask = get_cpu_mask(queue)
+		 * 	for_each_cpu(cpu, mask)
+		 * 		set->mq_map[cpu] = queue;
+		 * }
+		 *
+		 * When we need to remap, the table has to be cleared for
+		 * killing stale mapping since one CPU may not be mapped
+		 * to any hw queue.
+		 */
+		for_each_possible_cpu(cpu)
+			set->mq_map[cpu] = 0;
+
 		return set->ops->map_queues(set);
-	else
+	} else
 		return blk_mq_map_queues(set);
 }
 
-- 
2.9.5

* [PATCH V2 5/6] blk-mq: fix race between updating nr_hw_queues and switching io sched
From: Ming Lei @ 2017-12-14  2:31 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
can be allocated and q->nr_hw_queues is used, so a race between the two is
inevitable; for example, blk_mq_init_sched() may trigger a use-after-free
on an hctx that is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is
decreased.

This patch fixes the race by holding q->sysfs_lock.
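
For context, the scheduler-switch side already runs under q->sysfs_lock,
which queue_attr_store() takes before calling elv_iosched_store(), so
taking the same mutex around the hctx realloc loop serializes the two
paths; roughly:

        elv_iosched_store()                blk_mq_update_nr_hw_queues()
          [q->sysfs_lock already held        blk_mq_realloc_hw_ctxs()
           by queue_attr_store()]              mutex_lock(&q->sysfs_lock)
            elevator_switch()                  ... free/realloc hctxs ...
              blk_mq_init_sched()              mutex_unlock(&q->sysfs_lock)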

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6dae8f631253..d6c75daec41b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2418,6 +2418,9 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
 
 	blk_mq_sysfs_unregister(q);
+
+	/* protect against switching io scheduler  */
+	mutex_lock(&q->sysfs_lock);
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int node;
 
@@ -2462,6 +2465,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		}
 	}
 	q->nr_hw_queues = i;
+	mutex_unlock(&q->sysfs_lock);
 	blk_mq_sysfs_register(q);
 }
 
-- 
2.9.5

* [PATCH V2 6/6] nvme-pci: remove .init_request callback
From: Ming Lei @ 2017-12-14  2:31 UTC
  To: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, Ming Lei

Setting 'nvmeq' in nvme_init_request() may cause a race, because
.init_request is called while switching the io scheduler, which may happen
while the NVMe device is being reset and its nvme queues are being freed
and recreated. There is no synchronization between the two paths.
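
In contrast, hctx->driver_data is (re)assigned in .init_hctx after the
reset handler has resized the queues; a simplified sketch of the resulting
flow, condensed from the patch below:

        static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                                  unsigned int hctx_idx)
        {
                struct nvme_dev *dev = data;
                struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

                hctx->driver_data = nvmeq;
                return 0;
        }

        static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
                        const struct blk_mq_queue_data *bd)
        {
                struct nvme_queue *nvmeq = hctx->driver_data;
                ...
                ret = nvme_init_iod(req, dev, nvmeq);   /* sets iod->nvmeq */
                ...
        }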

This patch removes the .init_request callback and instead picks up the
nvmeq at runtime via hctx->driver_data, which is reliable because
.init_hctx is always called after the hw queues are resized in the reset
handler. This fixes the following bug:

[   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[   93.274146] invalid opcode: 0000 [#1] SMP
[   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[   93.446368] Call Trace:
[   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
[   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[   93.459214]  blk_mq_init_sched+0x7e/0x140
[   93.463687]  elevator_switch+0x5a/0x1f0
[   93.467966]  ? elevator_get.isra.17+0x52/0xc0
[   93.472826]  elv_iosched_store+0xde/0x150
[   93.477299]  queue_attr_store+0x4e/0x90
[   93.481580]  kernfs_fop_write+0xfa/0x180
[   93.485958]  __vfs_write+0x33/0x170
[   93.489851]  ? __inode_security_revalidate+0x4c/0x60
[   93.495390]  ? selinux_file_permission+0xda/0x130
[   93.500641]  ? _cond_resched+0x15/0x30
[   93.504815]  vfs_write+0xad/0x1a0
[   93.508512]  SyS_write+0x52/0xc0
[   93.512113]  do_syscall_64+0x61/0x1a0
[   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
[   93.521351] RIP: 0033:0x7f33ce96aab0
[   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[   93.602273] ---[ end trace 810dde3993e5f14e ]---

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..9eff33b21ca5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -397,19 +397,6 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	return 0;
 }
 
-static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
-		unsigned int hctx_idx, unsigned int numa_node)
-{
-	struct nvme_dev *dev = set->driver_data;
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	int queue_idx = (set == &dev->tagset) ? hctx_idx + 1 : 0;
-	struct nvme_queue *nvmeq = dev->queues[queue_idx];
-
-	BUG_ON(!nvmeq);
-	iod->nvmeq = nvmeq;
-	return 0;
-}
-
 static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
@@ -448,7 +435,8 @@ static void **nvme_pci_iod_list(struct request *req)
 	return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
 }
 
-static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
+static blk_status_t nvme_init_iod(struct request *rq,
+		struct nvme_dev *dev, struct nvme_queue *nvmeq)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(rq);
 	int nseg = blk_rq_nr_phys_segments(rq);
@@ -469,6 +457,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->npages = -1;
 	iod->nents = 0;
 	iod->length = size;
+	iod->nvmeq = nvmeq;
 
 	return BLK_STS_OK;
 }
@@ -876,7 +865,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
-	ret = nvme_init_iod(req, dev);
+	ret = nvme_init_iod(req, dev, nvmeq);
 	if (ret)
 		goto out_free_cmd;
 
@@ -1484,7 +1473,6 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_admin_init_hctx,
 	.exit_hctx      = nvme_admin_exit_hctx,
-	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
 };
 
@@ -1492,7 +1480,6 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_init_hctx,
-	.init_request	= nvme_init_request,
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
-- 
2.9.5

* Re: [PATCH V2 0/6] blk-mq: fix race related with device deletion/reset/switching sched
  2017-12-14  2:30 ` Ming Lei
@ 2017-12-19 15:30   ` Yi Zhang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yi Zhang @ 2017-12-19 15:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Sagi Grimberg, Keith Busch, Johannes Thumshirn, Bart Van Assche

Thanks, Ming.

Tested-by: Yi Zhang <yi.zhang@redhat.com>

Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 6/6] nvme-pci: remove .init_request callback
  2017-12-14  2:31   ` Ming Lei
@ 2017-12-21  8:20     ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2017-12-21  8:20 UTC (permalink / raw)
  To: Ming Lei, linux-nvme, Christoph Hellwig, Jens Axboe, linux-block
  Cc: Bart Van Assche, Keith Busch, Yi Zhang, Johannes Thumshirn

Ming,

I'd prefer that we make the pci driver match
the rest of the drivers in nvme.

IMO it would be better to allocate a queues array at probe time
and simply reuse it at reset sequence.

Can this (untested) patch also fix the issue you're seeing:
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f26aaa393016..a8edb29dac16 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -75,7 +75,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
   */
  struct nvme_dev {
-       struct nvme_queue **queues;
+       struct nvme_queue *queues;
         struct blk_mq_tag_set tagset;
         struct blk_mq_tag_set admin_tagset;
         u32 __iomem *dbs;
@@ -365,7 +365,7 @@ static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                                 unsigned int hctx_idx)
  {
         struct nvme_dev *dev = data;
-       struct nvme_queue *nvmeq = dev->queues[0];
+       struct nvme_queue *nvmeq = &dev->queues[0];

         WARN_ON(hctx_idx != 0);
         WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
@@ -387,7 +387,7 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                           unsigned int hctx_idx)
  {
         struct nvme_dev *dev = data;
-       struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
+       struct nvme_queue *nvmeq = &dev->queues[hctx_idx + 1];

         if (!nvmeq->tags)
                 nvmeq->tags = &dev->tagset.tags[hctx_idx];
@@ -403,7 +403,7 @@ static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
         struct nvme_dev *dev = set->driver_data;
         struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
         int queue_idx = (set == &dev->tagset) ? hctx_idx + 1 : 0;
-       struct nvme_queue *nvmeq = dev->queues[queue_idx];
+       struct nvme_queue *nvmeq = &dev->queues[queue_idx];

         BUG_ON(!nvmeq);
         iod->nvmeq = nvmeq;
@@ -1046,7 +1046,7 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
  static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl)
  {
         struct nvme_dev *dev = to_nvme_dev(ctrl);
-       struct nvme_queue *nvmeq = dev->queues[0];
+       struct nvme_queue *nvmeq = &dev->queues[0];
         struct nvme_command c;

         memset(&c, 0, sizeof(c));
@@ -1290,10 +1290,8 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
         int i;

         for (i = dev->ctrl.queue_count - 1; i >= lowest; i--) {
-               struct nvme_queue *nvmeq = dev->queues[i];
                 dev->ctrl.queue_count--;
-               dev->queues[i] = NULL;
-               nvme_free_queue(nvmeq);
+               nvme_free_queue(&dev->queues[i]);
         }
  }

@@ -1325,7 +1323,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)

  static void nvme_disable_admin_queue(struct nvme_dev *dev, bool shutdown)
  {
-       struct nvme_queue *nvmeq = dev->queues[0];
+       struct nvme_queue *nvmeq = &dev->queues[0];

         if (!nvmeq)
                 return;
@@ -1387,10 +1385,7 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
  static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
                                                         int depth, int node)
  {
-       struct nvme_queue *nvmeq = kzalloc_node(sizeof(*nvmeq), GFP_KERNEL,
-                                                       node);
-       if (!nvmeq)
-               return NULL;
+       struct nvme_queue *nvmeq = &dev->queues[qid];

         nvmeq->cqes = dma_zalloc_coherent(dev->dev, CQ_SIZE(depth),
                                           &nvmeq->cq_dma_addr, GFP_KERNEL);
@@ -1409,7 +1404,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
         nvmeq->q_depth = depth;
         nvmeq->qid = qid;
         nvmeq->cq_vector = -1;
-       dev->queues[qid] = nvmeq;
         dev->ctrl.queue_count++;

         return nvmeq;
@@ -1592,7 +1586,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
         if (result < 0)
                 return result;

-       nvmeq = dev->queues[0];
+       nvmeq = &dev->queues[0];
         if (!nvmeq) {
                 nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH,
                                         dev_to_node(dev->dev));
@@ -1638,7 +1632,7 @@ static int nvme_create_io_queues(struct nvme_dev *dev)

         max = min(dev->max_qid, dev->ctrl.queue_count - 1);
         for (i = dev->online_queues; i <= max; i++) {
-               ret = nvme_create_queue(dev->queues[i], i);
+               ret = nvme_create_queue(&dev->queues[i], i);
                 if (ret)
                         break;
         }
@@ -1894,7 +1888,7 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)

  static int nvme_setup_io_queues(struct nvme_dev *dev)
  {
-       struct nvme_queue *adminq = dev->queues[0];
+       struct nvme_queue *adminq = &dev->queues[0];
         struct pci_dev *pdev = to_pci_dev(dev->dev);
         int result, nr_io_queues;
         unsigned long size;
@@ -2020,7 +2014,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev, int queues)
   retry:
                 timeout = ADMIN_TIMEOUT;
                 for (; i > 0; i--, sent++)
-                       if (nvme_delete_queue(dev->queues[i], opcode))
+                       if (nvme_delete_queue(&dev->queues[i], opcode))
                                 break;

                 while (sent--) {
@@ -2209,7 +2203,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)

         queues = dev->online_queues - 1;
         for (i = dev->ctrl.queue_count - 1; i > 0; i--)
-               nvme_suspend_queue(dev->queues[i]);
+               nvme_suspend_queue(&dev->queues[i]);

         if (dead) {
                 /* A device might become IO incapable very soon during
@@ -2217,7 +2211,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
                  * queue_count can be 0 here.
                  */
                 if (dev->ctrl.queue_count)
-                       nvme_suspend_queue(dev->queues[0]);
+                       nvme_suspend_queue(&dev->queues[0]);
         } else {
                 nvme_disable_io_queues(dev, queues);
                 nvme_disable_admin_queue(dev, shutdown);
@@ -2462,6 +2456,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
         int node, result = -ENOMEM;
         struct nvme_dev *dev;
         unsigned long quirks = id->driver_data;
+       unsigned int alloc_size;

         node = dev_to_node(&pdev->dev);
         if (node == NUMA_NO_NODE)
@@ -2470,8 +2465,9 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
         dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
         if (!dev)
                 return -ENOMEM;
-       dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
-                                                       GFP_KERNEL, node);
+
+       alloc_size = (num_possible_cpus() + 1) * sizeof(struct nvme_queue *);
+       dev->queues = kzalloc_node(alloc_size, GFP_KERNEL, node);
        if (!dev->queues)
                goto free;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 6/6] nvme-pci: remove .init_request callback
  2017-12-21  8:20     ` Sagi Grimberg
@ 2017-12-21  8:36       ` Johannes Thumshirn
  -1 siblings, 0 replies; 36+ messages in thread
From: Johannes Thumshirn @ 2017-12-21  8:36 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Ming Lei, linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Yi Zhang

On Thu, Dec 21, 2017 at 10:20:45AM +0200, Sagi Grimberg wrote:
> @@ -2470,8 +2465,9 @@ static int nvme_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
>         dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
>         if (!dev)
>                 return -ENOMEM;
> -       dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void
> *),
> -                                                       GFP_KERNEL, node);
> +
> +       alloc_size = (num_possible_cpus() + 1) * sizeof(struct nvme_queue
> *);
> +       dev->queues = kzalloc_node(alloc_size, GFP_KERNEL, node);

Nit:
	  dev->queues = kcalloc_node(num_possible_cpus() + 1,
				     sizeof(struct nvme_queue *),
				     GFP_KERNEL, node);
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 6/6] nvme-pci: remove .init_request callback
  2017-12-21  8:20     ` Sagi Grimberg
@ 2017-12-22  1:34       ` Ming Lei
  -1 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2017-12-22  1:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Yi Zhang, Johannes Thumshirn

Hi Sagi,

On Thu, Dec 21, 2017 at 10:20:45AM +0200, Sagi Grimberg wrote:
> Ming,
> 
> I'd prefer that we make the pci driver match
> the rest of the drivers in nvme.

OK, this way looks better.

> 
> IMO it would be better to allocate a queues array at probe time
> and simply reuse it at reset sequence.
> 
> Can this (untested) patch also fix the issue you're seeing:

Please prepare a formal one (at least tested in the normal case); either I
or Zhang Yi can test/verify it.

> --
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f26aaa393016..a8edb29dac16 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -75,7 +75,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool
> shutdown);
>   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
>   */
>  struct nvme_dev {
> -       struct nvme_queue **queues;
> +       struct nvme_queue *queues;
>         struct blk_mq_tag_set tagset;
>         struct blk_mq_tag_set admin_tagset;
>         u32 __iomem *dbs;
> @@ -365,7 +365,7 @@ static int nvme_admin_init_hctx(struct blk_mq_hw_ctx
> *hctx, void *data,
>                                 unsigned int hctx_idx)
>  {
>         struct nvme_dev *dev = data;
> -       struct nvme_queue *nvmeq = dev->queues[0];
> +       struct nvme_queue *nvmeq = &dev->queues[0];
> 
>         WARN_ON(hctx_idx != 0);
>         WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
> @@ -387,7 +387,7 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx,
> void *data,
>                           unsigned int hctx_idx)
>  {
>         struct nvme_dev *dev = data;
> -       struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
> +       struct nvme_queue *nvmeq = &dev->queues[hctx_idx + 1];
> 
>         if (!nvmeq->tags)
>                 nvmeq->tags = &dev->tagset.tags[hctx_idx];
> @@ -403,7 +403,7 @@ static int nvme_init_request(struct blk_mq_tag_set *set,
> struct request *req,
>         struct nvme_dev *dev = set->driver_data;
>         struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>         int queue_idx = (set == &dev->tagset) ? hctx_idx + 1 : 0;
> -       struct nvme_queue *nvmeq = dev->queues[queue_idx];
> +       struct nvme_queue *nvmeq = &dev->queues[queue_idx];
> 
>         BUG_ON(!nvmeq);
>         iod->nvmeq = nvmeq;
> @@ -1046,7 +1046,7 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx,
> unsigned int tag)
>  static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl)
>  {
>         struct nvme_dev *dev = to_nvme_dev(ctrl);
> -       struct nvme_queue *nvmeq = dev->queues[0];
> +       struct nvme_queue *nvmeq = &dev->queues[0];
>         struct nvme_command c;
> 
>         memset(&c, 0, sizeof(c));
> @@ -1290,10 +1290,8 @@ static void nvme_free_queues(struct nvme_dev *dev,
> int lowest)
>         int i;
> 
>         for (i = dev->ctrl.queue_count - 1; i >= lowest; i--) {
> -               struct nvme_queue *nvmeq = dev->queues[i];
>                 dev->ctrl.queue_count--;
> -               dev->queues[i] = NULL;
> -               nvme_free_queue(nvmeq);
> +               nvme_free_queue(&dev->queues[i]);
>         }
>  }
> 
> @@ -1325,7 +1323,7 @@ static int nvme_suspend_queue(struct nvme_queue
> *nvmeq)
> 
>  static void nvme_disable_admin_queue(struct nvme_dev *dev, bool shutdown)
>  {
> -       struct nvme_queue *nvmeq = dev->queues[0];
> +       struct nvme_queue *nvmeq = &dev->queues[0];
> 
>         if (!nvmeq)
>                 return;
> @@ -1387,10 +1385,7 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev,
> struct nvme_queue *nvmeq,
>  static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
>                                                         int depth, int node)
>  {
> -       struct nvme_queue *nvmeq = kzalloc_node(sizeof(*nvmeq), GFP_KERNEL,
> -                                                       node);
> -       if (!nvmeq)
> -               return NULL;
> +       struct nvme_queue *nvmeq = &dev->queues[qid];

Maybe you need to zero *nvmeq again, since that is done in the current code.
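
For illustration, the re-zeroing could go at the top of nvme_alloc_queue(),
something like (a sketch of the suggestion above, not the posted code):

	struct nvme_queue *nvmeq = &dev->queues[qid];

	/* entries are reused across resets; clear stale state that the
	 * removed kzalloc_node() used to guarantee was zeroed */
	memset(nvmeq, 0, sizeof(*nvmeq));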

> 
>         nvmeq->cqes = dma_zalloc_coherent(dev->dev, CQ_SIZE(depth),
>                                           &nvmeq->cq_dma_addr, GFP_KERNEL);
> @@ -1409,7 +1404,6 @@ static struct nvme_queue *nvme_alloc_queue(struct
> nvme_dev *dev, int qid,
>         nvmeq->q_depth = depth;
>         nvmeq->qid = qid;
>         nvmeq->cq_vector = -1;
> -       dev->queues[qid] = nvmeq;
>         dev->ctrl.queue_count++;
> 
>         return nvmeq;
> @@ -1592,7 +1586,7 @@ static int nvme_pci_configure_admin_queue(struct
> nvme_dev *dev)
>         if (result < 0)
>                 return result;
> 
> -       nvmeq = dev->queues[0];
> +       nvmeq = &dev->queues[0];
>         if (!nvmeq) {
>                 nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH,
>                                         dev_to_node(dev->dev));
> @@ -1638,7 +1632,7 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
> 
>         max = min(dev->max_qid, dev->ctrl.queue_count - 1);
>         for (i = dev->online_queues; i <= max; i++) {
> -               ret = nvme_create_queue(dev->queues[i], i);
> +               ret = nvme_create_queue(&dev->queues[i], i);
>                 if (ret)
>                         break;
>         }
> @@ -1894,7 +1888,7 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
> 
>  static int nvme_setup_io_queues(struct nvme_dev *dev)
>  {
> -       struct nvme_queue *adminq = dev->queues[0];
> +       struct nvme_queue *adminq = &dev->queues[0];
>         struct pci_dev *pdev = to_pci_dev(dev->dev);
>         int result, nr_io_queues;
>         unsigned long size;
> @@ -2020,7 +2014,7 @@ static void nvme_disable_io_queues(struct nvme_dev
> *dev, int queues)
>   retry:
>                 timeout = ADMIN_TIMEOUT;
>                 for (; i > 0; i--, sent++)
> -                       if (nvme_delete_queue(dev->queues[i], opcode))
> +                       if (nvme_delete_queue(&dev->queues[i], opcode))
>                                 break;
> 
>                 while (sent--) {
> @@ -2209,7 +2203,7 @@ static void nvme_dev_disable(struct nvme_dev *dev,
> bool shutdown)
> 
>         queues = dev->online_queues - 1;
>         for (i = dev->ctrl.queue_count - 1; i > 0; i--)
> -               nvme_suspend_queue(dev->queues[i]);
> +               nvme_suspend_queue(&dev->queues[i]);
> 
>         if (dead) {
>                 /* A device might become IO incapable very soon during
> @@ -2217,7 +2211,7 @@ static void nvme_dev_disable(struct nvme_dev *dev,
> bool shutdown)
>                  * queue_count can be 0 here.
>                  */
>                 if (dev->ctrl.queue_count)
> -                       nvme_suspend_queue(dev->queues[0]);
> +                       nvme_suspend_queue(&dev->queues[0]);
>         } else {
>                 nvme_disable_io_queues(dev, queues);
>                 nvme_disable_admin_queue(dev, shutdown);
> @@ -2462,6 +2456,7 @@ static int nvme_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
>         int node, result = -ENOMEM;
>         struct nvme_dev *dev;
>         unsigned long quirks = id->driver_data;
> +       unsigned int alloc_size;
> 
>         node = dev_to_node(&pdev->dev);
>         if (node == NUMA_NO_NODE)
> @@ -2470,8 +2465,9 @@ static int nvme_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
>         dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
>         if (!dev)
>                 return -ENOMEM;
> -       dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void
> *),
> -                                                       GFP_KERNEL, node);
> +
> +       alloc_size = (num_possible_cpus() + 1) * sizeof(struct nvme_queue
> *);

The element size should be 'sizeof(struct nvme_queue)'.
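
That is, combined with Johannes' kcalloc_node() nit, the allocation would
end up as (illustrative):

	dev->queues = kcalloc_node(num_possible_cpus() + 1,
				   sizeof(struct nvme_queue),
				   GFP_KERNEL, node);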

Thanks,
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 6/6] nvme-pci: remove .init_request callback
  2017-12-22  1:34       ` Ming Lei
@ 2017-12-24  8:50         ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2017-12-24  8:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Yi Zhang, Johannes Thumshirn


> Please prepare a formal one (at least tested in the normal case); either I
> or Zhang Yi can test/verify it.

OK.

>> @@ -1387,10 +1385,7 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev,
>> struct nvme_queue *nvmeq,
>>   static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
>>                                                          int depth, int node)
>>   {
>> -       struct nvme_queue *nvmeq = kzalloc_node(sizeof(*nvmeq), GFP_KERNEL,
>> -                                                       node);
>> -       if (!nvmeq)
>> -               return NULL;
>> +       struct nvme_queue *nvmeq = &dev->queues[qid];
> 
> Maybe you need to zero *nvmeq again, since that is done in the current code.

Relying on this is not a good idea, so I think we're better off without
it: if something depends on the stale state being cleared, I want to know
about it and fix it.

>> @@ -2470,8 +2465,9 @@ static int nvme_probe(struct pci_dev *pdev, const
>> struct pci_device_id *id)
>>          dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
>>          if (!dev)
>>                  return -ENOMEM;
>> -       dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void
>> *),
>> -                                                       GFP_KERNEL, node);
>> +
>> +       alloc_size = (num_possible_cpus() + 1) * sizeof(struct nvme_queue
>> *);
> 
> The element size should be 'sizeof(struct nvme_queue)'.

Right.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 1/6] blk-mq: quiesce queue before freeing queue
  2017-12-14  2:30   ` Ming Lei
@ 2017-12-29  9:54     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn, stable

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/6] blk-mq: support concurrent blk_mq_quiesce_queue()
  2017-12-14  2:30   ` Ming Lei
@ 2017-12-29  9:58     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn

The quiesce count looks good to me.  But I'm really worried about
the force unquiesce in nvme.  I'd feel much better if we could find
a way to move this into the core.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 3/6] blk-mq: quiesce queue during switching io sched and updating nr_requests
  2017-12-14  2:31   ` Ming Lei
@ 2017-12-29  9:58     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 4/6] blk-mq: avoid to map CPU into stale hw queue
  2017-12-14  2:31   ` Ming Lei
@ 2017-12-29  9:59     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 5/6] blk-mq: fix race between updating nr_hw_queues and switching io sched
  2017-12-14  2:31   ` Ming Lei
@ 2017-12-29  9:59     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, Christoph Hellwig, Jens Axboe, linux-block,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Yi Zhang,
	Johannes Thumshirn

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/6] blk-mq: support concurrent blk_mq_quiesce_queue()
  2017-12-29  9:58     ` Christoph Hellwig
@ 2018-01-02  3:01       ` Ming Lei
  -1 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2018-01-02  3:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Jens Axboe, linux-block, Bart Van Assche,
	Keith Busch, Sagi Grimberg, Yi Zhang, Johannes Thumshirn

On Fri, Dec 29, 2017 at 10:58:33AM +0100, Christoph Hellwig wrote:
> The quiesce count looks good to me.  But I'm really worried about
> the force unquiesce in nvme.  I'd feel much better if we could find

This patch doesn't change the behaviour w.r.t. force unquiesce in NVMe,
which has been done since commit 443bd90f2cca ("nvme: host: unquiesce queue
in nvme_kill_queues()").

The reason is that NVMe's usage of blk_mq_quiesce_queue() is a bit complicated,
for example:

- nvme_dev_disable(false) may be called twice in reset_work()
- the queue may never be started in nvme_fw_act_work()
...

But blk_cleanup_queue() can be triggered at any time.

> a way to move this into the core.

We can't move the force unquiesce into blk_set_queue_dying() or other
places in the core before calling blk_cleanup_queue(), because that may
break SCSI, which uses quiesce to implement device block.
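
For reference, the quiesce count added by patch 2/6 is reference-counted
quiesce. The sketch below shows only the idea; the counter field, locking
and flag handling here are illustrative assumptions, not the posted code:

	void blk_mq_quiesce_queue(struct request_queue *q)
	{
		bool first;

		spin_lock_irq(q->queue_lock);
		first = !q->quiesce_depth++;		/* illustrative field */
		spin_unlock_irq(q->queue_lock);

		if (first)
			blk_mq_quiesce_queue_nowait(q);	/* first caller quiesces */
		/* wait out in-flight dispatch whether or not we were first */
		synchronize_rcu();
	}

	void blk_mq_unquiesce_queue(struct request_queue *q)
	{
		bool run = false;

		spin_lock_irq(q->queue_lock);
		if (!WARN_ON_ONCE(!q->quiesce_depth) && !--q->quiesce_depth) {
			queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
			run = true;
		}
		spin_unlock_irq(q->queue_lock);

		if (run)	/* the last unquiesce restarts dispatch */
			blk_mq_run_hw_queues(q, true);
	}

A force unquiesce such as the one in nvme_kill_queues() then has to reset
the count rather than decrement it, which is another reason it is hard to
hide in the core.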

thanks,
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-01-02  3:02 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-14  2:30 [PATCH V2 0/6] blk-mq: fix race related with device deletion/reset/switching sched Ming Lei
2017-12-14  2:30 ` [PATCH V2 1/6] blk-mq: quiesce queue before freeing queue Ming Lei
2017-12-29  9:54   ` Christoph Hellwig
2017-12-14  2:30 ` [PATCH V2 2/6] blk-mq: support concurrent blk_mq_quiesce_queue() Ming Lei
2017-12-29  9:58   ` Christoph Hellwig
2018-01-02  3:01     ` Ming Lei
2017-12-14  2:31 ` [PATCH V2 3/6] blk-mq: quiesce queue during switching io sched and updating nr_requests Ming Lei
2017-12-29  9:58   ` Christoph Hellwig
2017-12-14  2:31 ` [PATCH V2 4/6] blk-mq: avoid to map CPU into stale hw queue Ming Lei
2017-12-29  9:59   ` Christoph Hellwig
2017-12-14  2:31 ` [PATCH V2 5/6] blk-mq: fix race between updating nr_hw_queues and switching io sched Ming Lei
2017-12-29  9:59   ` Christoph Hellwig
2017-12-14  2:31 ` [PATCH V2 6/6] nvme-pci: remove .init_request callback Ming Lei
2017-12-21  8:20   ` Sagi Grimberg
2017-12-21  8:36     ` Johannes Thumshirn
2017-12-22  1:34     ` Ming Lei
2017-12-24  8:50       ` Sagi Grimberg
2017-12-19 15:30 ` [PATCH V2 0/6] blk-mq: fix race related with device deletion/reset/switching sched Yi Zhang
