* [PATCH 0/2] nvme: flush rescan worker before resetting
@ 2019-06-18 10:10 Hannes Reinecke
  2019-06-18 10:10 ` [PATCH 1/2] nvme: Do not remove namespaces during reset Hannes Reinecke
  2019-06-18 10:10 ` [PATCH 2/2] nvme: flush scan_work when resetting controller Hannes Reinecke
  0 siblings, 2 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-18 10:10 UTC (permalink / raw)


Hi all,

There is a race condition between namespace rescanning and controller
reset; during controller reset all namespaces are quiesced via
nvme_stop_queues(), and after reset all namespaces are unquiesced
again.
If namespace scanning is active at the time a controller reset is
triggered, the rescan code will call nvme_ns_remove(), which will then
cause a kernel crash in nvme_start_ctrl() as it trips over
uninitialized namespaces.

To fix this two patches are required; the first one skips the call
to nvme_ns_remove() if the controller is not live, and the second one
flushes the rescan worker before resetting to prevent this situation
from occurring.

As usual, comments and reviews are welcome.

Hannes Reinecke (2):
  nvme: Do not remove namespaces during reset
  nvme: flush scan_work when resetting controller

 drivers/nvme/host/core.c | 21 +++++++++++++++++++++
 drivers/nvme/host/fc.c   |  1 +
 drivers/nvme/host/rdma.c |  1 +
 drivers/nvme/host/tcp.c  |  1 +
 4 files changed, 24 insertions(+)

-- 
2.16.4


* [PATCH 1/2] nvme: Do not remove namespaces during reset
  2019-06-18 10:10 [PATCH 0/2] nvme: flush rescan worker before resetting Hannes Reinecke
@ 2019-06-18 10:10 ` Hannes Reinecke
  2019-06-18 17:30   ` Sagi Grimberg
  2019-06-20  1:22   ` Ming Lei
  2019-06-18 10:10 ` [PATCH 2/2] nvme: flush scan_work when resetting controller Hannes Reinecke
  1 sibling, 2 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-18 10:10 UTC (permalink / raw)


When a controller is resetting or reconnecting there is no way
we could establish the validity of any given namespace.
So do not call nvme_ns_remove() during resetting or reconnecting
and rely on the call to nvme_queue_scan() after reset to fix up
things.

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba2079d217da..e872591e5fe7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3358,6 +3358,17 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
+	/*
+	 * We cannot make any assumptions about namespaces during
+	 * reset; in particular we shouldn't attempt to remove them
+	 * as I/O might still be queued to them.
+	 * So ignore this call during reset and rely on the
+	 * rescan after reset to clean up things again.
+	 */
+	if (ns->ctrl->state == NVME_CTRL_RESETTING ||
+	    ns->ctrl->state == NVME_CTRL_CONNECTING)
+		return;
+
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
-- 
2.16.4


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-18 10:10 [PATCH 0/2] nvme: flush rescan worker before resetting Hannes Reinecke
  2019-06-18 10:10 ` [PATCH 1/2] nvme: Do not remove namespaces during reset Hannes Reinecke
@ 2019-06-18 10:10 ` Hannes Reinecke
  2019-06-18 17:41   ` Sagi Grimberg
  2019-06-20  1:36   ` Ming Lei
  1 sibling, 2 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-18 10:10 UTC (permalink / raw)


When resetting the controller there is no point whatsoever to
have a scan run in parallel; we cannot access the controller and
we cannot tell which devices are present and which not.
Additionally we'll run a scan after reset anyway.
So flush existing scans before reconnecting, ensuring to
short-circuit the scan workqueue function if the controller state
isn't live to avoid lockups.

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 10 ++++++++++
 drivers/nvme/host/fc.c   |  1 +
 drivers/nvme/host/rdma.c |  1 +
 drivers/nvme/host/tcp.c  |  1 +
 4 files changed, 13 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e872591e5fe7..dc3f1ff12276 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1627,6 +1627,9 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	sector_t capacity = le64_to_cpu(id->nsze) << (ns->lba_shift - 9);
 	unsigned short bs = 1 << ns->lba_shift;
 
+	if (ns->ctrl->state != NVME_CTRL_LIVE)
+		return;
+
 	if (ns->lba_shift > PAGE_SHIFT) {
 		/* unsupported block size, set capacity to 0 later */
 		bs = (1 << 9);
@@ -1702,6 +1705,9 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		return -ENODEV;
 	}
 
+	if (ctrl->state != NVME_CTRL_LIVE)
+		return 0;
+
 	id = nvme_identify_ns(ctrl, ns->head->ns_id);
 	if (!id)
 		return -ENODEV;
@@ -3437,6 +3443,8 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
 		return -ENOMEM;
 
 	for (i = 0; i < num_lists; i++) {
+		if (ctrl->state != NVME_CTRL_LIVE)
+			goto free;
 		ret = nvme_identify_ns_list(ctrl, prev, ns_list);
 		if (ret)
 			goto free;
@@ -3515,6 +3523,8 @@ static void nvme_scan_work(struct work_struct *work)
 	if (test_and_clear_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events)) {
 		dev_info(ctrl->device, "rescanning namespaces.\n");
 		nvme_clear_changed_ns_log(ctrl);
+		if (ctrl->state != NVME_CTRL_LIVE)
+			return;
 	}
 
 	if (nvme_identify_ctrl(ctrl, &id))
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 9b497d785ed7..1c6aa21858b2 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2909,6 +2909,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
 
 	__nvme_fc_terminate_io(ctrl);
 
+	flush_work(&ctrl->ctrl.scan_work);
 	nvme_stop_ctrl(&ctrl->ctrl);
 
 	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f383146e7d0f..c00d59a14a2b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1839,6 +1839,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
+	flush_work(&ctrl->ctrl.scan_work);
 
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
 		/* state change failure should never happen */
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 2b107a1d152b..ee32d9355959 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1872,6 +1872,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
 
 	nvme_stop_ctrl(ctrl);
 	nvme_tcp_teardown_ctrl(ctrl, false);
+	flush_work(&ctrl->scan_work);
 
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
 		/* state change failure is ok if we're in DELETING state */
-- 
2.16.4


* [PATCH 1/2] nvme: Do not remove namespaces during reset
  2019-06-18 10:10 ` [PATCH 1/2] nvme: Do not remove namespaces during reset Hannes Reinecke
@ 2019-06-18 17:30   ` Sagi Grimberg
  2019-06-20  1:22   ` Ming Lei
  1 sibling, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-18 17:30 UTC (permalink / raw)



> When a controller is resetting or reconnecting there is no way
> how we could establish the validity of any given namespace.
> So do not call nvme_ns_remove() during resetting or reconnecting
> and rely on the call to nvme_scan_queue() after reset to fixup
> things.
> 
> Signed-off-by: Hannes Reinecke <hare at suse.com>
> ---
>   drivers/nvme/host/core.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index ba2079d217da..e872591e5fe7 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3358,6 +3358,17 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>   
>   static void nvme_ns_remove(struct nvme_ns *ns)
>   {
> +	/*
> +	 * We cannot make any assumptions about namespaces during
> +	 * reset; in particular we shouldn't attempt to remove them
> +	 * as I/O might still be queued to them.
> +	 * So ignore this call during reset and rely on the
> +	 * rescan after reset to clean up things again.
> +	 */
> +	if (ns->ctrl->state == NVME_CTRL_RESETTING ||
> +	    ns->ctrl->state == NVME_CTRL_CONNECTING)
> +		return;
> +

This conflicts with the commit log; the condition needs to
go to the call site.
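
For illustration only (untested sketch, not a patch from this thread),
moving the same guard to the call site would look roughly like this,
keeping nvme_ns_remove() itself unchanged:

static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
	struct nvme_ns *ns;

	ns = nvme_find_get_ns(ctrl, nsid);
	if (ns) {
		/* do not remove the ns while the controller is resetting
		 * or reconnecting; the post-reset rescan will revisit it
		 */
		if (ns->disk && revalidate_disk(ns->disk) &&
		    ctrl->state != NVME_CTRL_RESETTING &&
		    ctrl->state != NVME_CTRL_CONNECTING)
			nvme_ns_remove(ns);
		nvme_put_ns(ns);
	} else
		nvme_alloc_ns(ctrl, nsid);
}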


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-18 10:10 ` [PATCH 2/2] nvme: flush scan_work when resetting controller Hannes Reinecke
@ 2019-06-18 17:41   ` Sagi Grimberg
  2019-06-19  6:22     ` Hannes Reinecke
  2019-06-20  1:36   ` Ming Lei
  1 sibling, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-18 17:41 UTC (permalink / raw)



> When resetting the controller there is no point whatsoever to
> have a scan run in parallel;

There is also no point of trying to prevent it.

> we cannot access the controller and
> we cannot tell which devices are present and which not.
> Additionally we'll run a scan after reset anyway.
> So flush existing scans before reconnecting, ensuring to
> short-circuit the scan workqueue function if the controller state
> isn't live to avoid lockups.

What is this fixing? can we get a detailed description?

These ctrl->state conditions sprayed around that are not barriered
by anything do not prevent any scan to run in parallel with resets.

Also, do note that resets will unquiesce the admin queue right away
and not when the reset completes (rdma, tcp, fc). The only exception
is pci that will unquiesce as soon as the reset fails.

So can you please clarify the problem?


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-18 17:41   ` Sagi Grimberg
@ 2019-06-19  6:22     ` Hannes Reinecke
  2019-06-19 16:56       ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-19  6:22 UTC (permalink / raw)


On 6/18/19 7:41 PM, Sagi Grimberg wrote:
> 
>> When resetting the controller there is no point whatsoever to
>> have a scan run in parallel;
> 
> There is also no point of trying to prevent it.
> 
Oh, I don't mind having a scan running in parallel once reset is
underway; that will be properly short-circuited anyway.
What I object to is having scanning and resetting _starting_ at the same
time, as then we incur all sorts of race conditions.

Like this one :-)

>> we cannot access the controller and
>> we cannot tell which devices are present and which not.
>> Additionally we'll run a scan after reset anyway.
>> So flush existing scans before reconnecting, ensuring to
>> short-circuit the scan workqueue function if the controller state
>> isn't live to avoid lockups.
> 
> What is this fixing? can we get a detailed description?
> 
> These ctrl->state conditions sprayed around that are not barriered
> by anything do not prevent any scan to run in parallel with resets.
> 
The 'ctrl->state' conditions are required to avoid I/O to be stalled
during reset. Problem here is that I/O will be held during reset, and
resubmitted once reset is finished.

The transport drivers work around this problem to terminate any I/O
prior to resetting the controller, but this will only cover I/O which
had been pending _before_ reset was called.
Any I/O being started after that will still be held, and we would
live-lock trying to flush the scan thread.
So essentially we guard any I/O with the state check, and short circuit
it if we find ourselves in reset.
There is no need to protect the state check, as we will be aborting I/O
when a state change to 'RESETTING' happens, and the update should be
visible to all threads after that.

If you want I can separate this into two patches, one with flushing the
scan function and one with the state check guards, and detail out the
reasoning here.

> Also, do note that resets will unquiesce the admin queue right away
> and not when the reset completes (rdma, tcp, fc). The only exception
> is pci that will unquiesce as soon as the reset fails.
> 
> So can you please clarify the problem?
> 
Sure. During failover testing NetApp had been seeing several crashes like:

[67088.344034] WARNING: CPU: 4 PID: 25020 at
../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
[67088.344035] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5
auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet
iscsi_ibft iscsi_boot_sysfs xfs libcrc32c mgag200 ttm drm_kms_helper drm
drm_panel_orientation_quirks iTCO_wdt syscopyarea iTCO_vendor_support
sysfillrect igb sysimgblt fb_sys_fops be2net ptp pps_core i2c_algo_bit
dca intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp ipmi_ssif
coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel pcbc
aesni_intel aes_x86_64 crypto_simd glue_helper lpc_ich ipmi_si cryptd
button mfd_core i2c_i801 pcspkr ipmi_devintf ipmi_msghandler ac shpchp
btrfs xor raid6_pq sd_mod lpfc isci(X) ahci libsas nvmet_fc libahci
nvmet ehci_pci scsi_transport_sas ehci_hcd crc32c_intel configfs libata
usbcore nvme_fc nvme_fabrics nvme_core
[67088.344079]  scsi_transport_fc sg scsi_mod autofs4
[67088.344082] Supported: No, Unreleased kernel
[67088.344085] CPU: 4 PID: 25020 Comm: kworker/u49:35 Tainted: G
           4.12.14-2.g1a155c2-default #1 SLE15 (unreleased)
[67088.344085] Hardware name: FUJITSU PRIMERGY RX200 S7/D3032-A1, BIOS
V4.6.5.1 R1.8.0 for D3032-A1x 05/25/2012
[67088.344092] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[67088.344094] task: ffff90a5dfa6c180 task.stack: ffff9e138d3d8000
[67088.344096] RIP: 0010:percpu_ref_kill_and_confirm+0x7a/0xa0
[67088.344097] RSP: 0018:ffff9e138d3dbc00 EFLAGS: 00010086
[67088.344099] RAX: 0000000000000055 RBX: ffff90a60d1f3ac0 RCX:
ffffffff9c066e08
[67088.344100] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
0000000000000093
[67088.344100] RBP: 0000000000000246 R08: 0000000000035601 R09:
ffffffff9c83a8e0
[67088.344101] R10: ffff9e138633be98 R11: 0000000000000000 R12:
0000000000000000
[67088.344102] R13: ffff90a61e813100 R14: 0000000000000009 R15:
0000000000000200
[67088.344103] FS:  0000000000000000(0000) GS:ffff90a61fb00000(0000)
knlGS:0000000000000000
[67088.344104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67088.344105] CR2: 00007fc607bb2860 CR3: 000000041700a004 CR4:
00000000000606e0
[67088.344106] Call Trace:
[67088.344112]  blk_freeze_queue_start+0x2a/0x40
[67088.344114]  blk_freeze_queue+0xe/0x40
[67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
[67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
[67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
[67088.344127]  ? blk_mq_get_tag+0xa3/0x220
[67088.344130]  revalidate_disk+0x23/0xc0
[67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
[67088.344137]  ? wake_up_q+0x70/0x70
[67088.344139]  ? blk_mq_free_request+0x12a/0x160
[67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
[67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
[67088.344149]  process_one_work+0x1da/0x400
[67088.344150]  worker_thread+0x2b/0x3f0
[67088.344152]  ? process_one_work+0x400/0x400

The underlying reason is an imbalance in nvme namespace refcounting:
nvme_ns_remove() will tear down the queue first, before removing it from
the list of namespaces:

static void nvme_ns_remove(struct nvme_ns *ns)
{
	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
		return;

	nvme_fault_inject_fini(ns);
	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
		del_gendisk(ns->disk);
		blk_cleanup_queue(ns->queue);
		if (blk_get_integrity(ns->disk))
			blk_integrity_unregister(ns->disk);
	}

	mutex_lock(&ns->ctrl->subsys->lock);
	list_del_rcu(&ns->siblings);
	nvme_mpath_clear_current_path(ns);
	mutex_unlock(&ns->ctrl->subsys->lock);

	down_write(&ns->ctrl->namespaces_rwsem);
	list_del_init(&ns->list);
	up_write(&ns->ctrl->namespaces_rwsem);


	synchronize_srcu(&ns->head->srcu);
	nvme_mpath_check_last_path(ns);
	nvme_put_ns(ns);
}

when scanning is active during reset all namespaces are found to be
unresponsive, and will be consequently removed:

static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
	struct nvme_ns *ns;

	ns = nvme_find_get_ns(ctrl, nsid);
	if (ns) {
		if (ns->disk && revalidate_disk(ns->disk))
			nvme_ns_remove(ns);
		nvme_put_ns(ns);
	} else
		nvme_alloc_ns(ctrl, nsid);
}

However, as the controller is resetting it will call
nvme_stop_queues() at the start of the reset, and nvme_start_queues()
once reset is complete.

Both iterate the namespace list like

void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_quiesce_queue(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}
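
The unquiesce side, nvme_start_queues(), is symmetric (quoted from
memory for completeness, same era of core.c):

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_unquiesce_queue(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}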

but this will only synchronize with the very last step of nvme_ns_remove().
At that time blk_cleanup_queue() had already been called, and
consequently we crash in 'blk_mq_quiesce_queue()' (or
'blk_mq_unquiesce_queue()', depending on the timing).

This imbalance in nvme_ns_remove() is one issue which Ming Lei tried to
fix up with his patchset 'blk-mq: fix races related with freezing queue'.
However, he dropped the final patch
"nvme: hold request queue's refcount in ns's whole lifetime"
on request from hch, as he felt that blk_cleanup_queue() was unsafe in
general, and should include the final put for the request queue
itself.
Sadly, this was precisely the patch for fixing this issue :-(

So instead of waiting for the big rewrite to happen I tried to approach
things from the other end, namely stopping the race from happening in the
first place.

And with these two patches the crashes are gone, so I thought I'd share
them here in case there is interest.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare at suse.com			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-19  6:22     ` Hannes Reinecke
@ 2019-06-19 16:56       ` Sagi Grimberg
  2019-06-19 18:45         ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-19 16:56 UTC (permalink / raw)



>>> When resetting the controller there is no point whatsoever to
>>> have a scan run in parallel;
>>
>> There is also no point of trying to prevent it.
>>
> Oh, I don't mind having a scan running in parallel once reset is
> underway; that will be properly short-circuited anyway.
> What I object to is having scanning and resetting _starting_ at the same
> time, as then we incur all sorts of race conditions.

But your patch does not inherently protect against it, these state
conditions sprinkled can easily race the reset.

Note that I do agree with patch #1, but I don't understand how patch
#2 is fixing the problem other than narrowing the window at best.

> 
> Like this one :-)
> 
>>> we cannot access the controller and
>>> we cannot tell which devices are present and which not.
>>> Additionally we'll run a scan after reset anyway.
>>> So flush existing scans before reconnecting, ensuring to
>>> short-circuit the scan workqueue function if the controller state
>>> isn't live to avoid lockups.
>>
>> What is this fixing? can we get a detailed description?
>>
>> These ctrl->state conditions sprayed around that are not barriered
>> by anything do not prevent any scan to run in parallel with resets.
>>
> The 'ctrl->state' conditions are required to avoid I/O to be stalled
> during reset.

Not sure what you mean by stall, are you referring to the case that
it takes a long time to reconnect?

The solution is to make sure we drain the inflight I/O, not half-way
trying to prevent them from happening.

> Problem here is that I/O will be held during reset, and
> resubmitted once reset is finished.
> 
> The transport drivers work around this problem to terminate any I/O
> prior to resetting the controller,

Its not a work around.

> but this will only cover I/O which
> had been pending _before_ reset was called.
> Any I/O being started after that will still be held, and we would
> live-lock trying to flush the scan thread.

But resets do not flush the scan thread (before you restored it).

> So essentially we guard any I/O with the state check, and short circuit
> it if we find ourselves in reset.

As I mentioned, patch #1 is fine. Please explain what patch #2 is
exactly fixing.

> There is no need to protect the state check, as we will be aborting I/O
> when a state change to 'RESETTING' happens, and the update should be
> visible to all threads after that.

I think this assumption is incorrect. Nothing guarantees when/who will
have visibility to the state change.

> If you want I can separate this into two patches, one with flushing the
> scan function and one with the state check guards, and detail out the
> reasoning here.

No need to split; if it's needed, it needs to be in a single patch.
But again, I want to understand after patch #1 is applied, what are the
symptoms and why patch #2 is solving them.

>> Also, do note that resets will unquiesce the admin queue right away
>> and not when the reset completes (rdma, tcp, fc). The only exception
>> is pci that will unquiesce as soon as the reset fails.
>>
>> So can you please clarify the problem?
>>
> Sure. During failover testing NetApp had been seeing several crashes like:
> 
> [67088.344034] WARNING: CPU: 4 PID: 25020 at
> ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
> [67088.344035] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5
> auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet
> iscsi_ibft iscsi_boot_sysfs xfs libcrc32c mgag200 ttm drm_kms_helper drm
> drm_panel_orientation_quirks iTCO_wdt syscopyarea iTCO_vendor_support
> sysfillrect igb sysimgblt fb_sys_fops be2net ptp pps_core i2c_algo_bit
> dca intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp ipmi_ssif
> coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel pcbc
> aesni_intel aes_x86_64 crypto_simd glue_helper lpc_ich ipmi_si cryptd
> button mfd_core i2c_i801 pcspkr ipmi_devintf ipmi_msghandler ac shpchp
> btrfs xor raid6_pq sd_mod lpfc isci(X) ahci libsas nvmet_fc libahci
> nvmet ehci_pci scsi_transport_sas ehci_hcd crc32c_intel configfs libata
> usbcore nvme_fc nvme_fabrics nvme_core
> [67088.344079]  scsi_transport_fc sg scsi_mod autofs4
> [67088.344082] Supported: No, Unreleased kernel
> [67088.344085] CPU: 4 PID: 25020 Comm: kworker/u49:35 Tainted: G
>             4.12.14-2.g1a155c2-default #1 SLE15 (unreleased)
> [67088.344085] Hardware name: FUJITSU PRIMERGY RX200 S7/D3032-A1, BIOS
> V4.6.5.1 R1.8.0 for D3032-A1x 05/25/2012
> [67088.344092] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [67088.344094] task: ffff90a5dfa6c180 task.stack: ffff9e138d3d8000
> [67088.344096] RIP: 0010:percpu_ref_kill_and_confirm+0x7a/0xa0
> [67088.344097] RSP: 0018:ffff9e138d3dbc00 EFLAGS: 00010086
> [67088.344099] RAX: 0000000000000055 RBX: ffff90a60d1f3ac0 RCX:
> ffffffff9c066e08
> [67088.344100] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
> 0000000000000093
> [67088.344100] RBP: 0000000000000246 R08: 0000000000035601 R09:
> ffffffff9c83a8e0
> [67088.344101] R10: ffff9e138633be98 R11: 0000000000000000 R12:
> 0000000000000000
> [67088.344102] R13: ffff90a61e813100 R14: 0000000000000009 R15:
> 0000000000000200
> [67088.344103] FS:  0000000000000000(0000) GS:ffff90a61fb00000(0000)
> knlGS:0000000000000000
> [67088.344104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [67088.344105] CR2: 00007fc607bb2860 CR3: 000000041700a004 CR4:
> 00000000000606e0
> [67088.344106] Call Trace:
> [67088.344112]  blk_freeze_queue_start+0x2a/0x40
> [67088.344114]  blk_freeze_queue+0xe/0x40
> [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
> [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
> [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
> [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
> [67088.344130]  revalidate_disk+0x23/0xc0
> [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
> [67088.344137]  ? wake_up_q+0x70/0x70
> [67088.344139]  ? blk_mq_free_request+0x12a/0x160
> [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
> [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
> [67088.344149]  process_one_work+0x1da/0x400
> [67088.344150]  worker_thread+0x2b/0x3f0
> [67088.344152]  ? process_one_work+0x400/0x400
> 
> The underlying reason is an imbalance in nvme namespace refcouting:
> nvme_ns_remove() will teardown the queue first, before removing it from
> the list of namespaces:
> 
> static void nvme_ns_remove(struct nvme_ns *ns)
> {
> 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
> 		return;
> 
> 	nvme_fault_inject_fini(ns);
> 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
> 		del_gendisk(ns->disk);
> 		blk_cleanup_queue(ns->queue);
> 		if (blk_get_integrity(ns->disk))
> 			blk_integrity_unregister(ns->disk);
> 	}
> 
> 	mutex_lock(&ns->ctrl->subsys->lock);
> 	list_del_rcu(&ns->siblings);
> 	nvme_mpath_clear_current_path(ns);
> 	mutex_unlock(&ns->ctrl->subsys->lock);
> 
> 	down_write(&ns->ctrl->namespaces_rwsem);
> 	list_del_init(&ns->list);
> 	up_write(&ns->ctrl->namespaces_rwsem);
> 
> 
> 	synchronize_srcu(&ns->head->srcu);
> 	nvme_mpath_check_last_path(ns);
> 	nvme_put_ns(ns);
> }
> 
> when scanning is active during reset all namespaces are found to be
> unresponsive, and will be consequently removed:
> 
> static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> {
> 	struct nvme_ns *ns;
> 
> 	ns = nvme_find_get_ns(ctrl, nsid);
> 	if (ns) {
> 		if (ns->disk && revalidate_disk(ns->disk))
> 			nvme_ns_remove(ns);
> 		nvme_put_ns(ns);
> 	} else
> 		nvme_alloc_ns(ctrl, nsid);
> }
> 

Which is what patch #1 is for right?

> However, as the controller is resetting it will call
> nvme_stop_queues() at the start of the reset, and nvme_start_queues()
> once reset is complete.
> 
> Both iterate the namespace list like
> 
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
> 	struct nvme_ns *ns;
> 
> 	down_read(&ctrl->namespaces_rwsem);
> 	list_for_each_entry(ns, &ctrl->namespaces, list)
> 		blk_mq_quiesce_queue(ns->queue);
> 	up_read(&ctrl->namespaces_rwsem);
> }
> 
> but this will only synchronize with the very last step of nvme_ns_remove().
> At that time blk_cleanup_queue() already had been called, and
> consequently we crash in 'blk_mq_quiesce_queue()' (or
> 'blk_mq_unquiesce_queue()', depending on the timing).

I sent a patch that moves the rcu sync.
Please see:
[PATCH] nvme: fix possible io failures when removing multipathed ns

We can have an incremental patch that moves the ns removal from
ctrl->namespaces also before.

> 
> This imbalance in nvme_ns_remove() is one issue which Ming Lei tried to
> fix up with his patchset 'blk-mq: fix races related with freezing queue'.
> However, he dropped the final patch
> "nvme: hold request queue's refcount in ns's whole lifetime"
> on request from hch, as he felt that blk_cleanup_queue() was unsafe in
> general, and should need to include the final put for the request queue
> itself.
> Sadly, this was precisely the patch for fixing this issue :-(
> 
> So instead of waiting for the big rewrite to happen I tried to approach
> things from the other end, namely stopping the race to happen in the
> first place.
> 
> And with these two patches the crashes are gone, so I thought to share
> it here if there were interest.

There is interest in fixing the crash, no doubt. But I just want to
understand if this is the correct fix.


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-19 16:56       ` Sagi Grimberg
@ 2019-06-19 18:45         ` Hannes Reinecke
  2019-06-19 20:04           ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-19 18:45 UTC (permalink / raw)


On 6/19/19 6:56 PM, Sagi Grimberg wrote:
> 
>>>> When resetting the controller there is no point whatsoever to
>>>> have a scan run in parallel;
>>>
>>> There is also no point of trying to prevent it.
>>>
>> Oh, I don't mind having a scan running in parallel once reset is
>> underway; that will be properly short-circuited anyway.
>> What I object to is having scanning and resetting _starting_ at the same
>> time, as then we incur all sorts of race conditions.
> 
> But your patch does not inherently protect against it, these state
> conditions sprinkled can easily race the reset.
> 
The reasoning is:
- (a) I/O is started from the scan thread
- (b) reset triggers
- (b) reset aborts I/O
- (a) I/O returns with an error
- (a) checks the state and skips remaining I/O

> Note that I do agree with patch #1, but I don't understand how patch
> #2 is fixing the problem other than narrowing the window at best.
> 
Only if the state is not propagated to the scan thread.
My reasoning is that the thread will have a context switch (as it's 
doing I/O), and hence the state change will be propagated.

>>
>> Like this one :-)
>>
>>>> we cannot access the controller and
>>>> we cannot tell which devices are present and which not.
>>>> Additionally we'll run a scan after reset anyway.
>>>> So flush existing scans before reconnecting, ensuring to
>>>> short-circuit the scan workqueue function if the controller state
>>>> isn't live to avoid lockups.
>>>
>>> What is this fixing? can we get a detailed description?
>>>
>>> These ctrl->state conditions sprayed around that are not barriered
>>> by anything do not prevent any scan to run in parallel with resets.
>>>
>> The 'ctrl->state' conditions are required to avoid I/O to be stalled
>> during reset.
> 
> Not sure what you mean by stall, are you referring to the case that
> it takes a long time to reconnect?
> 
The stall is here (to stick with the above example):
- (a) scan thread issues I/O
- (b) reset triggers
- (b) reset aborts all I/O
- (a) I/O returns with an error
- (a) scan thread issues next I/O
- (b) flushes scan thread
- (a) I/O is held by the core as the controller is in reset
- (b) flush hangs

> The solution is to make sure we drain the inflight I/O, not half-way
> trying to prevent them from happening.
> 
See above. The problem is that we're draining I/O only once, but the
scan thread is issuing _several_ I/Os, of which only _one_ will be aborted.

>> Problem here is that I/O will be held during reset, and
>> resubmitted once reset is finished.
>>
>> The transport drivers work around this problem to terminate any I/O
>> prior to resetting the controller,
> 
> Its not a work around.
> 
See above. Not a workaround, but not sufficient to let the scan thread
make progress, either.

>> but this will only cover I/O which
>> had been pending _before_ reset was called.
>> Any I/O being started after that will still be held, and we would
>> live-lock trying to flush the scan thread.
> 
> But resets do not flush the scan thread (before you restored it).
> 
Yes. But the scan thread will tear down namespaces when running in 
parallel with resets, and then we're running into a race condition.

>> So essentially we guard any I/O with the state check, and short circuit
>> it if we find ourselves in reset.
> 
> As I mentioned, patch #1 is fine. Please explain what patch #2 is
> exactly fixing.
> 
Yes, I'm trying to :-)

>> There is no need to protect the state check, as we will be aborting I/O
>> when a state change to 'RESETTING' happens, and the update should be
>> visible to all threads after that.
> 
> I think this assumption is incorrect. Nothing guarantees when/who will
> have visibility to the state change.
> 
Which actually might be true, but we haven't seen the original issue 
anymore after these patches. So at the very least the race window 
becomes really small :-)

>> If you want I can separate this into two patches, one with flushing the
>> scan function and one with the state check guards, and detail out the
>> reasoning here.
> 
> No need to split, if its needed, it needs to be in a single patch.
> But again, I want to understand after patch #1 is applied, what are the
> symptoms and why patch #2 is solving them.
> 
>>> Also, do note that resets will unquiesce the admin queue right away
>>> and not when the reset completes (rdma, tcp, fc). The only exception
>>> is pci that will unquiesce as soon as the reset fails.
>>>
>>> So can you please clarify the problem?
>>>
>> Sure. During failover testing NetApp had been seeing several crashes 
>> like:
>>
[ .. ]
>> when scanning is active during reset all namespaces are found to be
>> unresponsive, and will be consequently removed:
>>
>> static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>> {
>> 	struct nvme_ns *ns;
>>
>> 	ns = nvme_find_get_ns(ctrl, nsid);
>> 	if (ns) {
>> 		if (ns->disk && revalidate_disk(ns->disk))
>> 			nvme_ns_remove(ns);
>> 		nvme_put_ns(ns);
>> 	} else
>> 		nvme_alloc_ns(ctrl, nsid);
>> }
>>
> 
> Which is what patch #1 is for right?
> 
That's what I thought initially, too, but it turned out to be not 
sufficient.

>> However, as the controller is resetting it will call
>> nvme_stop_queues() at the start of the reset, and nvme_start_queues()
>> once reset is complete.
>>
>> Both iterate the namespace list like
>>
>> void nvme_stop_queues(struct nvme_ctrl *ctrl)
>> {
>> ????struct nvme_ns *ns;
>>
>> ????down_read(&ctrl->namespaces_rwsem);
>> ????list_for_each_entry(ns, &ctrl->namespaces, list)
>> ??????? blk_mq_quiesce_queue(ns->queue);
>> ????up_read(&ctrl->namespaces_rwsem);
>> }
>>
>> but this will only synchronize with the very last step of 
>> nvme_ns_remove().
>> At that time blk_cleanup_queue() already had been called, and
>> consequently we crash in 'blk_mq_quiesce_queue()' (or
>> 'blk_mq_unquiesce_queue()', depending on the timing).
> 
> I sent a patch that moves the rcu sync.
> Please see:
> [PATCH] nvme: fix possible io failures when removing multipathed ns
> 
> We can have an incremental patch that moves the ns removal from
> ctrl->namespaces also before.
> 
Again, I'm not sure if that's sufficient.
>>
>> This imbalance in nvme_ns_remove() is one issue which Ming Lei tried to
>> fix up with his patchset 'blk-mq: fix races related with freezing queue'.
>> However, he dropped the final patch
>> "nvme: hold request queue's refcount in ns's whole lifetime"
>> on request from hch, as he felt that blk_cleanup_queue() was unsafe in
>> general, and should need to include the final put for the request queue
>> itself.
>> Sadly, this was precisely the patch for fixing this issue :-(
>>
>> So instead of waiting for the big rewrite to happen I tried to approach
>> things from the other end, namely stopping the race to happen in the
>> first place.
>>
>> And with these two patches the crashes are gone, so I thought to share
>> it here if there were interest.
> 
> There is interest in fixing the crash, no doubt. But I just want to
> understand if this is the correct fix.

Ultimately, it's the imbalance in nvme_ns_remove() which is causing this 
crash.
nvme_ns_remove() will call blk_cleanup_queue() _before_ removing the 
namespace from the list, which means that any call to nvme_stop_queues() 
or nvme_start_queues() happening between blk_cleanup_queue() and 
list_del() will be hitting this issue.
And without an explicit flush of the scan thread we cannot avoid this.
That leaves us with the choice of either rebalancing nvme_ns_remove() 
(which is what the patchset from Ming Lei tried to do) or flushing the 
scan thread, which is what I have been doing.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare at suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-19 18:45         ` Hannes Reinecke
@ 2019-06-19 20:04           ` Sagi Grimberg
  2019-06-21 16:26             ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-19 20:04 UTC (permalink / raw)



>>>>> When resetting the controller there is no point whatsoever to
>>>>> have a scan run in parallel;
>>>>
>>>> There is also no point of trying to prevent it.
>>>>
>>> Oh, I don't mind having a scan running in parallel once reset is
>>> underway; that will be properly short-circuited anyway.
>>> What I object to is having scanning and resetting _starting_ at the same
>>> time, as then we incur all sorts of race conditions.
>>
>> But your patch does not inherently protect against it, these state
>> conditions sprinkled can easily race the reset.
>>
> The reasoning is:
> - (a) I/O is started from the scan thread
> - (b) reset triggers
> - (b) reset abort I/O
> - (a) I/O returns with an error
> - (a) checks the state and skips remaining I/O
> 
>> Note that I do agree with patch #1, but I don't understand how patch
>> #2 is fixing the problem other than narrowing the window at best.
>>
> Only if the state is not propagated to the scan thread.
> My reasoning is that the thread will have a context switch (as it's 
> doing I/O), and hence the state change will be propagated.

Still not guaranteed.


>> Not sure what you mean by stall, are you referring to the case that
>> it takes a long time to reconnect?
>>
> The stall is here (to stick with the above example):
> - (a) scan thread issues I/O
> - (b) reset triggers
> - (b) reset aborts all I/O
> - (a) I/O returns with an error
> - (a) scan thread issues next I/O
> - (b) flushes scan thread

But resets do not flush the scan thread (before your patch). That
is what I'm failing to understand. The scan work is only flushed
in nvme_remove_namespaces().

> - (a) I/O is held by the core as the controller is in reset
> - (b) flush hangs
> 
>> The solution is to make sure we drain the inflight I/O, not half-way
>> trying to prevent them from happening.
>>
> See above. Ths problem is that we're draining I/O only once, but the 
> scan thread is issuing _several_ I/O, of which only _one_ will be aborted.

But with your patch #1 it will bail if it failed and the state was 
propagated.

>>> Problem here is that I/O will be held during reset, and
>>> resubmitted once reset is finished.
>>>
>>> The transport drivers work around this problem to terminate any I/O
>>> prior to resetting the controller,
>>
>> Its not a work around.
>>
> See above. Not a workaound, but not sufficient to let the scan thread 
> making progress, either.
> 
>>> but this will only cover I/O which
>>> had been pending _before_ reset was called.
>>> Any I/O being started after that will still be held, and we would
>>> live-lock trying to flush the scan thread.
>>
>> But resets do not flush the scan thread (before you restored it).
>>
> Yes. But the scan thread will tear down namespaces when running in 
> parallel with resets, and then we're running into a race condition.

But where is the "hang" you are referring to which flushes the scan
thread?

>> But again, I want to understand after patch #1 is applied, what are the
>> symptoms and why patch #2 is solving them.

Can we take a step back here? I think we are mixing two different issues
here. In your trace you refer to a use-after-free panic, while this
patch is trying to solve a "hang" condition.

Can we understand what is the phenomenon after patch #1 is applied?

>>> static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>> {
>>> 	struct nvme_ns *ns;
>>>
>>> 	ns = nvme_find_get_ns(ctrl, nsid);
>>> 	if (ns) {
>>> 		if (ns->disk && revalidate_disk(ns->disk))
>>> 			nvme_ns_remove(ns);
>>> 		nvme_put_ns(ns);
>>> 	} else
>>> 		nvme_alloc_ns(ctrl, nsid);
>>> }
>>>
>>
>> Which is what patch #1 is for right?
>>
> That's what I thought initially, too, but it turned out to be not 
> sufficient.

Not sufficient because it hangs? or panics?

>> We can have an incremental patch that moves the ns removal from
>> ctrl->namespaces also before.
>>
> Again, I'm not sure if that's sufficient.

[...]

> Ultimately, it's the imbalance in nvme_ns_remove() which is causing this 
> crash.
> nvme_ns_remove() will call blk_cleanup_queue() _before_ removing the 
> namespace from the list.

This is what I suggest as an incremental change to the patch I
referenced. blk_cleanup_queue() is called after the ns is removed
from the list.
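
As a rough, untested sketch of that reordering, based on the
nvme_ns_remove() quoted earlier in this thread (only the order of the
steps changes, the steps themselves stay the same):

static void nvme_ns_remove(struct nvme_ns *ns)
{
	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
		return;

	nvme_fault_inject_fini(ns);

	/* unlink the ns from the subsystem and controller lists first ... */
	mutex_lock(&ns->ctrl->subsys->lock);
	list_del_rcu(&ns->siblings);
	nvme_mpath_clear_current_path(ns);
	mutex_unlock(&ns->ctrl->subsys->lock);

	down_write(&ns->ctrl->namespaces_rwsem);
	list_del_init(&ns->list);
	up_write(&ns->ctrl->namespaces_rwsem);

	synchronize_srcu(&ns->head->srcu);
	nvme_mpath_check_last_path(ns);

	/* ... so nvme_stop_queues()/nvme_start_queues() can no longer see
	 * this queue by the time it is torn down here
	 */
	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
		del_gendisk(ns->disk);
		blk_cleanup_queue(ns->queue);
		if (blk_get_integrity(ns->disk))
			blk_integrity_unregister(ns->disk);
	}

	nvme_put_ns(ns);
}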

> which means that any call to nvme_stop_queues() or nvme_start_queues() 
> happening between blk_cleanup_queue() and list_del() will be hitting 
> this issue.

This is the use-after-free again right?

> And without an explicit flush of the scan thread we cannot avoid this.
> Leaving us with the choice of either rebalance nvme_ns_remove() (which 
> is what the patchset from Ming Lei tried to do) or we flush the scan 
> thread, which is what I have been doing.

You may be right, I just need a little help getting there as so far I'm
not convinced that there is no other way. I'm also not 100% clear
on what is the phenomenon after patch #1 is applied, is it a crash or
a hang?


* [PATCH 1/2] nvme: Do not remove namespaces during reset
  2019-06-18 10:10 ` [PATCH 1/2] nvme: Do not remove namespaces during reset Hannes Reinecke
  2019-06-18 17:30   ` Sagi Grimberg
@ 2019-06-20  1:22   ` Ming Lei
  1 sibling, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-06-20  1:22 UTC (permalink / raw)


Hi Hannes,

Could you explain a bit which user-visible issue is addressed by this
patch?

On Tue, Jun 18, 2019@12:10:24PM +0200, Hannes Reinecke wrote:
> When a controller is resetting or reconnecting there is no way
> how we could establish the validity of any given namespace.
> So do not call nvme_ns_remove() during resetting or reconnecting
> and rely on the call to nvme_scan_queue() after reset to fixup
> things.
> 
> Signed-off-by: Hannes Reinecke <hare at suse.com>
> ---
>  drivers/nvme/host/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index ba2079d217da..e872591e5fe7 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3358,6 +3358,17 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>  
>  static void nvme_ns_remove(struct nvme_ns *ns)
>  {
> +	/*
> +	 * We cannot make any assumptions about namespaces during
> +	 * reset; in particular we shouldn't attempt to remove them
> +	 * as I/O might still be queued to them.
> +	 * So ignore this call during reset and rely on the
> +	 * rescan after reset to clean up things again.
> +	 */
> +	if (ns->ctrl->state == NVME_CTRL_RESETTING ||
> +	    ns->ctrl->state == NVME_CTRL_CONNECTING)
> +		return;
> +
>  	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
>  		return;

nvme_ns_remove() may be called from nvme_remove_invalid_namespaces()
and nvme_remove_namespaces(), in which the 'ns' to be removed is
retrieved & deleted from ctrl->namespaces.

That means if the 'ns' needs to be removed by the two mentioned
functions from scan work context again after reset is done, the removal
may never be done because the 'ns' can't be found in ctrl->namespaces.

If you want to avoid the use-after-free issue[1], I'd suggest to use
the queue refcount.

https://lore.kernel.org/linux-block/20190424110221.17435-10-ming.lei at redhat.com/
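
For reference, the idea in that dropped patch was roughly the following
(sketch from memory, not the actual patch; the error label is only
illustrative): take an extra reference on the request queue when the
namespace is created and drop it only when the ns is freed, so the
queue stays valid for the whole ns lifetime even after
blk_cleanup_queue() has run:

	/* in nvme_alloc_ns(), once ns->queue has been created: */
	if (!blk_get_queue(ns->queue))
		goto out_free_queue;	/* illustrative error label */

	/* in nvme_free_ns(), before freeing the ns: */
	blk_put_queue(ns->queue);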

Thanks,
Ming


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-18 10:10 ` [PATCH 2/2] nvme: flush scan_work when resetting controller Hannes Reinecke
  2019-06-18 17:41   ` Sagi Grimberg
@ 2019-06-20  1:36   ` Ming Lei
  2019-06-21  6:14     ` Hannes Reinecke
  1 sibling, 1 reply; 28+ messages in thread
From: Ming Lei @ 2019-06-20  1:36 UTC (permalink / raw)


On Tue, Jun 18, 2019@12:10:25PM +0200, Hannes Reinecke wrote:
> When resetting the controller there is no point whatsoever to
> have a scan run in parallel; we cannot access the controller and

scan won't be run in parallel, because .scan_work is embedded in
'struct nvme_ctrl' which is per-HBA.

> we cannot tell which devices are present and which not.
> Additionally we'll run a scan after reset anyway.
> So flush existing scans before reconnecting, ensuring to
> short-circuit the scan workqueue function if the controller state
> isn't live to avoid lockups.

This way may cause dead-lock.

1) nvme_revalidate_disk() might freeze queue in flush context, however
any in-flight requests won't be completed until reset is done, so
deadlock may be caused by flushing scans in reset context.

2) sync IO may be involved in revalidate_disk() which is called in
scan context, so deadlock is caused for same reason with 1).


Thanks,
Ming


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-20  1:36   ` Ming Lei
@ 2019-06-21  6:14     ` Hannes Reinecke
  2019-06-21  6:58       ` Ming Lei
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-21  6:14 UTC (permalink / raw)


On 6/20/19 3:36 AM, Ming Lei wrote:
> On Tue, Jun 18, 2019@12:10:25PM +0200, Hannes Reinecke wrote:
>> When resetting the controller there is no point whatsoever to
>> have a scan run in parallel; we cannot access the controller and
> 
> scan won't be run in parallel, because .scan_work is embedded in
> 'struct nvme_ctrl' which is per-HBA.
> 
Wrong. We do.
Not sure why having it embedded in the controller structure might
prevent this from happening; both reset and scan are embedded, but
running on different queues:

void nvme_queue_scan(struct nvme_ctrl *ctrl)
{
	/*
	 * Only new queue scan work when admin and IO queues are both alive
	 */
	if (ctrl->state == NVME_CTRL_LIVE)
		queue_work(nvme_wq, &ctrl->scan_work);
}
EXPORT_SYMBOL_GPL(nvme_queue_scan);

int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
		return -EBUSY;
	return 0;
}
EXPORT_SYMBOL_GPL(nvme_reset_ctrl);

So there's nothing stopping them from running in parallel.

>> we cannot tell which devices are present and which not.
>> Additionally we'll run a scan after reset anyway.
>> So flush existing scans before reconnecting, ensuring to
>> short-circuit the scan workqueue function if the controller state
>> isn't live to avoid lockups.
> 
> This way may cause dead-lock.
> 
> 1) nvme_revalidate_disk() might freeze queue in flush context, however
> any in-flight requests won't be completed until reset is done, so
> deadlock may be caused by flushing scans in reset context.
> 
Which is why I'm checking the controller state; I've observed the
deadlock plenty of times before introducing the controller state check.

> 2) sync IO may be involved in revalidate_disk() which is called in
> scan context, so deadlock is caused for same reason with 1).
> 
I/O during revalidate_disk() is protected by the state check, too, so we
won't be issuing any I/O during resetting.

To be precise: any I/O in flight when reset is triggered will be
terminated, and any subsequent I/O is short-circuited by the state check.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-21  6:14     ` Hannes Reinecke
@ 2019-06-21  6:58       ` Ming Lei
  2019-06-21  7:59         ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Ming Lei @ 2019-06-21  6:58 UTC (permalink / raw)


On Fri, Jun 21, 2019@08:14:45AM +0200, Hannes Reinecke wrote:
> On 6/20/19 3:36 AM, Ming Lei wrote:
> > On Tue, Jun 18, 2019@12:10:25PM +0200, Hannes Reinecke wrote:
> >> When resetting the controller there is no point whatsoever to
> >> have a scan run in parallel; we cannot access the controller and
> > 
> > scan won't be run in parallel, because .scan_work is embedded in
> > 'struct nvme_ctrl' which is per-HBA.
> > 
> Wrong. We do.
> Not sure why having it embedded in the controller structure might
> prevent this from happening; both reset and scan are embedded, but
> running on different queues:

I mean the scan_work function itself is run exclusively, but yes, it can be 
run while a reset is in progress.

> 
> void nvme_queue_scan(struct nvme_ctrl *ctrl)
> {
> 	/*
> 	 * Only new queue scan work when admin and IO queues are both alive
> 	 */
> 	if (ctrl->state == NVME_CTRL_LIVE)
> 		queue_work(nvme_wq, &ctrl->scan_work);
> }
> EXPORT_SYMBOL_GPL(nvme_queue_scan);
> 
> int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
> {
> 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> 		return -EBUSY;
> 	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
> 		return -EBUSY;
> 	return 0;
> }
> EXPORT_SYMBOL_GPL(nvme_reset_ctrl);
> 
> So there's nothing stopping them to run in parallel.
> 
> >> we cannot tell which devices are present and which not.
> >> Additionally we'll run a scan after reset anyway.
> >> So flush existing scans before reconnecting, ensuring to
> >> short-circuit the scan workqueue function if the controller state
> >> isn't live to avoid lockups.
> > 
> > This way may cause dead-lock.
> > 
> > 1) nvme_revalidate_disk() might freeze queue in flush context, however
> > any in-flight requests won't be completed until reset is done, so
> > deadlock may be caused by flushing scans in reset context.
> > 
> Which is why I'm checking the controller state; I've observed the
> deadlock plenty of times before introducing the controller state check.

Your check can't help wrt. the deadlock, for example:

1) in scan work context:

- blk_mq_freeze_queue() is being started after passing the controller
  state check

2) timeout & reset is triggered in another context at the exact same time:

- all in-flight IOs won't be freed until disable controller & reset is done.

- now flush_work() in reset context can't move on, because
  blk_mq_freeze_queue() in scan context can't make progress.

> 
> > 2) sync IO may be involved in revalidate_disk() which is called in
> > scan context, so deadlock is caused for same reason with 1).
> > 
> I/O during revalidate_disk() is protected by the state check, too, so we
> won't be issuing any I/O during resetting.
> 
> To be precise: any I/O in flight when reset is triggered will be
> terminated, and any subsequent I/O is short-circuited by the state check.

No, any I/O in flight before resetting is just terminated from the hardware
side, but is still in the blk-mq sw or scheduler queue, so neither sync IO
nor queue freezing will make progress.

Please see nvme_complete_rq(); all of these IOs will usually be retried.
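
Simplified from memory, the relevant part of nvme_complete_rq() for a
retryable command looks roughly like this (details such as DNR/retry
accounting omitted):

	if (status != BLK_STS_OK && nvme_req_needs_retry(req)) {
		if ((req->cmd_flags & REQ_NVME_MPATH) &&
		    blk_path_error(status)) {
			/* multipath: bios are stolen, request is ended */
			nvme_failover_req(req);
			return;
		}
		if (!blk_queue_dying(req->q)) {
			/* requeued and held while the queue is quiesced */
			nvme_retry_req(req);
			return;
		}
	}
	blk_mq_end_request(req, status);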


Thanks,
Ming


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-21  6:58       ` Ming Lei
@ 2019-06-21  7:59         ` Hannes Reinecke
  2019-06-21 17:23           ` James Smart
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-21  7:59 UTC (permalink / raw)


On 6/21/19 8:58 AM, Ming Lei wrote:
> On Fri, Jun 21, 2019@08:14:45AM +0200, Hannes Reinecke wrote:
>> On 6/20/19 3:36 AM, Ming Lei wrote:
>>> On Tue, Jun 18, 2019@12:10:25PM +0200, Hannes Reinecke wrote:
>>>> When resetting the controller there is no point whatsoever to
>>>> have a scan run in parallel; we cannot access the controller and
>>>
>>> scan won't be run in parallel, because .scan_work is embedded in
>>> 'struct nvme_ctrl' which is per-HBA.
>>>
>> Wrong. We do.
>> Not sure why having it embedded in the controller structure might
>> prevent this from happening; both reset and scan are embedded, but
>> running on different queues:
> 
> I mean the scan_work function itself is run exclusively, but yes, it can be 
> run when resetting is in-progress.
> 
>>
>> void nvme_queue_scan(struct nvme_ctrl *ctrl)
>> {
>> 	/*
>> 	 * Only new queue scan work when admin and IO queues are both alive
>> 	 */
>> 	if (ctrl->state == NVME_CTRL_LIVE)
>> 		queue_work(nvme_wq, &ctrl->scan_work);
>> }
>> EXPORT_SYMBOL_GPL(nvme_queue_scan);
>>
>> int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>> {
>> 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>> 		return -EBUSY;
>> 	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
>> 		return -EBUSY;
>> 	return 0;
>> }
>> EXPORT_SYMBOL_GPL(nvme_reset_ctrl);
>>
>> So there's nothing stopping them to run in parallel.
>>
>>>> we cannot tell which devices are present and which not.
>>>> Additionally we'll run a scan after reset anyway.
>>>> So flush existing scans before reconnecting, ensuring to
>>>> short-circuit the scan workqueue function if the controller state
>>>> isn't live to avoid lockups.
>>>
>>> This way may cause dead-lock.
>>>
>>> 1) nvme_revalidate_disk() might freeze queue in flush context, however
>>> any in-flight requests won't be completed until reset is done, so
>>> deadlock may be caused by flushing scans in reset context.
>>>
>> Which is why I'm checking the controller state; I've observed the
>> deadlock plenty of times before introducing the controller state check.
> 
> Your check can't help wrt. the deadlock, for example:
> 
> 1) in scan work context:
> 
> - blk_mq_freeze_queue() is being started after passing the controller
>   state check
> 
> 2) timeout & reset is triggered in another context at the exact same time:
> 
> - all in-flight IOs won't be freed until disable controller & reset is done.
> 
> - now flush_work() in reset context can't move on, because
>   blk_mq_freeze_queue() in scan context can't make progress.
> 
There might be a difference between RDMA and FC implementations; for FC
we terminate all outstanding I/Os from the HW side, so each I/O will be
returned with an aborted status.
For all tests I (and NetApp :-) did, that was enough to get
'blk_mq_freeze_queue()' unstuck and the flush_work to complete.
We _did_ observe, however, that the state checks are absolutely
critical to this; otherwise we indeed ended up with a stuck flush_work().

>>
>>> 2) sync IO may be involved in revalidate_disk() which is called in
>>> scan context, so deadlock is caused for same reason with 1).
>>>
>> I/O during revalidate_disk() is protected by the state check, too, so we
>> won't be issuing any I/O during resetting.
>>
>> To be precise: any I/O in flight when reset is triggered will be
>> terminated, and any subsequent I/O is short-circuited by the state check.
> 
> No, any I/O in flight before resetting is just terminated from hardware,
> but still in blk-mq sw or scheduler queue, so either sync IO or queue
> freezing won't make progress.
> 
> Please see nvme_complete_rq(), all these IO will be retried usually.
> 
For the multipath case the request will be requeued via
blk_steal_bios(), and the original request will be completed.
So the queue indeed becomes free, and we can continue.
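
For reference, the multipath failover path referred to here is roughly
the following (simplified sketch of nvme_failover_req() from
drivers/nvme/host/multipath.c; ANA/status handling omitted):

	spin_lock_irqsave(&ns->head->requeue_lock, flags);
	blk_steal_bios(&ns->head->requeue_list, req);
	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
	/* the original request is ended here, which is what lets queue
	 * freezing and flush_work() make progress during the reset
	 */
	blk_mq_end_request(req, 0);
	kblockd_schedule_work(&ns->head->requeue_work);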

Things _might_ be different for the non-multipath case, though; I'll
have to check.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-19 20:04           ` Sagi Grimberg
@ 2019-06-21 16:26             ` Sagi Grimberg
  2019-06-24  5:48               ` Hannes Reinecke
  2019-06-24  6:13               ` Hannes Reinecke
  0 siblings, 2 replies; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-21 16:26 UTC (permalink / raw)


Ping?

Hannes, did you get to look into my questions?

On 6/19/19 1:04 PM, Sagi Grimberg wrote:
> 
>>>>>> When resetting the controller there is no point whatsoever to
>>>>>> have a scan run in parallel;
>>>>>
>>>>> There is also no point of trying to prevent it.
>>>>>
>>>> Oh, I don't mind having a scan running in parallel once reset is
>>>> underway; that will be properly short-circuited anyway.
>>>> What I object to is having scanning and resetting _starting_ at the 
>>>> same
>>>> time, as then we incur all sorts of race conditions.
>>>
>>> But your patch does not inherently protect against it, these state
>>> conditions sprinkled can easily race the reset.
>>>
>> The reasoning is:
>> - (a) I/O is started from the scan thread
>> - (b) reset triggers
>> - (b) reset aborts I/O
>> - (a) I/O returns with an error
>> - (a) checks the state and skips remaining I/O
>>
>>> Note that I do agree with patch #1, but I don't understand how patch
>>> #2 is fixing the problem other than narrowing the window at best.
>>>
>> Only if the state is not propagated to the scan thread.
>> My reasoning is that the thread will have a context switch (as it's 
>> doing I/O), and hence the state change will be propagated.
> 
> Still not guaranteed.
> 
> 
>>> Not sure what you mean by stall, are you referring to the case that
>>> it takes a long time to reconnect?
>>>
>> The stall is here (to stick with the above example):
>> - (a) scan thread issues I/O
>> - (b) reset triggers
>> - (b) reset aborts all I/O
>> - (a) I/O returns with an error
>> - (a) scan thread issues next I/O
>> - (b) flushes scan thread
> 
> But resets do not flush the scan thread (before your patch). That
> is what I'm failing to understand. the scan work is only flushed
> in nvme_remove_namespaces().
> 
>> - (a) I/O is held by the core as the controller is in reset
>> - (b) flush hangs
>>
>>> The solution is to make sure we drain the inflight I/O, not half-way
>>> trying to prevent them from happening.
>>>
>> See above. The problem is that we're draining I/O only once, but the
>> scan thread is issuing _several_ I/Os, of which only _one_ will be
>> aborted.
> 
> But with your patch #1 it will bail if it failed and the state was 
> propagated.
> 
>>>> Problem here is that I/O will be held during reset, and
>>>> resubmitted once reset is finished.
>>>>
>>>> The transport drivers work around this problem to terminate any I/O
>>>> prior to resetting the controller,
>>>
>>> It's not a workaround.
>>>
>> See above. Not a workaround, but not sufficient to let the scan thread
>> make progress, either.
>>
>>>> but this will only cover I/O which
>>>> had been pending _before_ reset was called.
>>>> Any I/O being started after that will still be held, and we would
>>>> live-lock trying to flush the scan thread.
>>>
>>> But resets do not flush the scan thread (before you restored it).
>>>
>> Yes. But the scan thread will tear down namespaces when running in 
>> parallel with resets, and then we're running into a race condition.
> 
> But where is the "hang" you are referring to which flushes the scan
> thread?
> 
>>> But again, I want to understand after patch #1 is applied, what are the
>>> symptoms and why patch #2 is solving them.
> 
> Can we take a step back here? I think we are mixing two different issues
> here. in your trace you refer to a use-after-free panic when this
> patch is trying to solve a "hang" condition.
> 
> Can we understand what is the phenomenon after patch #1 is applied?
> 
>>>> static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>>> {
>>>>         struct nvme_ns *ns;
>>>>
>>>>         ns = nvme_find_get_ns(ctrl, nsid);
>>>>         if (ns) {
>>>>                 if (ns->disk && revalidate_disk(ns->disk))
>>>>                         nvme_ns_remove(ns);
>>>>                 nvme_put_ns(ns);
>>>>         } else
>>>>                 nvme_alloc_ns(ctrl, nsid);
>>>> }
>>>>
>>>
>>> Which is what patch #1 is for right?
>>>
>> That's what I thought initially, too, but it turned out to be not 
>> sufficient.
> 
> Not sufficient because it hangs? or panics?
> 
>>> We can have an incremental patch that moves the ns removal from
>>> ctrl->namespaces also before.
>>>
>> Again, I'm not sure if that's sufficient.
> 
> [...]
> 
>> Ultimately, it's the imbalance in nvme_ns_remove() which is causing 
>> this crash.
>> nvme_ns_remove() will call blk_cleanup_queue() _before_ removing the 
>> namespace from the list.
> 
> This is what I suggest as an incremental change to the patch I
> referenced. blk_cleanup_queue() is called after the ns is removed
> from the list.
> 
>> which means that any call to nvme_stop_queues() or nvme_start_queues() 
>> happening between blk_cleanup_queue() and list_del() will be hitting 
>> this issue.
> 
> This is the use-after-free again right?
> 
>> And without an explicit flush of the scan thread we cannot avoid this.
>> Leaving us with the choice of either rebalance nvme_ns_remove() (which 
>> is what the patchset from Ming Lei tried to do) or we flush the scan 
>> thread, which is what I have been doing.
> 
> You may be right, I just need a little help getting there as so far I'm
> not convinced that there is no other way. I'm also not 100% clear
> on what is the phenomenon after patch #1 is applied, is it a crash or
> a hang?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-21  7:59         ` Hannes Reinecke
@ 2019-06-21 17:23           ` James Smart
  2019-06-21 17:23           ` James Smart
  2019-06-24  3:29           ` Ming Lei
  2 siblings, 0 replies; 28+ messages in thread
From: James Smart @ 2019-06-21 17:23 UTC (permalink / raw)



>> Your check can't help wrt. the deadlock, for example:
>>
>> 1) in scan work context:
>>
>> - blk_mq_freeze_queue() is being started after passing the controller
>>    state check
>>
>> 2) timeout & reset is triggered in another context at the exact same time:
>>
>> - all in-flight IOs won't be freed until disable controller & reset is done.
>>
>> - now flush_work() in reset context can't move on, because
>>    blk_mq_freeze_queue() in scan context can't make progress.
>>
> There might be a difference between RDMA and FC implementations; for FC
> we terminate all outstanding I/Os from the HW side, so each I/O will be
> returned with an aborted status.
> Which, for all tests I (and NetApp :-) did, was enough to get
> 'blk_mq_freeze_queue()' unstuck and the flush_work to complete.
> We _did_ observe, however, that the state checks are absolutely
> critical to this; otherwise we indeed ended up with a stuck flush_work().

RDMA and TCP do the same thing at what is basically the same point -
as far as terminating all outstanding I/O goes. The difference is how they
terminate: RDMA and TCP use nvme_cancel_request() rather than a call
that induces work on the link, and nvme_cancel_request() is near immediate.
FC will take longer for the I/Os to clear - that's the difference.
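
Roughly what nvme_cancel_request() does (sketched from memory of that era's
drivers/nvme/host/core.c, so details such as the exact completion helper may
differ slightly):

void nvme_cancel_request(struct request *req, void *data, bool reserved)
{
        dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
                                "Cancelling I/O %d", req->tag);

        /* fake an abort status and complete the request purely in software */
        nvme_req(req)->status = NVME_SC_ABORT_REQ;
        blk_mq_complete_request(req);
}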

-- james

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-21  7:59         ` Hannes Reinecke
  2019-06-21 17:23           ` James Smart
  2019-06-21 17:23           ` James Smart
@ 2019-06-24  3:29           ` Ming Lei
  2 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-06-24  3:29 UTC (permalink / raw)


On Fri, Jun 21, 2019@09:59:02AM +0200, Hannes Reinecke wrote:
> On 6/21/19 8:58 AM, Ming Lei wrote:
> > On Fri, Jun 21, 2019@08:14:45AM +0200, Hannes Reinecke wrote:
> >> On 6/20/19 3:36 AM, Ming Lei wrote:
> >>> On Tue, Jun 18, 2019@12:10:25PM +0200, Hannes Reinecke wrote:
> >>>> When resetting the controller there is no point whatsoever to
> >>>> have a scan run in parallel; we cannot access the controller and
> >>>
> >>> scan won't be run in parallel, because .scan_work is embedded in
> >>> 'struct nvme_ctrl' which is per-HBA.
> >>>
> >> Wrong. We do.
> >> Not sure why having it embedded in the controller structure might
> >> prevent this from happening; both reset and scan are embedded, but
> >> running on different queues:
> > 
> > I mean the scan_work function itself is run exclusively, but yes, it can be 
> > run when resetting is in-progress.
> > 
> >>
> >> void nvme_queue_scan(struct nvme_ctrl *ctrl)
> >> {
> >> 	/*
> >> 	 * Only new queue scan work when admin and IO queues are both alive
> >> 	 */
> >> 	if (ctrl->state == NVME_CTRL_LIVE)
> >> 		queue_work(nvme_wq, &ctrl->scan_work);
> >> }
> >> EXPORT_SYMBOL_GPL(nvme_queue_scan);
> >>
> >> int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
> >> {
> >> 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >> 		return -EBUSY;
> >> 	if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
> >> 		return -EBUSY;
> >> 	return 0;
> >> }
> >> EXPORT_SYMBOL_GPL(nvme_reset_ctrl);
> >>
> >> So there's nothing stopping them to run in parallel.
> >>
> >>>> we cannot tell which devices are present and which not.
> >>>> Additionally we'll run a scan after reset anyway.
> >>>> So flush existing scans before reconnecting, ensuring to
> >>>> short-circuit the scan workqueue function if the controller state
> >>>> isn't live to avoid lockups.
> >>>
> >>> This way may cause dead-lock.
> >>>
> >>> 1) nvme_revalidate_disk() might freeze queue in flush context, however
> >>> any in-flight requests won't be completed until reset is done, so
> >>> deadlock may be caused by flushing scans in reset context.
> >>>
> >> Which is why I'm checking the controller state; I've observed the
> >> deadlock plenty of times before introducing the controller state check.
> > 
> > Your check can't help wrt. the deadlock, for example:
> > 
> > 1) in scan work context:
> > 
> > - blk_mq_freeze_queue() is being started after passing the controller
> >   state check
> > 
> > 2) timeout & reset is triggered in another context at the exact same time:
> > 
> > - all in-flight IOs won't be freed until disable controller & reset is done.
> > 
> > - now flush_work() in reset context can't move on, because
> >   blk_mq_freeze_queue() in scan context can't make progress.
> > 
> There might be a difference between RDMA and FC implementations; for FC
> we terminate all outstanding I/Os from the HW side, so each I/O will be
> returned with an aborted status.
> Which, for all tests I (and NetApp :-) did, was enough to get
> 'blk_mq_freeze_queue()' unstuck and the flush_work to complete.
> We _did_ observe, however, that the state checks are absolutely
> critical to this; otherwise we indeed ended up with a stuck flush_work().

I have explained that the check can't fix the issue completely, and it is
more like a workaround.

> 
> >>
> >>> 2) sync IO may be involved in revalidate_disk() which is called in
> >>> scan context, so deadlock is caused for same reason with 1).
> >>>
> >> I/O during revalidate_disk() is protected by the state check, too, so we
> >> won't be issuing any I/O during resetting.
> >>
> >> To be precise: any I/O in flight when reset is triggered will be
> >> terminated, and any subsequent I/O is short-circuited by the state check.
> > 
> > No, any I/O in flight before resetting is just terminated from the hardware
> > side, but it is still sitting in the blk-mq sw or scheduler queue, so neither
> > sync IO nor queue freezing will make progress.
> > 
> > Please see nvme_complete_rq(); all this IO will usually be retried.
> > 
> For the multipath case the request will be requeued via
> blk_steal_bios(), and the original request will be completed.

The sync IO during scan is actually waiting for bio completion, not request
completion.

So the issue is still there in the multipath case.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-21 16:26             ` Sagi Grimberg
@ 2019-06-24  5:48               ` Hannes Reinecke
  2019-06-24  6:13               ` Hannes Reinecke
  1 sibling, 0 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-24  5:48 UTC (permalink / raw)


On 6/21/19 6:26 PM, Sagi Grimberg wrote:
> Ping?
> 
> Hannes did you get to look into my questions?
> 
Yeah, I did, but bugzilla was down on Friday so I couldn't look up the details.
More to follow :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-21 16:26             ` Sagi Grimberg
  2019-06-24  5:48               ` Hannes Reinecke
@ 2019-06-24  6:13               ` Hannes Reinecke
  2019-06-24 18:08                 ` Sagi Grimberg
  1 sibling, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-24  6:13 UTC (permalink / raw)


On 6/21/19 6:26 PM, Sagi Grimberg wrote:
> Ping?
> 
> Hannes did you get to look into my questions?
> 
> On 6/19/19 1:04 PM, Sagi Grimberg wrote:
>>
>>>>>>> When resetting the controller there is no point whatsoever to
>>>>>>> have a scan run in parallel;
>>>>>>
>>>>>> There is also no point of trying to prevent it.
>>>>>>
>>>>> Oh, I don't mind having a scan running in parallel once reset is
>>>>> underway; that will be properly short-circuited anyway.
>>>>> What I object to is having scanning and resetting _starting_ at the
>>>>> same
>>>>> time, as then we incur all sorts of race conditions.
>>>>
>>>> But your patch does not inherently protect against it, these state
>>>> conditions sprinkled can easily race the reset.
>>>>
>>> The reasoning is:
>>> - (a) I/O is started from the scan thread
>>> - (b) reset triggers
>>> - (b) reset aborts I/O
>>> - (a) I/O returns with an error
>>> - (a) checks the state and skips remaining I/O
>>>
>>>> Note that I do agree with patch #1, but I don't understand how patch
>>>> #2 is fixing the problem other than narrowing the window at best.
>>>>
>>> Only if the state is not propagated to the scan thread.
>>> My reasoning is that the thread will have a context switch (as it's
>>> doing I/O), and hence the state change will be propagated.
>>
>> Still not guaranteed.
>>
>>
>>>> Not sure what you mean by stall, are you referring to the case that
>>>> it takes a long time to reconnect?
>>>>
>>> The stall is here (to stick with the above example):
>>> - (a) scan thread issues I/O
>>> - (b) reset triggers
>>> - (b) reset aborts all I/O
>>> - (a) I/O returns with an error
>>> - (a) scan thread issues next I/O
>>> - (b) flushes scan thread
>>
>> But resets do not flush the scan thread (before your patch). That
>> is what I'm failing to understand. the scan work is only flushed
>> in nvme_remove_namespaces().
>>
>>> - (a) I/O is held by the core as the controller is in reset
>>> - (b) flush hangs
>>>
>>>> The solution is to make sure we drain the inflight I/O, not half-way
>>>> trying to prevent them from happening.
>>>>
>>> See above. The problem is that we're draining I/O only once, but the
>>> scan thread is issuing _several_ I/Os, of which only _one_ will be
>>> aborted.
>>
>> But with your patch #1 it will bail if it failed and the state was
>> propagated.
>>
>>>>> Problem here is that I/O will be held during reset, and
>>>>> resubmitted once reset is finished.
>>>>>
>>>>> The transport drivers work around this problem to terminate any I/O
>>>>> prior to resetting the controller,
>>>>
>>>> It's not a workaround.
>>>>
>>> See above. Not a workaround, but not sufficient to let the scan thread
>>> make progress, either.
>>>
>>>>> but this will only cover I/O which
>>>>> had been pending _before_ reset was called.
>>>>> Any I/O being started after that will still be held, and we would
>>>>> live-lock trying to flush the scan thread.
>>>>
>>>> But resets do not flush the scan thread (before you restored it).
>>>>
>>> Yes. But the scan thread will tear down namespaces when running in
>>> parallel with resets, and then we're running into a race condition.
>>
>> But where is the "hang" you are referring to which flushes the scan
>> thread?
>>
>>>> But again, I want to understand after patch #1 is applied, what are the
>>>> symptoms and why patch #2 is solving them.
>>
>> Can we take a step back here? I think we are mixing two different issues
>> here. in your trace you refer to a use-after-free panic when this
>> patch is trying to solve a "hang" condition.
>>
>> Can we understand what is the phenomenon after patch #1 is applied?
>>
>>>>> static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>>>> {
>>>>>         struct nvme_ns *ns;
>>>>>
>>>>>         ns = nvme_find_get_ns(ctrl, nsid);
>>>>>         if (ns) {
>>>>>                 if (ns->disk && revalidate_disk(ns->disk))
>>>>>                         nvme_ns_remove(ns);
>>>>>                 nvme_put_ns(ns);
>>>>>         } else
>>>>>                 nvme_alloc_ns(ctrl, nsid);
>>>>> }
>>>>>
>>>>
>>>> Which is what patch #1 is for right?
>>>>
>>> That's what I thought initially, too, but it turned out to be not
>>> sufficient.
>>
>> Not sufficient because it hangs? or panics?
>>
It hangs, and we're seeing a warning:

kernel: [67088.344034] WARNING: CPU: 4 PID: 25020 at
../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
[ .. ]
kernel: [67088.344106] Call Trace:
kernel: [67088.344112]  blk_freeze_queue_start+0x2a/0x40
kernel: [67088.344114]  blk_freeze_queue+0xe/0x40
kernel: [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
kernel: [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
kernel: [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
kernel: [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
kernel: [67088.344130]  revalidate_disk+0x23/0xc0
kernel: [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
kernel: [67088.344137]  ? wake_up_q+0x70/0x70
kernel: [67088.344139]  ? blk_mq_free_request+0x12a/0x160
kernel: [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
kernel: [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
kernel: [67088.344149]  process_one_work+0x1da/0x400

From which I've inferred that we're still running a scan in parallel to
reset, and that the scan thread is calling 'blk_freeze_queue()' on a
queue which is already torn down.
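
The warning itself fires because blk_freeze_queue_start() tries to kill a
percpu ref that is already dead. A simplified sketch of the block-layer side
(not the exact code) to make that concrete:

void blk_freeze_queue_start(struct request_queue *q)
{
        if (atomic_inc_return(&q->mq_freeze_depth) == 1) {
                /*
                 * percpu_ref_kill() ends up in percpu_ref_kill_and_confirm(),
                 * which WARNs if q->q_usage_counter was already killed -
                 * e.g. by blk_cleanup_queue() racing in from nvme_ns_remove().
                 */
                percpu_ref_kill(&q->q_usage_counter);
                if (queue_is_mq(q))
                        blk_mq_run_hw_queues(q, false);
        }
}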

>>>> We can have an incremental patch that moves the ns removal from
>>>> ctrl->namespaces also before.
>>>>
>>> Again, I'm not sure if that's sufficient.
>>
>> [...]
>>
>>> Ultimately, it's the imbalance in nvme_ns_remove() which is causing
>>> this crash.
>>> nvme_ns_remove() will call blk_cleanup_queue() _before_ removing the
>>> namespace from the list.
>>
>> This is what I suggest as an incremental change to the patch I
>> referenced. blk_cleanup_queue() is called after the ns is removed
>> from the list.
>>
>>> which means that any call to nvme_stop_queues() or
>>> nvme_start_queues() happening between blk_cleanup_queue() and
>>> list_del() will be hitting this issue.
>>
>> This is the use-after-free again right?
>>
>>> And without an explicit flush of the scan thread we cannot avoid this.
>>> Leaving us with the choice of either rebalance nvme_ns_remove()
>>> (which is what the patchset from Ming Lei tried to do) or we flush
>>> the scan thread, which is what I have been doing.
>>
>> You may be right, I just need a little help getting there as so far I'm
>> not convinced that there is no other way. I'm also not 100% clear
>> on what is the phenomenon after patch #1 is applied, is it a crash or
>> a hang?

See above. The system hangs with a warning.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-24  6:13               ` Hannes Reinecke
@ 2019-06-24 18:08                 ` Sagi Grimberg
  2019-06-24 18:51                   ` James Smart
  2019-06-25  6:07                   ` Hannes Reinecke
  0 siblings, 2 replies; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-24 18:08 UTC (permalink / raw)



>>>> That's what I thought initially, too, but it turned out to be not
>>>> sufficient.
>>>
>>> Not sufficient because it hangs? or panics?
>>>
> It hangs, and we're seeing a warning:
> 
> kernel: [67088.344034] WARNING: CPU: 4 PID: 25020 at
> ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
> [ .. ]
> kernel: [67088.344106] Call Trace:
> kernel: [67088.344112]  blk_freeze_queue_start+0x2a/0x40
> kernel: [67088.344114]  blk_freeze_queue+0xe/0x40
> kernel: [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
> kernel: [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
> kernel: [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
> kernel: [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
> kernel: [67088.344130]  revalidate_disk+0x23/0xc0
> kernel: [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
> kernel: [67088.344137]  ? wake_up_q+0x70/0x70
> kernel: [67088.344139]  ? blk_mq_free_request+0x12a/0x160
> kernel: [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
> kernel: [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
> kernel: [67088.344149]  process_one_work+0x1da/0x400
> 
>  From which I've inferred that we're still running a scan in parallel to
> reset, and that the scan thread is calling 'blk_freeze_queue()' on a
> queue which is already torn down.


Where is the scan triggered from? there is no scan call from the reset
path.

Is there a namespace removal or something else that triggers AEN
to make this happen?

What exactly is the scenario?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-24 18:08                 ` Sagi Grimberg
@ 2019-06-24 18:51                   ` James Smart
  2019-06-25  6:07                   ` Hannes Reinecke
  1 sibling, 0 replies; 28+ messages in thread
From: James Smart @ 2019-06-24 18:51 UTC (permalink / raw)




On 6/24/2019 11:08 AM, Sagi Grimberg wrote:
>
>>>>> That's what I thought initially, too, but it turned out to be not
>>>>> sufficient.
>>>>
>>>> Not sufficient because it hangs? or panics?
>>>>
>> It hangs, and we're seeing a warning:
>>
>> kernel: [67088.344034] WARNING: CPU: 4 PID: 25020 at
>> ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
>> [ .. ]
>> kernel: [67088.344106] Call Trace:
>> kernel: [67088.344112]  blk_freeze_queue_start+0x2a/0x40
>> kernel: [67088.344114]  blk_freeze_queue+0xe/0x40
>> kernel: [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
>> kernel: [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
>> kernel: [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
>> kernel: [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
>> kernel: [67088.344130]  revalidate_disk+0x23/0xc0
>> kernel: [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
>> kernel: [67088.344137]  ? wake_up_q+0x70/0x70
>> kernel: [67088.344139]  ? blk_mq_free_request+0x12a/0x160
>> kernel: [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
>> kernel: [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
>> kernel: [67088.344149]  process_one_work+0x1da/0x400
>>
>> From which I've inferred that we're still running a scan in parallel to
>> reset, and that the scan thread is calling 'blk_freeze_queue()' on a
>> queue which is already torn down.
>
>
> Where is the scan triggered from? there is no scan call from the reset
> path.
>
> Is there a namespace removal or something else that triggers AEN
> to make this happen?
>
> What exactly is the scenario?

I believe this is from quick link bounce or logout/login testing on FC.

The loss of connectivity to the subsystem causes a reset to be done on 
the controller. Re-login causes the transport to send a discovery scan 
-or- the reconnect timer will fire, causing the controller to reconnect 
and issue the scan.

These resets and reconnects are right on the heels of each other - thus
the scan may not be finished before the reset occurs. The reset will complete,
as the I/Os it knows about are cleaned up.

-- james

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-24 18:08                 ` Sagi Grimberg
  2019-06-24 18:51                   ` James Smart
@ 2019-06-25  6:07                   ` Hannes Reinecke
  2019-06-25 21:50                     ` Sagi Grimberg
  1 sibling, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-25  6:07 UTC (permalink / raw)


On 6/24/19 8:08 PM, Sagi Grimberg wrote:
> 
>>>>> That's what I thought initially, too, but it turned out to be not
>>>>> sufficient.
>>>>
>>>> Not sufficient because it hangs? or panics?
>>>>
>> It hangs, and we're seeing a warning:
>>
>> kernel: [67088.344034] WARNING: CPU: 4 PID: 25020 at
>> ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
>> [ .. ]
>> kernel: [67088.344106] Call Trace:
>> kernel: [67088.344112]  blk_freeze_queue_start+0x2a/0x40
>> kernel: [67088.344114]  blk_freeze_queue+0xe/0x40
>> kernel: [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
>> kernel: [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
>> kernel: [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
>> kernel: [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
>> kernel: [67088.344130]  revalidate_disk+0x23/0xc0
>> kernel: [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
>> kernel: [67088.344137]  ? wake_up_q+0x70/0x70
>> kernel: [67088.344139]  ? blk_mq_free_request+0x12a/0x160
>> kernel: [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
>> kernel: [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
>> kernel: [67088.344149]  process_one_work+0x1da/0x400
>>
>> From which I've inferred that we're still running a scan in parallel to
>> reset, and that the scan thread is calling 'blk_freeze_queue()' on a
>> queue which is already torn down.
> 
> 
> Where is the scan triggered from? there is no scan call from the reset
> path.
> 
It's triggered from AEN, being received around the same time when reset
triggers.
There's actually a chance that the AEN handling itself triggered the
reset, but I haven't been able to decipher that from the crash dump.

> Is there a namespace removal or something else that triggers AEN
> to make this happen?
> 
> What exactly is the scenario?

The scenario is multiple storage failover on NetApp OnTAP while I/O is
running.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-25  6:07                   ` Hannes Reinecke
@ 2019-06-25 21:50                     ` Sagi Grimberg
  2019-06-26  5:34                       ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-25 21:50 UTC (permalink / raw)



>>>>>> That's what I thought initially, too, but it turned out to be not
>>>>>> sufficient.
>>>>>
>>>>> Not sufficient because it hangs? or panics?
>>>>>
>>> It hangs, and we're seeing a warning:
>>>
>>> kernel: [67088.344034] WARNING: CPU: 4 PID: 25020 at
>>> ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
>>> [ .. ]
>>> kernel: [67088.344106] Call Trace:
>>> kernel: [67088.344112]  blk_freeze_queue_start+0x2a/0x40
>>> kernel: [67088.344114]  blk_freeze_queue+0xe/0x40
>>> kernel: [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
>>> kernel: [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
>>> kernel: [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
>>> kernel: [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
>>> kernel: [67088.344130]  revalidate_disk+0x23/0xc0
>>> kernel: [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
>>> kernel: [67088.344137]  ? wake_up_q+0x70/0x70
>>> kernel: [67088.344139]  ? blk_mq_free_request+0x12a/0x160
>>> kernel: [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
>>> kernel: [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
>>> kernel: [67088.344149]  process_one_work+0x1da/0x400
>>>
>>>  From which I've inferred that we're still running a scan in parallel to
>>> reset, and that the scan thread is calling 'blk_freeze_queue()' on a
>>> queue which is already torn down.
>>
>>
>> Where is the scan triggered from? there is no scan call from the reset
>> path.
>>
> It's triggered from AEN, being received around the same time when reset
> triggers.
> There's actually a change that the AEN handling itself triggered the
> reset, but I haven't be able to decipher that from the crash dump.
> 
>> Is there a namespace removal or something else that triggers AEN
>> to make this happen?
>>
>> What exactly is the scenario?
> 
> The scenario is multiple storage failover on NetApp OnTAP while I/O is
> running.

Hannes,

I'm still not convinced that the transports need to flush the scan work
on resets.

Does the below help as an alternative:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 024fb219de17..074bcb1e797a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1665,6 +1665,10 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
  {
         struct nvme_ns *ns = disk->private_data;

+       /* if ns is removing we cannot mangle with the request queue */
+       if (test_bit(NVME_NS_REMOVING, &ns->flags))
+               return;
+
         /*
          * If identify namespace failed, use default 512 byte block size so
          * block layer can use before failing read/write for 0 capacity.
--

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-25 21:50                     ` Sagi Grimberg
@ 2019-06-26  5:34                       ` Hannes Reinecke
  2019-06-26 20:22                         ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-06-26  5:34 UTC (permalink / raw)


On 6/25/19 11:50 PM, Sagi Grimberg wrote:
> 
>>>>>>> That's what I thought initially, too, but it turned out to be not
>>>>>>> sufficient.
>>>>>>
>>>>>> Not sufficient because it hangs? or panics?
>>>>>>
>>>> It hangs, and we're seeing a warning:
>>>>
>>>> kernel: [67088.344034] WARNING: CPU: 4 PID: 25020 at
>>>> ../lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x7a/0xa0
>>>> [ .. ]
>>>> kernel: [67088.344106] Call Trace:
>>>> kernel: [67088.344112]  blk_freeze_queue_start+0x2a/0x40
>>>> kernel: [67088.344114]  blk_freeze_queue+0xe/0x40
>>>> kernel: [67088.344118]  nvme_update_disk_info+0x36/0x260 [nvme_core]
>>>> kernel: [67088.344122]  __nvme_revalidate_disk+0xca/0xf0 [nvme_core]
>>>> kernel: [67088.344125]  nvme_revalidate_disk+0xa6/0x120 [nvme_core]
>>>> kernel: [67088.344127]  ? blk_mq_get_tag+0xa3/0x220
>>>> kernel: [67088.344130]  revalidate_disk+0x23/0xc0
>>>> kernel: [67088.344133]  nvme_validate_ns+0x43/0x830 [nvme_core]
>>>> kernel: [67088.344137]  ? wake_up_q+0x70/0x70
>>>> kernel: [67088.344139]  ? blk_mq_free_request+0x12a/0x160
>>>> kernel: [67088.344142]  ? __nvme_submit_sync_cmd+0x73/0xe0 [nvme_core]
>>>> kernel: [67088.344145]  nvme_scan_work+0x2b3/0x350 [nvme_core]
>>>> kernel: [67088.344149]  process_one_work+0x1da/0x400
>>>>
>>>> From which I've inferred that we're still running a scan in parallel to
>>>> reset, and that the scan thread is calling 'blk_freeze_queue()' on a
>>>> queue which is already torn down.
>>>
>>>
>>> Where is the scan triggered from? there is no scan call from the reset
>>> path.
>>>
>> It's triggered from AEN, being received around the same time when reset
>> triggers.
>> There's actually a chance that the AEN handling itself triggered the
>> reset, but I haven't been able to decipher that from the crash dump.
>>
>>> Is there a namespace removal or something else that triggers AEN
>>> to make this happen?
>>>
>>> What exactly is the scenario?
>>
>> The scenario is multiple storage failover on NetApp OnTAP while I/O is
>> running.
> 
> Hannes,
> 
> I'm still not convinced that the transports need to flush the scan work
> on resets.
> 
> Does the below help as an alternative:
> -- 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 024fb219de17..074bcb1e797a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1665,6 +1665,10 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
>  {
>         struct nvme_ns *ns = disk->private_data;
> 
> +       /* if ns is removing we cannot mangle with the request queue */
> +       if (test_bit(NVME_NS_REMOVING, &ns->flags))
> +               return;
> +
>         /*
>          * If identify namespace failed, use default 512 byte block size so
>          * block layer can use before failing read/write for 0 capacity.
> -- 

Hmm. Not sure if that is sufficient, though.
We do issue several I/O in the course of nvme_revalidate_disk(), and as 
we're running fully asynchronous here the disk might be removed at any 
given time. Cf the above stack trace; we're getting an error for the 
_second_ I/O (nvme_identify_ns is issuing I/O, too, but went through 
without errors).
So if we were to go that route we'd need to protect each and every I/O 
with that test.
Which then would be similar to my patch, only the 'resetting' state 
check replaced by the 'removing' check.
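
Illustratively, the kind of sprinkling meant here would be a guard in front of
each synchronous command issued from the scan/revalidate path (hypothetical
code, just to show the shape of it):

        /*
         * Hypothetical guard, repeated before every sync command in the
         * scan path - which is why it ends up looking much like the
         * controller-state check again.
         */
        if (test_bit(NVME_NS_REMOVING, &ns->flags))
                return -ENODEV;
        /* ... issue the next Identify / Get Log Page ... */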

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare at suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-26  5:34                       ` Hannes Reinecke
@ 2019-06-26 20:22                         ` Sagi Grimberg
  2019-07-02  5:38                           ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-06-26 20:22 UTC (permalink / raw)



> Hmm. Not sure if that is sufficient, though.
> We do issue several I/O in the course of nvme_revalidate_disk(), and as 
> we're running fully asynchronous here the disk might be removed at any 
> given time. Cf the above stack trace; we're getting an error for the 
> _second_ I/O (nvme_identify_ns is issuing I/O, too, but went through 
> without errors).

It's not protecting against the I/O, but rather against the queue freeze.

> So if we were to go that route we'd need to protect each and every I/O 
> with that test.

Why?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-06-26 20:22                         ` Sagi Grimberg
@ 2019-07-02  5:38                           ` Sagi Grimberg
  2019-07-02 13:29                             ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-07-02  5:38 UTC (permalink / raw)



>> Hmm. Not sure if that is sufficient, though.
>> We do issue several I/O in the course of nvme_revalidate_disk(), and 
>> as we're running fully asynchronous here the disk might be removed at 
>> any given time. Cf the above stack trace; we're getting an error for 
>> the _second_ I/O (nvme_identify_ns is issuing I/O, too, but went 
>> through without errors).
> 
> It's not protecting against the I/O, but rather against the queue freeze.
> 
>> So if we were to go that route we'd need to protect each and every I/O 
>> with that test.
> 
> Why?

Hannes?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme: flush scan_work when resetting controller
  2019-07-02  5:38                           ` Sagi Grimberg
@ 2019-07-02 13:29                             ` Hannes Reinecke
  0 siblings, 0 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-07-02 13:29 UTC (permalink / raw)


On 7/2/19 7:38 AM, Sagi Grimberg wrote:
> 
>>> Hmm. Not sure if that is sufficient, though.
>>> We do issue several I/O in the course of nvme_revalidate_disk(), and
>>> as we're running fully asynchronous here the disk might be removed at
>>> any given time. Cf the above stack trace; we're getting an error for
>>> the _second_ I/O (nvme_identify_ns is issuing I/O, too, but went
>>> through without errors).
>>
>> It's not protecting against the I/O, but rather against the queue freeze.
>>
>>> So if we were to go that route we'd need to protect each and every
>>> I/O with that test.
>>
>> Why?
> 
> Hannes?

Ah, right.

So, I had been operating under the impression that we crash due to an
invalid namespace access when doing I/O with the rescan thread.
Your assumption is slightly different, namely that the scan thread is
crashing trying to quiesce I/O, but normal I/O would be handled just fine.

I'll give your patch a spin and will let you know the results.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-07-02 13:29 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-18 10:10 [PATCH 0/2] nvme: flush rescan worker before resetting Hannes Reinecke
2019-06-18 10:10 ` [PATCH 1/2] nvme: Do not remove namespaces during reset Hannes Reinecke
2019-06-18 17:30   ` Sagi Grimberg
2019-06-20  1:22   ` Ming Lei
2019-06-18 10:10 ` [PATCH 2/2] nvme: flush scan_work when resetting controller Hannes Reinecke
2019-06-18 17:41   ` Sagi Grimberg
2019-06-19  6:22     ` Hannes Reinecke
2019-06-19 16:56       ` Sagi Grimberg
2019-06-19 18:45         ` Hannes Reinecke
2019-06-19 20:04           ` Sagi Grimberg
2019-06-21 16:26             ` Sagi Grimberg
2019-06-24  5:48               ` Hannes Reinecke
2019-06-24  6:13               ` Hannes Reinecke
2019-06-24 18:08                 ` Sagi Grimberg
2019-06-24 18:51                   ` James Smart
2019-06-25  6:07                   ` Hannes Reinecke
2019-06-25 21:50                     ` Sagi Grimberg
2019-06-26  5:34                       ` Hannes Reinecke
2019-06-26 20:22                         ` Sagi Grimberg
2019-07-02  5:38                           ` Sagi Grimberg
2019-07-02 13:29                             ` Hannes Reinecke
2019-06-20  1:36   ` Ming Lei
2019-06-21  6:14     ` Hannes Reinecke
2019-06-21  6:58       ` Ming Lei
2019-06-21  7:59         ` Hannes Reinecke
2019-06-21 17:23           ` James Smart
2019-06-24  3:29           ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.