linux-block.vger.kernel.org archive mirror
* [PATCH] block: re-introduce blk_mq_complete_request_sync
@ 2020-10-08 21:37 Sagi Grimberg
  2020-10-09  4:39 ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2020-10-08 21:37 UTC (permalink / raw)
  To: linux-nvme, linux-block, Christoph Hellwig, Keith Busch
  Cc: Yi Zhang, Ming Lei, Jens Axboe

In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
needs to complete an aborted request. The timeout handler serializes
against an error recovery sequence that iterates over the inflight
requests and cancels them.

However, blk_mq_complete_request may defer the completion to a different
core, which breaks that serialization.

Hence re-introduce blk_mq_complete_request_sync and use that in the
timeout handler to make sure that we don't get a use-after-free
condition.
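
To make the failure mode concrete, here is a rough sketch of the timed-out
path (lifted from the tcp hunk below; the rdma side is analogous, and the
comments are purely illustrative):
--
	/* timeout handler, serialized against error recovery teardown */
	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
	if (!blk_mq_request_completed(rq)) {
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		/*
		 * blk_mq_complete_request() may only schedule the completion
		 * on the submitting CPU (IPI/softirq), so ->complete() can
		 * still run after the lock is dropped, racing with the
		 * cancel iteration in error recovery -> refcount underflow.
		 * blk_mq_complete_request_sync() runs it here, under the lock.
		 */
		blk_mq_complete_request_sync(rq);
	}
	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
--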

This was uncovered by the blktests:
--
$ nvme_trtype=tcp ./check nvme/012

[ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
[ 2152.799640] loop: module loaded
[ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 2152.937889] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
[ 2152.952085] nvme nvme0: creating 12 I/O queues.
[ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
[ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
[ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
[ 2154.942432] XFS (nvme0n1): Ending clean mount
[ 2154.948183] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
[ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
[ 2215.199331] nvme nvme0: starting error recovery
[ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
[ 2215.204483] block nvme0n1: no usable path - requeuing I/O
[ 2215.214976] block nvme0n1: no usable path - requeuing I/O
[ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
[ 2215.215502] ------------[ cut here ]------------
[ 2215.215505] refcount_t: underflow; use-after-free.
[ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
[ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
[ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff93f37dcd8d08
[ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff93f37dcd8d00
[ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: ffffffffa7fffbc4
[ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: ffff93f37a8124e8
[ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: ffff93f37a2b0000
[ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) knlGS:0000000000000000
[ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 00000000000006e0
[ 2215.215773] Call Trace:
[ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
[ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
[ 2215.215801]  ? blk_poll+0x300/0x300
[ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
[ 2215.215814]  process_one_work+0x1b4/0x370
[ 2215.215820]  worker_thread+0x53/0x3e0
[ 2215.215825]  ? process_one_work+0x370/0x370
[ 2215.215829]  kthread+0x11b/0x140
[ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
[ 2215.215841]  ret_from_fork+0x22/0x30
[ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
--

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Yi, would appreciate your Tested-by tag

 block/blk-mq.c           | 7 +++++++
 drivers/nvme/host/rdma.c | 2 +-
 drivers/nvme/host/tcp.c  | 2 +-
 include/linux/blk-mq.h   | 1 +
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f8e1e759c905..2d154722ef39 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
 
+void blk_mq_complete_request_sync(struct request *rq)
+{
+	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
+	rq->q->mq_ops->complete(rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9e378d0a0c01..45fc605349da 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1975,7 +1975,7 @@ static void nvme_rdma_complete_timed_out(struct request *rq)
 	nvme_rdma_stop_queue(queue);
 	if (!blk_mq_request_completed(rq)) {
 		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
-		blk_mq_complete_request(rq);
+		blk_mq_complete_request_sync(rq);
 	}
 	mutex_unlock(&ctrl->teardown_lock);
 }
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8f4f29f18b8c..629b025685d1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
 	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
 	if (!blk_mq_request_completed(rq)) {
 		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
-		blk_mq_complete_request(rq);
+		blk_mq_complete_request_sync(rq);
 	}
 	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 90da3582b91d..f90c258b5075 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -491,6 +491,7 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list);
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
+void blk_mq_complete_request_sync(struct request *rq);
 bool blk_mq_complete_request_remote(struct request *rq);
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-08 21:37 [PATCH] block: re-introduce blk_mq_complete_request_sync Sagi Grimberg
@ 2020-10-09  4:39 ` Ming Lei
  2020-10-09  5:03   ` Yi Zhang
  2020-10-09  8:11   ` Sagi Grimberg
  0 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2020-10-09  4:39 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, Christoph Hellwig, Keith Busch,
	Yi Zhang, Jens Axboe

On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
> needs to complete an aborted request. The timeout handler serializes
> against an error recovery sequence that iterates over the inflight
> requests and cancels them.
> 
> However, blk_mq_complete_request may defer the completion to a different
> core, which breaks that serialization.
> 
> Hence re-introduce blk_mq_complete_request_sync and use that in the
> timeout handler to make sure that we don't get a use-after-free
> condition.
> 
> This was uncovered by the blktests:
> --
> $ nvme_trtype=tcp ./check nvme/012
> 
> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
> [ 2152.799640] loop: module loaded
> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> [ 2152.937889] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
> [ 2154.942432] XFS (nvme0n1): Ending clean mount
> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
> [ 2215.199331] nvme nvme0: starting error recovery
> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
> [ 2215.215502] ------------[ cut here ]------------
> [ 2215.215505] refcount_t: underflow; use-after-free.
> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff93f37dcd8d08
> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff93f37dcd8d00
> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: ffffffffa7fffbc4
> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: ffff93f37a8124e8
> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: ffff93f37a2b0000
> [ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) knlGS:0000000000000000
> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 00000000000006e0
> [ 2215.215773] Call Trace:
> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
> [ 2215.215801]  ? blk_poll+0x300/0x300
> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
> [ 2215.215814]  process_one_work+0x1b4/0x370
> [ 2215.215820]  worker_thread+0x53/0x3e0
> [ 2215.215825]  ? process_one_work+0x370/0x370
> [ 2215.215829]  kthread+0x11b/0x140
> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
> [ 2215.215841]  ret_from_fork+0x22/0x30
> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
> --
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Yi, would appreciate your Tested-by tag
> 
>  block/blk-mq.c           | 7 +++++++
>  drivers/nvme/host/rdma.c | 2 +-
>  drivers/nvme/host/tcp.c  | 2 +-
>  include/linux/blk-mq.h   | 1 +
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f8e1e759c905..2d154722ef39 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct request *rq)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>  
> +void blk_mq_complete_request_sync(struct request *rq)
> +{
> +	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
> +	rq->q->mq_ops->complete(rq);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
> +
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq:		the request being processed
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 9e378d0a0c01..45fc605349da 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1975,7 +1975,7 @@ static void nvme_rdma_complete_timed_out(struct request *rq)
>  	nvme_rdma_stop_queue(queue);
>  	if (!blk_mq_request_completed(rq)) {
>  		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> -		blk_mq_complete_request(rq);
> +		blk_mq_complete_request_sync(rq);
>  	}
>  	mutex_unlock(&ctrl->teardown_lock);
>  }
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 8f4f29f18b8c..629b025685d1 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>  	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>  	if (!blk_mq_request_completed(rq)) {
>  		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> -		blk_mq_complete_request(rq);
> +		blk_mq_complete_request_sync(rq);

Or complete the request in the following way? Then one block layer API
can be saved:

	blk_mq_complete_request_remote(rq);
	nvme_complete_rq(rq);
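
For illustration, one way this could slot into nvme_tcp_complete_timed_out()
(just a sketch, assuming the intent is to fall back to a direct completion
whenever blk_mq_complete_request_remote() does not punt the completion to
another CPU):

	if (!blk_mq_request_completed(rq)) {
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		/* complete in this context if nothing was scheduled remotely */
		if (!blk_mq_complete_request_remote(rq))
			nvme_complete_rq(rq);
	}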


thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-09  4:39 ` Ming Lei
@ 2020-10-09  5:03   ` Yi Zhang
  2020-10-09  8:09     ` Sagi Grimberg
  2020-10-09  8:11   ` Sagi Grimberg
  1 sibling, 1 reply; 26+ messages in thread
From: Yi Zhang @ 2020-10-09  5:03 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, linux-nvme, linux-block, Keith Busch, Christoph Hellwig

Hi Sagi

I applied this patch on block origin/for-next and still can reproduce it.

[  724.267865] run blktests nvme/012 at 2020-10-09 00:48:52
[  724.518498] loop: module loaded
[  724.552880] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  724.586609] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[  724.648898] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
[  724.663300] nvme nvme0: creating 12 I/O queues.
[  724.669826] nvme nvme0: mapped 12/0/0 default/read/poll queues.
[  724.681350] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
[  726.641605] XFS (nvme0n1): Mounting V5 Filesystem
[  726.654637] XFS (nvme0n1): Ending clean mount
[  726.660302] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
[  787.042066] nvme nvme0: queue 2: timeout request 0xe type 4
[  787.047649] nvme nvme0: starting error recovery
[  787.052211] nvme nvme0: queue 4: timeout request 0x61 type 4
[  787.053407] block nvme0n1: no usable path - requeuing I/O
[  787.054543] nvme nvme0: Reconnecting in 10 seconds...
[  787.057877] ------------[ cut here ]------------
[  787.063274] block nvme0n1: no usable path - requeuing I/O
[  787.068313] refcount_t: underflow; use-after-free.
[  787.068370] WARNING: CPU: 3 PID: 30 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
[  787.072941] block nvme0n1: no usable path - requeuing I/O
[  787.072945] block nvme0n1: no usable path - requeuing I/O
[  787.078333] Modules linked in: loop nvme_tcp nvme_fabrics nvme_core nvmet_tcp nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer amd64_edac_mod snd edac_mce_amd soundcore kvm_amd ccp kvm pcspkr irqbypass hp_wmi sparse_keymap rfkill k10temp wmi_bmof nv_tco i2c_nforce2 acpi_cpufreq ip_tables xfs nouveau video mxm_wmi i2c_algo_bit drm_kms_helper mptsas cec scsi_transport_sas ttm mptscsih ata_generic firewire_ohci pata_acpi firewire_core drm serio_raw mptbase forcedeth crc_itu_t sata_nv pata_amd wmi floppy
[  787.083142] block nvme0n1: no usable path - requeuing I/O
[  787.091304] CPU: 3 PID: 30 Comm: kworker/3:0H Not tainted 5.9.0-rc8.update+ #3
[  787.096673] block nvme0n1: no usable path - requeuing I/O
[  787.096674] block nvme0n1: no usable path - requeuing I/O
[  787.096680] block nvme0n1: no usable path - requeuing I/O
[  787.096681] block nvme0n1: no usable path - requeuing I/O
[  787.096683] block nvme0n1: no usable path - requeuing I/O
[  787.201242] Hardware name: Hewlett-Packard HP xw9400 Workstation/0A1Ch, BIOS 786D6 v04.02 04/21/2009
[  787.210366] Workqueue: kblockd blk_mq_timeout_work
[  787.215161] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
[  787.220295] Code: 05 aa 98 22 01 01 e8 9f cf ad ff 0f 0b c3 80 3d 98 98 22 01 00 75 95 48 c7 c7 48 54 3e a2 c6 05 88 98 22 01 01 e8 80 cf ad ff <0f> 0b c3 80 3d 77 98 22 01 00 0f 85 72 ff ff ff 48 c7 c7 a0 54 3e
[  787.239021] RSP: 0018:ffffb860407b3dc8 EFLAGS: 00010292
[  787.244240] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff9d93ddc58d08
[  787.251361] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d93ddc58d00
[  787.258480] RBP: ffff9d93dbd6c200 R08: 000000b740ed60e8 R09: ffffffffa2fffbc4
[  787.265612] R10: 0000000000000477 R11: 0000000000028320 R12: ffff9d93dbd6c2e8
[  787.272741] R13: ffff9d93dbea0000 R14: ffffb860407b3e70 R15: ffff9d93dbea0000
[  787.279864] FS:  0000000000000000(0000) GS:ffff9d93ddc40000(0000) knlGS:0000000000000000
[  787.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  787.293695] CR2: 00007f45fd8fa024 CR3: 00000000792dc000 CR4: 00000000000006e0
[  787.300820] Call Trace:
[  787.303277]  blk_mq_check_expired+0x109/0x1b0
[  787.307638]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
[  787.312524]  ? blk_poll+0x300/0x300
[  787.316022]  blk_mq_timeout_work+0x44/0xe0
[  787.320128]  process_one_work+0x1b4/0x370
[  787.324137]  worker_thread+0x53/0x3e0
[  787.327809]  ? process_one_work+0x370/0x370
[  787.332015]  kthread+0x11b/0x140
[  787.335245]  ? __kthread_bind_mask+0x60/0x60
[  787.339515]  ret_from_fork+0x22/0x30
[  787.343094] ---[ end trace 606cf9189379fcfc ]---


Best Regards,
  Yi Zhang


----- Original Message -----
From: "Ming Lei" <ming.lei@redhat.com>
To: "Sagi Grimberg" <sagi@grimberg.me>
Cc: "Jens Axboe" <axboe@kernel.dk>, "Yi Zhang" <yi.zhang@redhat.com>, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, "Keith Busch" <kbusch@kernel.org>, "Christoph Hellwig" <hch@lst.de>
Sent: Friday, October 9, 2020 12:39:38 PM
Subject: Re: [PATCH] block: re-introduce blk_mq_complete_request_sync

On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
> needs to complete an aborted request. The timeout handler serializes
> against an error recovery sequence that iterates over the inflight
> requests and cancels them.
> 
> However, blk_mq_complete_request may defer the completion to a different
> core, which breaks that serialization.
> 
> Hence re-introduce blk_mq_complete_request_sync and use that in the
> timeout handler to make sure that we don't get a use-after-free
> condition.
> 
> This was uncovered by the blktests:
> --
> $ nvme_trtype=tcp ./check nvme/012
> 
> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
> [ 2152.799640] loop: module loaded
> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> [ 2152.937889] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
> [ 2154.942432] XFS (nvme0n1): Ending clean mount
> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
> [ 2215.199331] nvme nvme0: starting error recovery
> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
> [ 2215.215502] ------------[ cut here ]------------
> [ 2215.215505] refcount_t: underflow; use-after-free.
> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff93f37dcd8d08
> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff93f37dcd8d00
> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: ffffffffa7fffbc4
> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: ffff93f37a8124e8
> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: ffff93f37a2b0000
> [ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) knlGS:0000000000000000
> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 00000000000006e0
> [ 2215.215773] Call Trace:
> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
> [ 2215.215801]  ? blk_poll+0x300/0x300
> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
> [ 2215.215814]  process_one_work+0x1b4/0x370
> [ 2215.215820]  worker_thread+0x53/0x3e0
> [ 2215.215825]  ? process_one_work+0x370/0x370
> [ 2215.215829]  kthread+0x11b/0x140
> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
> [ 2215.215841]  ret_from_fork+0x22/0x30
> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
> --
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Yi, would appreciate your Tested-by tag
> 
>  block/blk-mq.c           | 7 +++++++
>  drivers/nvme/host/rdma.c | 2 +-
>  drivers/nvme/host/tcp.c  | 2 +-
>  include/linux/blk-mq.h   | 1 +
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f8e1e759c905..2d154722ef39 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct request *rq)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>  
> +void blk_mq_complete_request_sync(struct request *rq)
> +{
> +	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
> +	rq->q->mq_ops->complete(rq);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
> +
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq:		the request being processed
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 9e378d0a0c01..45fc605349da 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1975,7 +1975,7 @@ static void nvme_rdma_complete_timed_out(struct request *rq)
>  	nvme_rdma_stop_queue(queue);
>  	if (!blk_mq_request_completed(rq)) {
>  		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> -		blk_mq_complete_request(rq);
> +		blk_mq_complete_request_sync(rq);
>  	}
>  	mutex_unlock(&ctrl->teardown_lock);
>  }
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 8f4f29f18b8c..629b025685d1 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>  	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>  	if (!blk_mq_request_completed(rq)) {
>  		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> -		blk_mq_complete_request(rq);
> +		blk_mq_complete_request_sync(rq);

Or complete the request in the following way? Then one block layer API
can be saved:

	blk_mq_complete_request_remote(rq);
	nvme_complete_rq(rq);


thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-09  5:03   ` Yi Zhang
@ 2020-10-09  8:09     ` Sagi Grimberg
  2020-10-09 13:55       ` Yi Zhang
  0 siblings, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2020-10-09  8:09 UTC (permalink / raw)
  To: Yi Zhang, Ming Lei
  Cc: Jens Axboe, linux-block, Keith Busch, linux-nvme, Christoph Hellwig

> Hi Sagi
> 
> I applied this patch on block origin/for-next and still can reproduce it.

That's unexpected, can you try this patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 629b025685d1..46428ff0b0fc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
         /* fence other contexts that may complete the command */
         mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
         nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
-       if (!blk_mq_request_completed(rq)) {
+       if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq)) {
                 nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                 blk_mq_complete_request_sync(rq);
         }
--

> 
> [  724.267865] run blktests nvme/012 at 2020-10-09 00:48:52
> [  724.518498] loop: module loaded
> [  724.552880] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [  724.586609] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> [  724.648898] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
> [  724.663300] nvme nvme0: creating 12 I/O queues.
> [  724.669826] nvme nvme0: mapped 12/0/0 default/read/poll queues.
> [  724.681350] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
> [  726.641605] XFS (nvme0n1): Mounting V5 Filesystem
> [  726.654637] XFS (nvme0n1): Ending clean mount
> [  726.660302] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
> [  787.042066] nvme nvme0: queue 2: timeout request 0xe type 4
> [  787.047649] nvme nvme0: starting error recovery
> [  787.052211] nvme nvme0: queue 4: timeout request 0x61 type 4
> [  787.053407] block nvme0n1: no usable path - requeuing I/O
> [  787.054543] nvme nvme0: Reconnecting in 10 seconds...
> [  787.057877] ------------[ cut here ]------------
> [  787.063274] block nvme0n1: no usable path - requeuing I/O
> [  787.068313] refcount_t: underflow; use-after-free.
> [  787.068370] WARNING: CPU: 3 PID: 30 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
> [  787.072941] block nvme0n1: no usable path - requeuing I/O
> [  787.072945] block nvme0n1: no usable path - requeuing I/O
> [  787.078333] Modules linked in: loop nvme_tcp nvme_fabrics nvme_core nvmet_tcp nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer amd64_edac_mod snd edac_mce_amd soundcore kvm_amd ccp kvm pcspkr irqbypass hp_wmi sparse_keymap rfkill k10temp wmi_bmof nv_tco i2c_nforce2 acpi_cpufreq ip_tables xfs nouveau video mxm_wmi i2c_algo_bit drm_kms_helper mptsas cec scsi_transport_sas ttm mptscsih ata_generic firewire_ohci pata_acpi firewire_core drm serio_raw mptbase forcedeth crc_itu_t sata_nv pata_amd wmi floppy
> [  787.083142] block nvme0n1: no usable path - requeuing I/O
> [  787.091304] CPU: 3 PID: 30 Comm: kworker/3:0H Not tainted 5.9.0-rc8.update+ #3
> [  787.096673] block nvme0n1: no usable path - requeuing I/O
> [  787.096674] block nvme0n1: no usable path - requeuing I/O
> [  787.096680] block nvme0n1: no usable path - requeuing I/O
> [  787.096681] block nvme0n1: no usable path - requeuing I/O
> [  787.096683] block nvme0n1: no usable path - requeuing I/O
> [  787.201242] Hardware name: Hewlett-Packard HP xw9400 Workstation/0A1Ch, BIOS 786D6 v04.02 04/21/2009
> [  787.210366] Workqueue: kblockd blk_mq_timeout_work
> [  787.215161] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
> [  787.220295] Code: 05 aa 98 22 01 01 e8 9f cf ad ff 0f 0b c3 80 3d 98 98 22 01 00 75 95 48 c7 c7 48 54 3e a2 c6 05 88 98 22 01 01 e8 80 cf ad ff <0f> 0b c3 80 3d 77 98 22 01 00 0f 85 72 ff ff ff 48 c7 c7 a0 54 3e
> [  787.239021] RSP: 0018:ffffb860407b3dc8 EFLAGS: 00010292
> [  787.244240] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff9d93ddc58d08
> [  787.251361] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d93ddc58d00
> [  787.258480] RBP: ffff9d93dbd6c200 R08: 000000b740ed60e8 R09: ffffffffa2fffbc4
> [  787.265612] R10: 0000000000000477 R11: 0000000000028320 R12: ffff9d93dbd6c2e8
> [  787.272741] R13: ffff9d93dbea0000 R14: ffffb860407b3e70 R15: ffff9d93dbea0000
> [  787.279864] FS:  0000000000000000(0000) GS:ffff9d93ddc40000(0000) knlGS:0000000000000000
> [  787.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  787.293695] CR2: 00007f45fd8fa024 CR3: 00000000792dc000 CR4: 00000000000006e0
> [  787.300820] Call Trace:
> [  787.303277]  blk_mq_check_expired+0x109/0x1b0
> [  787.307638]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
> [  787.312524]  ? blk_poll+0x300/0x300
> [  787.316022]  blk_mq_timeout_work+0x44/0xe0
> [  787.320128]  process_one_work+0x1b4/0x370
> [  787.324137]  worker_thread+0x53/0x3e0
> [  787.327809]  ? process_one_work+0x370/0x370
> [  787.332015]  kthread+0x11b/0x140
> [  787.335245]  ? __kthread_bind_mask+0x60/0x60
> [  787.339515]  ret_from_fork+0x22/0x30
> [  787.343094] ---[ end trace 606cf9189379fcfc ]---
> 
> 
> Best Regards,
>    Yi Zhang
> 
> 
> ----- Original Message -----
> From: "Ming Lei" <ming.lei@redhat.com>
> To: "Sagi Grimberg" <sagi@grimberg.me>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "Yi Zhang" <yi.zhang@redhat.com>, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, "Keith Busch" <kbusch@kernel.org>, "Christoph Hellwig" <hch@lst.de>
> Sent: Friday, October 9, 2020 12:39:38 PM
> Subject: Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
> 
> On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
>> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
>> needs to complete an aborted request. The timeout handler serializes
>> against an error recovery sequence that iterates over the inflight
>> requests and cancels them.
>>
>> However, blk_mq_complete_request may defer the completion to a different
>> core, which breaks that serialization.
>>
>> Hence re-introduce blk_mq_complete_request_sync and use that in the
>> timeout handler to make sure that we don't get a use-after-free
>> condition.
>>
>> This was uncovered by the blktests:
>> --
>> $ nvme_trtype=tcp ./check nvme/012
>>
>> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
>> [ 2152.799640] loop: module loaded
>> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>> [ 2152.937889] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
>> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
>> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
>> [ 2154.942432] XFS (nvme0n1): Ending clean mount
>> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
>> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
>> [ 2215.199331] nvme nvme0: starting error recovery
>> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
>> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
>> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
>> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
>> [ 2215.215502] ------------[ cut here ]------------
>> [ 2215.215505] refcount_t: underflow; use-after-free.
>> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
>> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
>> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff93f37dcd8d08
>> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff93f37dcd8d00
>> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: ffffffffa7fffbc4
>> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: ffff93f37a8124e8
>> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: ffff93f37a2b0000
>> [ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) knlGS:0000000000000000
>> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 00000000000006e0
>> [ 2215.215773] Call Trace:
>> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
>> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>> [ 2215.215801]  ? blk_poll+0x300/0x300
>> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
>> [ 2215.215814]  process_one_work+0x1b4/0x370
>> [ 2215.215820]  worker_thread+0x53/0x3e0
>> [ 2215.215825]  ? process_one_work+0x370/0x370
>> [ 2215.215829]  kthread+0x11b/0x140
>> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
>> [ 2215.215841]  ret_from_fork+0x22/0x30
>> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
>> --
>>
>> Reported-by: Yi Zhang <yi.zhang@redhat.com>
>> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
>> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> Yi, would appreciate your Tested-by tag
>>
>>   block/blk-mq.c           | 7 +++++++
>>   drivers/nvme/host/rdma.c | 2 +-
>>   drivers/nvme/host/tcp.c  | 2 +-
>>   include/linux/blk-mq.h   | 1 +
>>   4 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index f8e1e759c905..2d154722ef39 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct request *rq)
>>   }
>>   EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>>   
>> +void blk_mq_complete_request_sync(struct request *rq)
>> +{
>> +	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
>> +	rq->q->mq_ops->complete(rq);
>> +}
>> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
>> +
>>   /**
>>    * blk_mq_complete_request - end I/O on a request
>>    * @rq:		the request being processed
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 9e378d0a0c01..45fc605349da 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1975,7 +1975,7 @@ static void nvme_rdma_complete_timed_out(struct request *rq)
>>   	nvme_rdma_stop_queue(queue);
>>   	if (!blk_mq_request_completed(rq)) {
>>   		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>> -		blk_mq_complete_request(rq);
>> +		blk_mq_complete_request_sync(rq);
>>   	}
>>   	mutex_unlock(&ctrl->teardown_lock);
>>   }
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 8f4f29f18b8c..629b025685d1 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>   	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>   	if (!blk_mq_request_completed(rq)) {
>>   		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>> -		blk_mq_complete_request(rq);
>> +		blk_mq_complete_request_sync(rq);
> 
> Or complete the request in the following way? Then one block layer API
> can be saved:
> 
> 	blk_mq_complete_request_remote(rq);
> 	nvme_complete_rq(rq);
> 
> 
> thanks,
> Ming
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 
> 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-09  4:39 ` Ming Lei
  2020-10-09  5:03   ` Yi Zhang
@ 2020-10-09  8:11   ` Sagi Grimberg
  1 sibling, 0 replies; 26+ messages in thread
From: Sagi Grimberg @ 2020-10-09  8:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-nvme, linux-block, Christoph Hellwig, Keith Busch,
	Yi Zhang, Jens Axboe


>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 8f4f29f18b8c..629b025685d1 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>   	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>   	if (!blk_mq_request_completed(rq)) {
>>   		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>> -		blk_mq_complete_request(rq);
>> +		blk_mq_complete_request_sync(rq);
> 
> Or complete the request in the following way? Then one block layer API
> can be saved:
> 
> 	blk_mq_complete_request_remote(rq);
> 	nvme_complete_rq(rq);

Not sure I follow, how does this work?

Anyways, I think that blk_mq_complete_request_sync is a clean
and useful interface. We can do it in nvme but I don't think
it would be better.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-09  8:09     ` Sagi Grimberg
@ 2020-10-09 13:55       ` Yi Zhang
  2020-10-09 18:29         ` Sagi Grimberg
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Zhang @ 2020-10-09 13:55 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, linux-block, Keith Busch, linux-nvme, Christoph Hellwig

Hi Sagi

On 10/9/20 4:09 PM, Sagi Grimberg wrote:
>> Hi Sagi
>>
>> I applied this patch on block origin/for-next and still can reproduce 
>> it.
>
> That's unexpected, can you try this patch?
> -- 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 629b025685d1..46428ff0b0fc 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct 
> request *rq)
>         /* fence other contexts that may complete the command */
>         mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>         nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
> -       if (!blk_mq_request_completed(rq)) {
> +       if (blk_mq_request_started(rq) && 
> !blk_mq_request_completed(rq)) {
>                 nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>                 blk_mq_complete_request_sync(rq);
>         }
No WARNING with this patch, but the test still hangs and never
finishes; here is the kernel log:

[ 3327.011067] run blktests nvme/012 at 2020-10-09 06:21:59
[ 3327.278169] loop: module loaded
[ 3327.314287] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 3327.346276] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 3327.396746] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
[ 3327.411000] nvme nvme0: creating 12 I/O queues.
[ 3327.416963] nvme nvme0: mapped 12/0/0 default/read/poll queues.
[ 3327.428836] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
[ 3329.403658] XFS (nvme0n1): Mounting V5 Filesystem
[ 3329.417213] XFS (nvme0n1): Ending clean mount
[ 3329.423044] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
[ 3390.069126] nvme nvme0: queue 6: timeout request 0x21 type 4
[ 3390.074797] nvme nvme0: starting error recovery
[ 3390.079362] nvme nvme0: queue 6: timeout request 0x22 type 4
[ 3390.079932] block nvme0n1: no usable path - requeuing I/O
[ 3390.081999] nvme nvme0: Reconnecting in 10 seconds...
[ 3390.095467] block nvme0n1: no usable path - requeuing I/O
[ 3390.100879] block nvme0n1: no usable path - requeuing I/O
[ 3390.106279] block nvme0n1: no usable path - requeuing I/O
[ 3390.111691] block nvme0n1: no usable path - requeuing I/O
[ 3390.117098] block nvme0n1: no usable path - requeuing I/O
[ 3390.122507] block nvme0n1: no usable path - requeuing I/O
[ 3390.127920] block nvme0n1: no usable path - requeuing I/O
[ 3390.133333] block nvme0n1: no usable path - requeuing I/O
[ 3390.138734] block nvme0n1: no usable path - requeuing I/O
[ 3415.669659] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
[ 3415.675682] nvmet: ctrl 2 fatal error occurred!
[ 3419.765738] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[ 3419.771762] nvmet: ctrl 1 fatal error occurred!
[ 3440.000352] nvmet: creating controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
[ 3463.286700] nvme nvme0: queue 0: timeout request 0x0 type 4
[ 3463.292357] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[ 3463.298830] nvme nvme0: failed to connect queue: 0 ret=881
[ 3463.304352] nvme nvme0: Failed reconnect attempt 1
[ 3463.309156] nvme nvme0: Reconnecting in 10 seconds...
[ 3473.527574] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
[ 3473.541438] nvme_ns_head_submit_bio: 47 callbacks suppressed
[ 3473.541445] block nvme0n1: no usable path - requeuing I/O
[ 3473.552521] block nvme0n1: no usable path - requeuing I/O
[ 3473.557923] block nvme0n1: no usable path - requeuing I/O
[ 3473.563319] block nvme0n1: no usable path - requeuing I/O
[ 3473.568739] block nvme0n1: no usable path - requeuing I/O
[ 3473.574678] block nvme0n1: no usable path - requeuing I/O
[ 3473.580068] block nvme0n1: no usable path - requeuing I/O
[ 3473.585461] block nvme0n1: no usable path - requeuing I/O
[ 3473.590861] block nvme0n1: no usable path - requeuing I/O
[ 3473.596249] block nvme0n1: no usable path - requeuing I/O
[ 3473.601880] nvme nvme0: creating 12 I/O queues.
[ 3473.613955] nvme nvme0: mapped 12/0/0 default/read/poll queues.
[ 3473.622106] nvme nvme0: Successfully reconnected (2 attempt)
[ 3531.384170] nvme nvme0: queue 1: timeout request 0x62 type 4
[ 3531.389837] nvme nvme0: starting error recovery
[ 3531.394396] nvme nvme0: queue 7: timeout request 0x61 type 4
[ 3531.394954] nvme_ns_head_submit_bio: 14 callbacks suppressed
[ 3531.394964] block nvme0n1: no usable path - requeuing I/O
[ 3531.411147] block nvme0n1: no usable path - requeuing I/O
[ 3531.416548] block nvme0n1: no usable path - requeuing I/O
[ 3531.421954] block nvme0n1: no usable path - requeuing I/O
[ 3531.427361] block nvme0n1: no usable path - requeuing I/O
[ 3531.432763] block nvme0n1: no usable path - requeuing I/O
[ 3531.438287] block nvme0n1: no usable path - requeuing I/O
[ 3531.443701] block nvme0n1: no usable path - requeuing I/O
[ 3531.449103] block nvme0n1: no usable path - requeuing I/O
[ 3531.454497] block nvme0n1: no usable path - requeuing I/O
[ 3560.568832] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[ 3560.574839] nvmet: ctrl 1 fatal error occurred!


> -- 
>
>>
>> [  724.267865] run blktests nvme/012 at 2020-10-09 00:48:52
>> [  724.518498] loop: module loaded
>> [  724.552880] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [  724.586609] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>> [  724.648898] nvmet: creating controller 1 for subsystem 
>> blktests-subsystem-1 for NQN 
>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>> [  724.663300] nvme nvme0: creating 12 I/O queues.
>> [  724.669826] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>> [  724.681350] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 
>> 127.0.0.1:4420
>> [  726.641605] XFS (nvme0n1): Mounting V5 Filesystem
>> [  726.654637] XFS (nvme0n1): Ending clean mount
>> [  726.660302] xfs filesystem being mounted at /mnt/blktests supports 
>> timestamps until 2038 (0x7fffffff)
>> [  787.042066] nvme nvme0: queue 2: timeout request 0xe type 4
>> [  787.047649] nvme nvme0: starting error recovery
>> [  787.052211] nvme nvme0: queue 4: timeout request 0x61 type 4
>> [  787.053407] block nvme0n1: no usable path - requeuing I/O
>> [  787.054543] nvme nvme0: Reconnecting in 10 seconds...
>> [  787.057877] ------------[ cut here ]------------
>> [  787.063274] block nvme0n1: no usable path - requeuing I/O
>> [  787.068313] refcount_t: underflow; use-after-free.
>> [  787.068370] WARNING: CPU: 3 PID: 30 at lib/refcount.c:28 
>> refcount_warn_saturate+0xa6/0xf0
>> [  787.072941] block nvme0n1: no usable path - requeuing I/O
>> [  787.072945] block nvme0n1: no usable path - requeuing I/O
>> [  787.078333] Modules linked in: loop nvme_tcp nvme_fabrics 
>> nvme_core nvmet_tcp nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 
>> dns_resolver nfs lockd grace fscache sunrpc snd_hda_codec_realtek 
>> snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg 
>> snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer amd64_edac_mod 
>> snd edac_mce_amd soundcore kvm_amd ccp kvm pcspkr irqbypass hp_wmi 
>> sparse_keymap rfkill k10temp wmi_bmof nv_tco i2c_nforce2 acpi_cpufreq 
>> ip_tables xfs nouveau video mxm_wmi i2c_algo_bit drm_kms_helper 
>> mptsas cec scsi_transport_sas ttm mptscsih ata_generic firewire_ohci 
>> pata_acpi firewire_core drm serio_raw mptbase forcedeth crc_itu_t 
>> sata_nv pata_amd wmi floppy
>> [  787.083142] block nvme0n1: no usable path - requeuing I/O
>> [  787.091304] CPU: 3 PID: 30 Comm: kworker/3:0H Not tainted 
>> 5.9.0-rc8.update+ #3
>> [  787.096673] block nvme0n1: no usable path - requeuing I/O
>> [  787.096674] block nvme0n1: no usable path - requeuing I/O
>> [  787.096680] block nvme0n1: no usable path - requeuing I/O
>> [  787.096681] block nvme0n1: no usable path - requeuing I/O
>> [  787.096683] block nvme0n1: no usable path - requeuing I/O
>> [  787.201242] Hardware name: Hewlett-Packard HP xw9400 
>> Workstation/0A1Ch, BIOS 786D6 v04.02 04/21/2009
>> [  787.210366] Workqueue: kblockd blk_mq_timeout_work
>> [  787.215161] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
>> [  787.220295] Code: 05 aa 98 22 01 01 e8 9f cf ad ff 0f 0b c3 80 3d 
>> 98 98 22 01 00 75 95 48 c7 c7 48 54 3e a2 c6 05 88 98 22 01 01 e8 80 
>> cf ad ff <0f> 0b c3 80 3d 77 98 22 01 00 0f 85 72 ff ff ff 48 c7 c7 
>> a0 54 3e
>> [  787.239021] RSP: 0018:ffffb860407b3dc8 EFLAGS: 00010292
>> [  787.244240] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 
>> ffff9d93ddc58d08
>> [  787.251361] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
>> ffff9d93ddc58d00
>> [  787.258480] RBP: ffff9d93dbd6c200 R08: 000000b740ed60e8 R09: 
>> ffffffffa2fffbc4
>> [  787.265612] R10: 0000000000000477 R11: 0000000000028320 R12: 
>> ffff9d93dbd6c2e8
>> [  787.272741] R13: ffff9d93dbea0000 R14: ffffb860407b3e70 R15: 
>> ffff9d93dbea0000
>> [  787.279864] FS:  0000000000000000(0000) GS:ffff9d93ddc40000(0000) 
>> knlGS:0000000000000000
>> [  787.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  787.293695] CR2: 00007f45fd8fa024 CR3: 00000000792dc000 CR4: 
>> 00000000000006e0
>> [  787.300820] Call Trace:
>> [  787.303277]  blk_mq_check_expired+0x109/0x1b0
>> [  787.307638]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>> [  787.312524]  ? blk_poll+0x300/0x300
>> [  787.316022]  blk_mq_timeout_work+0x44/0xe0
>> [  787.320128]  process_one_work+0x1b4/0x370
>> [  787.324137]  worker_thread+0x53/0x3e0
>> [  787.327809]  ? process_one_work+0x370/0x370
>> [  787.332015]  kthread+0x11b/0x140
>> [  787.335245]  ? __kthread_bind_mask+0x60/0x60
>> [  787.339515]  ret_from_fork+0x22/0x30
>> [  787.343094] ---[ end trace 606cf9189379fcfc ]---
>>
>>
>> Best Regards,
>>    Yi Zhang
>>
>>
>> ----- Original Message -----
>> From: "Ming Lei" <ming.lei@redhat.com>
>> To: "Sagi Grimberg" <sagi@grimberg.me>
>> Cc: "Jens Axboe" <axboe@kernel.dk>, "Yi Zhang" <yi.zhang@redhat.com>, 
>> linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, "Keith 
>> Busch" <kbusch@kernel.org>, "Christoph Hellwig" <hch@lst.de>
>> Sent: Friday, October 9, 2020 12:39:38 PM
>> Subject: Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
>>
>> On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
>>> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
>>> needs to complete an aborted request. The timeout handler serializes
>>> against an error recovery sequence that iterates over the inflight
>>> requests and cancels them.
>>>
>>> However, blk_mq_complete_request may defer the completion to a different
>>> core, which breaks that serialization.
>>>
>>> Hence re-introduce blk_mq_complete_request_sync and use that in the
>>> timeout handler to make sure that we don't get a use-after-free
>>> condition.
>>>
>>> This was uncovered by the blktests:
>>> -- 
>>> $ nvme_trtype=tcp ./check nvme/012
>>>
>>> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
>>> [ 2152.799640] loop: module loaded
>>> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>> [ 2152.937889] nvmet: creating controller 1 for subsystem 
>>> blktests-subsystem-1 for NQN 
>>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
>>> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", 
>>> addr 127.0.0.1:4420
>>> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
>>> [ 2154.942432] XFS (nvme0n1): Ending clean mount
>>> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests 
>>> supports timestamps until 2038 (0x7fffffff)
>>> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
>>> [ 2215.199331] nvme nvme0: starting error recovery
>>> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
>>> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
>>> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
>>> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
>>> [ 2215.215502] ------------[ cut here ]------------
>>> [ 2215.215505] refcount_t: underflow; use-after-free.
>>> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 
>>> refcount_warn_saturate+0xa6/0xf0
>>> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
>>> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 
>>> ffff93f37dcd8d08
>>> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
>>> ffff93f37dcd8d00
>>> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: 
>>> ffffffffa7fffbc4
>>> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: 
>>> ffff93f37a8124e8
>>> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: 
>>> ffff93f37a2b0000
>>> [ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) 
>>> knlGS:0000000000000000
>>> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 
>>> 00000000000006e0
>>> [ 2215.215773] Call Trace:
>>> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
>>> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>> [ 2215.215801]  ? blk_poll+0x300/0x300
>>> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
>>> [ 2215.215814]  process_one_work+0x1b4/0x370
>>> [ 2215.215820]  worker_thread+0x53/0x3e0
>>> [ 2215.215825]  ? process_one_work+0x370/0x370
>>> [ 2215.215829]  kthread+0x11b/0x140
>>> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
>>> [ 2215.215841]  ret_from_fork+0x22/0x30
>>> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
>>> -- 
>>>
>>> Reported-by: Yi Zhang <yi.zhang@redhat.com>
>>> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
>>> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>> ---
>>> Yi, would appreciate your Tested-by tag
>>>
>>>   block/blk-mq.c           | 7 +++++++
>>>   drivers/nvme/host/rdma.c | 2 +-
>>>   drivers/nvme/host/tcp.c  | 2 +-
>>>   include/linux/blk-mq.h   | 1 +
>>>   4 files changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index f8e1e759c905..2d154722ef39 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct 
>>> request *rq)
>>>   }
>>>   EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>>>   +void blk_mq_complete_request_sync(struct request *rq)
>>> +{
>>> +    WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
>>> +    rq->q->mq_ops->complete(rq);
>>> +}
>>> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
>>> +
>>>   /**
>>>    * blk_mq_complete_request - end I/O on a request
>>>    * @rq:        the request being processed
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index 9e378d0a0c01..45fc605349da 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -1975,7 +1975,7 @@ static void 
>>> nvme_rdma_complete_timed_out(struct request *rq)
>>>       nvme_rdma_stop_queue(queue);
>>>       if (!blk_mq_request_completed(rq)) {
>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>> -        blk_mq_complete_request(rq);
>>> +        blk_mq_complete_request_sync(rq);
>>>       }
>>>       mutex_unlock(&ctrl->teardown_lock);
>>>   }
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 8f4f29f18b8c..629b025685d1 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct 
>>> request *rq)
>>>       nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>       if (!blk_mq_request_completed(rq)) {
>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>> -        blk_mq_complete_request(rq);
>>> +        blk_mq_complete_request_sync(rq);
>>
>> Or complete the request in the following way? Then one block layer API
>> can be saved:
>>
>>     blk_mq_complete_request_remote(rq);
>>     nvme_complete_rq(rq);
>>
>>
>> thanks,
>> Ming
>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>
>>
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-09 13:55       ` Yi Zhang
@ 2020-10-09 18:29         ` Sagi Grimberg
  2020-10-10  6:08           ` Yi Zhang
  0 siblings, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2020-10-09 18:29 UTC (permalink / raw)
  To: Yi Zhang, Ming Lei
  Cc: Jens Axboe, linux-block, Keith Busch, linux-nvme, Christoph Hellwig



On 10/9/20 6:55 AM, Yi Zhang wrote:
> Hi Sagi
> 
> On 10/9/20 4:09 PM, Sagi Grimberg wrote:
>>> Hi Sagi
>>>
>>> I applied this patch on block origin/for-next and still can reproduce 
>>> it.
>>
>> That's unexpected, can you try this patch?
>> -- 
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 629b025685d1..46428ff0b0fc 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct 
>> request *rq)
>>         /* fence other contexts that may complete the command */
>>         mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>         nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>> -       if (!blk_mq_request_completed(rq)) {
>> +       if (blk_mq_request_started(rq) && 
>> !blk_mq_request_completed(rq)) {
>>                 nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>                 blk_mq_complete_request_sync(rq);
>>         }
> No WARNING with this patch, but the test still hangs and never
> finishes; here is the kernel log:

Thanks Yi,

Good to know there is no warning; that confirms the issue analysis.

As for the test, I'm able to successfully pass this on my VM, and
this test is not designed to generate errors, just fio to
a file-backed namespace. I'm wondering what is causing these
I/O delays that lead to timeouts...

> 
> [ 3327.011067] run blktests nvme/012 at 2020-10-09 06:21:59
> [ 3327.278169] loop: module loaded
> [ 3327.314287] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [ 3327.346276] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> [ 3327.396746] nvmet: creating controller 1 for subsystem 
> blktests-subsystem-1 for NQN 
> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
> [ 3327.411000] nvme nvme0: creating 12 I/O queues.
> [ 3327.416963] nvme nvme0: mapped 12/0/0 default/read/poll queues.
> [ 3327.428836] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 
> 127.0.0.1:4420
> [ 3329.403658] XFS (nvme0n1): Mounting V5 Filesystem
> [ 3329.417213] XFS (nvme0n1): Ending clean mount
> [ 3329.423044] xfs filesystem being mounted at /mnt/blktests supports 
> timestamps until 2038 (0x7fffffff)
> [ 3390.069126] nvme nvme0: queue 6: timeout request 0x21 type 4
> [ 3390.074797] nvme nvme0: starting error recovery
> [ 3390.079362] nvme nvme0: queue 6: timeout request 0x22 type 4
> [ 3390.079932] block nvme0n1: no usable path - requeuing I/O
> [ 3390.081999] nvme nvme0: Reconnecting in 10 seconds...
> [ 3390.095467] block nvme0n1: no usable path - requeuing I/O
> [ 3390.100879] block nvme0n1: no usable path - requeuing I/O
> [ 3390.106279] block nvme0n1: no usable path - requeuing I/O
> [ 3390.111691] block nvme0n1: no usable path - requeuing I/O
> [ 3390.117098] block nvme0n1: no usable path - requeuing I/O
> [ 3390.122507] block nvme0n1: no usable path - requeuing I/O
> [ 3390.127920] block nvme0n1: no usable path - requeuing I/O
> [ 3390.133333] block nvme0n1: no usable path - requeuing I/O
> [ 3390.138734] block nvme0n1: no usable path - requeuing I/O
> [ 3415.669659] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
> [ 3415.675682] nvmet: ctrl 2 fatal error occurred!
> [ 3419.765738] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [ 3419.771762] nvmet: ctrl 1 fatal error occurred!
> [ 3440.000352] nvmet: creating controller 2 for subsystem 
> blktests-subsystem-1 for NQN 
> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
> [ 3463.286700] nvme nvme0: queue 0: timeout request 0x0 type 4
> [ 3463.292357] nvme nvme0: Connect command failed, error wo/DNR bit: 881
> [ 3463.298830] nvme nvme0: failed to connect queue: 0 ret=881
> [ 3463.304352] nvme nvme0: Failed reconnect attempt 1
> [ 3463.309156] nvme nvme0: Reconnecting in 10 seconds...
> [ 3473.527574] nvmet: creating controller 1 for subsystem 
> blktests-subsystem-1 for NQN 
> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
> [ 3473.541438] nvme_ns_head_submit_bio: 47 callbacks suppressed
> [ 3473.541445] block nvme0n1: no usable path - requeuing I/O
> [ 3473.552521] block nvme0n1: no usable path - requeuing I/O
> [ 3473.557923] block nvme0n1: no usable path - requeuing I/O
> [ 3473.563319] block nvme0n1: no usable path - requeuing I/O
> [ 3473.568739] block nvme0n1: no usable path - requeuing I/O
> [ 3473.574678] block nvme0n1: no usable path - requeuing I/O
> [ 3473.580068] block nvme0n1: no usable path - requeuing I/O
> [ 3473.585461] block nvme0n1: no usable path - requeuing I/O
> [ 3473.590861] block nvme0n1: no usable path - requeuing I/O
> [ 3473.596249] block nvme0n1: no usable path - requeuing I/O
> [ 3473.601880] nvme nvme0: creating 12 I/O queues.
> [ 3473.613955] nvme nvme0: mapped 12/0/0 default/read/poll queues.
> [ 3473.622106] nvme nvme0: Successfully reconnected (2 attempt)
> [ 3531.384170] nvme nvme0: queue 1: timeout request 0x62 type 4
> [ 3531.389837] nvme nvme0: starting error recovery
> [ 3531.394396] nvme nvme0: queue 7: timeout request 0x61 type 4
> [ 3531.394954] nvme_ns_head_submit_bio: 14 callbacks suppressed
> [ 3531.394964] block nvme0n1: no usable path - requeuing I/O
> [ 3531.411147] block nvme0n1: no usable path - requeuing I/O
> [ 3531.416548] block nvme0n1: no usable path - requeuing I/O
> [ 3531.421954] block nvme0n1: no usable path - requeuing I/O
> [ 3531.427361] block nvme0n1: no usable path - requeuing I/O
> [ 3531.432763] block nvme0n1: no usable path - requeuing I/O
> [ 3531.438287] block nvme0n1: no usable path - requeuing I/O
> [ 3531.443701] block nvme0n1: no usable path - requeuing I/O
> [ 3531.449103] block nvme0n1: no usable path - requeuing I/O
> [ 3531.454497] block nvme0n1: no usable path - requeuing I/O
> [ 3560.568832] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [ 3560.574839] nvmet: ctrl 1 fatal error occurred!
> 
> 
>> -- 
>>
>>>
>>> [  724.267865] run blktests nvme/012 at 2020-10-09 00:48:52
>>> [  724.518498] loop: module loaded
>>> [  724.552880] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [  724.586609] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>> [  724.648898] nvmet: creating controller 1 for subsystem 
>>> blktests-subsystem-1 for NQN 
>>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>> [  724.663300] nvme nvme0: creating 12 I/O queues.
>>> [  724.669826] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>> [  724.681350] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 
>>> 127.0.0.1:4420
>>> [  726.641605] XFS (nvme0n1): Mounting V5 Filesystem
>>> [  726.654637] XFS (nvme0n1): Ending clean mount
>>> [  726.660302] xfs filesystem being mounted at /mnt/blktests supports 
>>> timestamps until 2038 (0x7fffffff)
>>> [  787.042066] nvme nvme0: queue 2: timeout request 0xe type 4
>>> [  787.047649] nvme nvme0: starting error recovery
>>> [  787.052211] nvme nvme0: queue 4: timeout request 0x61 type 4
>>> [  787.053407] block nvme0n1: no usable path - requeuing I/O
>>> [  787.054543] nvme nvme0: Reconnecting in 10 seconds...
>>> [  787.057877] ------------[ cut here ]------------
>>> [  787.063274] block nvme0n1: no usable path - requeuing I/O
>>> [  787.068313] refcount_t: underflow; use-after-free.
>>> [  787.068370] WARNING: CPU: 3 PID: 30 at lib/refcount.c:28 
>>> refcount_warn_saturate+0xa6/0xf0
>>> [  787.072941] block nvme0n1: no usable path - requeuing I/O
>>> [  787.072945] block nvme0n1: no usable path - requeuing I/O
>>> [  787.078333] Modules linked in: loop nvme_tcp nvme_fabrics 
>>> nvme_core nvmet_tcp nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 
>>> dns_resolver nfs lockd grace fscache sunrpc snd_hda_codec_realtek 
>>> snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg 
>>> snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer amd64_edac_mod 
>>> snd edac_mce_amd soundcore kvm_amd ccp kvm pcspkr irqbypass hp_wmi 
>>> sparse_keymap rfkill k10temp wmi_bmof nv_tco i2c_nforce2 acpi_cpufreq 
>>> ip_tables xfs nouveau video mxm_wmi i2c_algo_bit drm_kms_helper 
>>> mptsas cec scsi_transport_sas ttm mptscsih ata_generic firewire_ohci 
>>> pata_acpi firewire_core drm serio_raw mptbase forcedeth crc_itu_t 
>>> sata_nv pata_amd wmi floppy
>>> [  787.083142] block nvme0n1: no usable path - requeuing I/O
>>> [  787.091304] CPU: 3 PID: 30 Comm: kworker/3:0H Not tainted 
>>> 5.9.0-rc8.update+ #3
>>> [  787.096673] block nvme0n1: no usable path - requeuing I/O
>>> [  787.096674] block nvme0n1: no usable path - requeuing I/O
>>> [  787.096680] block nvme0n1: no usable path - requeuing I/O
>>> [  787.096681] block nvme0n1: no usable path - requeuing I/O
>>> [  787.096683] block nvme0n1: no usable path - requeuing I/O
>>> [  787.201242] Hardware name: Hewlett-Packard HP xw9400 
>>> Workstation/0A1Ch, BIOS 786D6 v04.02 04/21/2009
>>> [  787.210366] Workqueue: kblockd blk_mq_timeout_work
>>> [  787.215161] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
>>> [  787.220295] Code: 05 aa 98 22 01 01 e8 9f cf ad ff 0f 0b c3 80 3d 
>>> 98 98 22 01 00 75 95 48 c7 c7 48 54 3e a2 c6 05 88 98 22 01 01 e8 80 
>>> cf ad ff <0f> 0b c3 80 3d 77 98 22 01 00 0f 85 72 ff ff ff 48 c7 c7 
>>> a0 54 3e
>>> [  787.239021] RSP: 0018:ffffb860407b3dc8 EFLAGS: 00010292
>>> [  787.244240] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 
>>> ffff9d93ddc58d08
>>> [  787.251361] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
>>> ffff9d93ddc58d00
>>> [  787.258480] RBP: ffff9d93dbd6c200 R08: 000000b740ed60e8 R09: 
>>> ffffffffa2fffbc4
>>> [  787.265612] R10: 0000000000000477 R11: 0000000000028320 R12: 
>>> ffff9d93dbd6c2e8
>>> [  787.272741] R13: ffff9d93dbea0000 R14: ffffb860407b3e70 R15: 
>>> ffff9d93dbea0000
>>> [  787.279864] FS:  0000000000000000(0000) GS:ffff9d93ddc40000(0000) 
>>> knlGS:0000000000000000
>>> [  787.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  787.293695] CR2: 00007f45fd8fa024 CR3: 00000000792dc000 CR4: 
>>> 00000000000006e0
>>> [  787.300820] Call Trace:
>>> [  787.303277]  blk_mq_check_expired+0x109/0x1b0
>>> [  787.307638]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>> [  787.312524]  ? blk_poll+0x300/0x300
>>> [  787.316022]  blk_mq_timeout_work+0x44/0xe0
>>> [  787.320128]  process_one_work+0x1b4/0x370
>>> [  787.324137]  worker_thread+0x53/0x3e0
>>> [  787.327809]  ? process_one_work+0x370/0x370
>>> [  787.332015]  kthread+0x11b/0x140
>>> [  787.335245]  ? __kthread_bind_mask+0x60/0x60
>>> [  787.339515]  ret_from_fork+0x22/0x30
>>> [  787.343094] ---[ end trace 606cf9189379fcfc ]---
>>>
>>>
>>> Best Regards,
>>>    Yi Zhang
>>>
>>>
>>> ----- Original Message -----
>>> From: "Ming Lei" <ming.lei@redhat.com>
>>> To: "Sagi Grimberg" <sagi@grimberg.me>
>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "Yi Zhang" <yi.zhang@redhat.com>, 
>>> linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, "Keith 
>>> Busch" <kbusch@kernel.org>, "Christoph Hellwig" <hch@lst.de>
>>> Sent: Friday, October 9, 2020 12:39:38 PM
>>> Subject: Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
>>>
>>> On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
>>>> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
>>>> needs to complete an aborted request. The timeout handler serializes
>>>> against an error recovery sequence that iterates over the inflight
>>>> requests and cancels them.
>>>>
>>>> However, the fact that blk_mq_complete_request may defer to a different
>>>> core that serialization breaks.
>>>>
>>>> Hence re-introduce blk_mq_complete_request_sync and use that in the
>>>> timeout handler to make sure that we don't get a use-after-free
>>>> condition.
>>>>
>>>> This was uncovered by the blktests:
>>>> -- 
>>>> $ nvme_trtype=tcp ./check nvme/012
>>>>
>>>> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
>>>> [ 2152.799640] loop: module loaded
>>>> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>> [ 2152.937889] nvmet: creating controller 1 for subsystem 
>>>> blktests-subsystem-1 for NQN 
>>>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>>> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
>>>> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>>> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", 
>>>> addr 127.0.0.1:4420
>>>> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
>>>> [ 2154.942432] XFS (nvme0n1): Ending clean mount
>>>> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests 
>>>> supports timestamps until 2038 (0x7fffffff)
>>>> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
>>>> [ 2215.199331] nvme nvme0: starting error recovery
>>>> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
>>>> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
>>>> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
>>>> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
>>>> [ 2215.215502] ------------[ cut here ]------------
>>>> [ 2215.215505] refcount_t: underflow; use-after-free.
>>>> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 
>>>> refcount_warn_saturate+0xa6/0xf0
>>>> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
>>>> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 
>>>> ffff93f37dcd8d08
>>>> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
>>>> ffff93f37dcd8d00
>>>> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: 
>>>> ffffffffa7fffbc4
>>>> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: 
>>>> ffff93f37a8124e8
>>>> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: 
>>>> ffff93f37a2b0000
>>>> [ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) 
>>>> knlGS:0000000000000000
>>>> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 
>>>> 00000000000006e0
>>>> [ 2215.215773] Call Trace:
>>>> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
>>>> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>>> [ 2215.215801]  ? blk_poll+0x300/0x300
>>>> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
>>>> [ 2215.215814]  process_one_work+0x1b4/0x370
>>>> [ 2215.215820]  worker_thread+0x53/0x3e0
>>>> [ 2215.215825]  ? process_one_work+0x370/0x370
>>>> [ 2215.215829]  kthread+0x11b/0x140
>>>> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
>>>> [ 2215.215841]  ret_from_fork+0x22/0x30
>>>> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
>>>> -- 
>>>>
>>>> Reported-by: Yi Zhang <yi.zhang@redhat.com>
>>>> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
>>>> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>> ---
>>>> Yi, would appreciate your Tested-by tag
>>>>
>>>>   block/blk-mq.c           | 7 +++++++
>>>>   drivers/nvme/host/rdma.c | 2 +-
>>>>   drivers/nvme/host/tcp.c  | 2 +-
>>>>   include/linux/blk-mq.h   | 1 +
>>>>   4 files changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index f8e1e759c905..2d154722ef39 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct 
>>>> request *rq)
>>>>   }
>>>>   EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>>>>   +void blk_mq_complete_request_sync(struct request *rq)
>>>> +{
>>>> +    WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
>>>> +    rq->q->mq_ops->complete(rq);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
>>>> +
>>>>   /**
>>>>    * blk_mq_complete_request - end I/O on a request
>>>>    * @rq:        the request being processed
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index 9e378d0a0c01..45fc605349da 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -1975,7 +1975,7 @@ static void 
>>>> nvme_rdma_complete_timed_out(struct request *rq)
>>>>       nvme_rdma_stop_queue(queue);
>>>>       if (!blk_mq_request_completed(rq)) {
>>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>> -        blk_mq_complete_request(rq);
>>>> +        blk_mq_complete_request_sync(rq);
>>>>       }
>>>>       mutex_unlock(&ctrl->teardown_lock);
>>>>   }
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 8f4f29f18b8c..629b025685d1 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct 
>>>> request *rq)
>>>>       nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>>       if (!blk_mq_request_completed(rq)) {
>>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>> -        blk_mq_complete_request(rq);
>>>> +        blk_mq_complete_request_sync(rq);
>>>
>>> Or complete the request in the following way? Then one block layer API
>>> can be saved:
>>>
>>>     blk_mq_complete_request_remote(rq);
>>>     nvme_complete_rq(rq);
>>>
>>>
>>> thanks,
>>> Ming
>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>
>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-09 18:29         ` Sagi Grimberg
@ 2020-10-10  6:08           ` Yi Zhang
  2020-10-12  3:59             ` Chao Leng
  0 siblings, 1 reply; 26+ messages in thread
From: Yi Zhang @ 2020-10-10  6:08 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch, Christoph Hellwig



On 10/10/20 2:29 AM, Sagi Grimberg wrote:
>
>
> On 10/9/20 6:55 AM, Yi Zhang wrote:
>> Hi Sagi
>>
>> On 10/9/20 4:09 PM, Sagi Grimberg wrote:
>>>> Hi Sagi
>>>>
>>>> I applied this patch on block origin/for-next and still can 
>>>> reproduce it.
>>>
>>> That's unexpected, can you try this patch?
>>> -- 
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 629b025685d1..46428ff0b0fc 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct 
>>> request *rq)
>>>         /* fence other contexts that may complete the command */
>>>         mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>         nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>> -       if (!blk_mq_request_completed(rq)) {
>>> +       if (blk_mq_request_started(rq) && 
>>> !blk_mq_request_completed(rq)) {
>>>                 nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>                 blk_mq_complete_request_sync(rq);
>>>         }
>> No WARNING with this patch, but the test still hangs and never
>> finishes; here is the kernel log:
>
> Thanks Yi,
>
> Good to know no warning, that confirms the issue analysis.
>
Cool, feel free to add Tested-by: Yi Zhang <yi.zhang@redhat.com>

>> As for the test, I'm able to successfully pass this on my VM, and
>> this test is not designed to generate errors, just fio to
>> a file-backed namespace. I'm wondering what is causing these
>> I/O delays that lead to the timeouts...
>
I just found that all the failures occurred on an AMD machine with 4G of
memory; I'm not sure whether it's related to the low memory.
Let me know if you need any tests for further investigation. :)
>>
>> [ 3327.011067] run blktests nvme/012 at 2020-10-09 06:21:59
>> [ 3327.278169] loop: module loaded
>> [ 3327.314287] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [ 3327.346276] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>> [ 3327.396746] nvmet: creating controller 1 for subsystem 
>> blktests-subsystem-1 for NQN 
>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>> [ 3327.411000] nvme nvme0: creating 12 I/O queues.
>> [ 3327.416963] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>> [ 3327.428836] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 
>> 127.0.0.1:4420
>> [ 3329.403658] XFS (nvme0n1): Mounting V5 Filesystem
>> [ 3329.417213] XFS (nvme0n1): Ending clean mount
>> [ 3329.423044] xfs filesystem being mounted at /mnt/blktests supports 
>> timestamps until 2038 (0x7fffffff)
>> [ 3390.069126] nvme nvme0: queue 6: timeout request 0x21 type 4
>> [ 3390.074797] nvme nvme0: starting error recovery
>> [ 3390.079362] nvme nvme0: queue 6: timeout request 0x22 type 4
>> [ 3390.079932] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.081999] nvme nvme0: Reconnecting in 10 seconds...
>> [ 3390.095467] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.100879] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.106279] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.111691] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.117098] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.122507] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.127920] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.133333] block nvme0n1: no usable path - requeuing I/O
>> [ 3390.138734] block nvme0n1: no usable path - requeuing I/O
>> [ 3415.669659] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
>> [ 3415.675682] nvmet: ctrl 2 fatal error occurred!
>> [ 3419.765738] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>> [ 3419.771762] nvmet: ctrl 1 fatal error occurred!
>> [ 3440.000352] nvmet: creating controller 2 for subsystem 
>> blktests-subsystem-1 for NQN 
>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>> [ 3463.286700] nvme nvme0: queue 0: timeout request 0x0 type 4
>> [ 3463.292357] nvme nvme0: Connect command failed, error wo/DNR bit: 881
>> [ 3463.298830] nvme nvme0: failed to connect queue: 0 ret=881
>> [ 3463.304352] nvme nvme0: Failed reconnect attempt 1
>> [ 3463.309156] nvme nvme0: Reconnecting in 10 seconds...
>> [ 3473.527574] nvmet: creating controller 1 for subsystem 
>> blktests-subsystem-1 for NQN 
>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>> [ 3473.541438] nvme_ns_head_submit_bio: 47 callbacks suppressed
>> [ 3473.541445] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.552521] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.557923] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.563319] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.568739] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.574678] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.580068] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.585461] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.590861] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.596249] block nvme0n1: no usable path - requeuing I/O
>> [ 3473.601880] nvme nvme0: creating 12 I/O queues.
>> [ 3473.613955] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>> [ 3473.622106] nvme nvme0: Successfully reconnected (2 attempt)
>> [ 3531.384170] nvme nvme0: queue 1: timeout request 0x62 type 4
>> [ 3531.389837] nvme nvme0: starting error recovery
>> [ 3531.394396] nvme nvme0: queue 7: timeout request 0x61 type 4
>> [ 3531.394954] nvme_ns_head_submit_bio: 14 callbacks suppressed
>> [ 3531.394964] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.411147] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.416548] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.421954] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.427361] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.432763] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.438287] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.443701] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.449103] block nvme0n1: no usable path - requeuing I/O
>> [ 3531.454497] block nvme0n1: no usable path - requeuing I/O
>> [ 3560.568832] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>> [ 3560.574839] nvmet: ctrl 1 fatal error occurred!
>>
>>
>>> -- 
>>>
>>>>
>>>> [  724.267865] run blktests nvme/012 at 2020-10-09 00:48:52
>>>> [  724.518498] loop: module loaded
>>>> [  724.552880] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>> [  724.586609] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>> [  724.648898] nvmet: creating controller 1 for subsystem 
>>>> blktests-subsystem-1 for NQN 
>>>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>>> [  724.663300] nvme nvme0: creating 12 I/O queues.
>>>> [  724.669826] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>>> [  724.681350] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", 
>>>> addr 127.0.0.1:4420
>>>> [  726.641605] XFS (nvme0n1): Mounting V5 Filesystem
>>>> [  726.654637] XFS (nvme0n1): Ending clean mount
>>>> [  726.660302] xfs filesystem being mounted at /mnt/blktests 
>>>> supports timestamps until 2038 (0x7fffffff)
>>>> [  787.042066] nvme nvme0: queue 2: timeout request 0xe type 4
>>>> [  787.047649] nvme nvme0: starting error recovery
>>>> [  787.052211] nvme nvme0: queue 4: timeout request 0x61 type 4
>>>> [  787.053407] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.054543] nvme nvme0: Reconnecting in 10 seconds...
>>>> [  787.057877] ------------[ cut here ]------------
>>>> [  787.063274] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.068313] refcount_t: underflow; use-after-free.
>>>> [  787.068370] WARNING: CPU: 3 PID: 30 at lib/refcount.c:28 
>>>> refcount_warn_saturate+0xa6/0xf0
>>>> [  787.072941] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.072945] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.078333] Modules linked in: loop nvme_tcp nvme_fabrics 
>>>> nvme_core nvmet_tcp nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 
>>>> dns_resolver nfs lockd grace fscache sunrpc snd_hda_codec_realtek 
>>>> snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg 
>>>> snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer 
>>>> amd64_edac_mod snd edac_mce_amd soundcore kvm_amd ccp kvm pcspkr 
>>>> irqbypass hp_wmi sparse_keymap rfkill k10temp wmi_bmof nv_tco 
>>>> i2c_nforce2 acpi_cpufreq ip_tables xfs nouveau video mxm_wmi 
>>>> i2c_algo_bit drm_kms_helper mptsas cec scsi_transport_sas ttm 
>>>> mptscsih ata_generic firewire_ohci pata_acpi firewire_core drm 
>>>> serio_raw mptbase forcedeth crc_itu_t sata_nv pata_amd wmi floppy
>>>> [  787.083142] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.091304] CPU: 3 PID: 30 Comm: kworker/3:0H Not tainted 
>>>> 5.9.0-rc8.update+ #3
>>>> [  787.096673] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.096674] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.096680] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.096681] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.096683] block nvme0n1: no usable path - requeuing I/O
>>>> [  787.201242] Hardware name: Hewlett-Packard HP xw9400 
>>>> Workstation/0A1Ch, BIOS 786D6 v04.02 04/21/2009
>>>> [  787.210366] Workqueue: kblockd blk_mq_timeout_work
>>>> [  787.215161] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
>>>> [  787.220295] Code: 05 aa 98 22 01 01 e8 9f cf ad ff 0f 0b c3 80 
>>>> 3d 98 98 22 01 00 75 95 48 c7 c7 48 54 3e a2 c6 05 88 98 22 01 01 
>>>> e8 80 cf ad ff <0f> 0b c3 80 3d 77 98 22 01 00 0f 85 72 ff ff ff 48 
>>>> c7 c7 a0 54 3e
>>>> [  787.239021] RSP: 0018:ffffb860407b3dc8 EFLAGS: 00010292
>>>> [  787.244240] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 
>>>> ffff9d93ddc58d08
>>>> [  787.251361] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
>>>> ffff9d93ddc58d00
>>>> [  787.258480] RBP: ffff9d93dbd6c200 R08: 000000b740ed60e8 R09: 
>>>> ffffffffa2fffbc4
>>>> [  787.265612] R10: 0000000000000477 R11: 0000000000028320 R12: 
>>>> ffff9d93dbd6c2e8
>>>> [  787.272741] R13: ffff9d93dbea0000 R14: ffffb860407b3e70 R15: 
>>>> ffff9d93dbea0000
>>>> [  787.279864] FS:  0000000000000000(0000) 
>>>> GS:ffff9d93ddc40000(0000) knlGS:0000000000000000
>>>> [  787.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  787.293695] CR2: 00007f45fd8fa024 CR3: 00000000792dc000 CR4: 
>>>> 00000000000006e0
>>>> [  787.300820] Call Trace:
>>>> [  787.303277]  blk_mq_check_expired+0x109/0x1b0
>>>> [  787.307638]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>>> [  787.312524]  ? blk_poll+0x300/0x300
>>>> [  787.316022]  blk_mq_timeout_work+0x44/0xe0
>>>> [  787.320128]  process_one_work+0x1b4/0x370
>>>> [  787.324137]  worker_thread+0x53/0x3e0
>>>> [  787.327809]  ? process_one_work+0x370/0x370
>>>> [  787.332015]  kthread+0x11b/0x140
>>>> [  787.335245]  ? __kthread_bind_mask+0x60/0x60
>>>> [  787.339515]  ret_from_fork+0x22/0x30
>>>> [  787.343094] ---[ end trace 606cf9189379fcfc ]---
>>>>
>>>>
>>>> Best Regards,
>>>>    Yi Zhang
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Ming Lei" <ming.lei@redhat.com>
>>>> To: "Sagi Grimberg" <sagi@grimberg.me>
>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "Yi Zhang" 
>>>> <yi.zhang@redhat.com>, linux-nvme@lists.infradead.org, 
>>>> linux-block@vger.kernel.org, "Keith Busch" <kbusch@kernel.org>, 
>>>> "Christoph Hellwig" <hch@lst.de>
>>>> Sent: Friday, October 9, 2020 12:39:38 PM
>>>> Subject: Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
>>>>
>>>> On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
>>>>> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
>>>>> needs to complete an aborted request. The timeout handler serializes
>>>>> against an error recovery sequence that iterates over the inflight
>>>>> requests and cancels them.
>>>>>
>>>>> However, the fact that blk_mq_complete_request may defer to a 
>>>>> different
>>>>> core that serialization breaks.
>>>>>
>>>>> Hence re-introduce blk_mq_complete_request_sync and use that in the
>>>>> timeout handler to make sure that we don't get a use-after-free
>>>>> condition.
>>>>>
>>>>> This was uncovered by the blktests:
>>>>> -- 
>>>>> $ nvme_trtype=tcp ./check nvme/012
>>>>>
>>>>> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
>>>>> [ 2152.799640] loop: module loaded
>>>>> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>>> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>>> [ 2152.937889] nvmet: creating controller 1 for subsystem 
>>>>> blktests-subsystem-1 for NQN 
>>>>> nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>>>> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
>>>>> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>>>> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", 
>>>>> addr 127.0.0.1:4420
>>>>> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
>>>>> [ 2154.942432] XFS (nvme0n1): Ending clean mount
>>>>> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests 
>>>>> supports timestamps until 2038 (0x7fffffff)
>>>>> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
>>>>> [ 2215.199331] nvme nvme0: starting error recovery
>>>>> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
>>>>> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
>>>>> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
>>>>> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
>>>>> [ 2215.215502] ------------[ cut here ]------------
>>>>> [ 2215.215505] refcount_t: underflow; use-after-free.
>>>>> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 
>>>>> refcount_warn_saturate+0xa6/0xf0
>>>>> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
>>>>> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: 
>>>>> ffff93f37dcd8d08
>>>>> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
>>>>> ffff93f37dcd8d00
>>>>> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: 
>>>>> ffffffffa7fffbc4
>>>>> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: 
>>>>> ffff93f37a8124e8
>>>>> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: 
>>>>> ffff93f37a2b0000
>>>>> [ 2215.215765] FS:  0000000000000000(0000) 
>>>>> GS:ffff93f37dcc0000(0000) knlGS:0000000000000000
>>>>> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 
>>>>> 00000000000006e0
>>>>> [ 2215.215773] Call Trace:
>>>>> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
>>>>> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>>>> [ 2215.215801]  ? blk_poll+0x300/0x300
>>>>> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
>>>>> [ 2215.215814]  process_one_work+0x1b4/0x370
>>>>> [ 2215.215820]  worker_thread+0x53/0x3e0
>>>>> [ 2215.215825]  ? process_one_work+0x370/0x370
>>>>> [ 2215.215829]  kthread+0x11b/0x140
>>>>> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
>>>>> [ 2215.215841]  ret_from_fork+0x22/0x30
>>>>> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
>>>>> -- 
>>>>>
>>>>> Reported-by: Yi Zhang <yi.zhang@redhat.com>
>>>>> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
>>>>> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
>>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>>> ---
>>>>> Yi, would appreciate your Tested-by tag
>>>>>
>>>>>   block/blk-mq.c           | 7 +++++++
>>>>>   drivers/nvme/host/rdma.c | 2 +-
>>>>>   drivers/nvme/host/tcp.c  | 2 +-
>>>>>   include/linux/blk-mq.h   | 1 +
>>>>>   4 files changed, 10 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index f8e1e759c905..2d154722ef39 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct 
>>>>> request *rq)
>>>>>   }
>>>>>   EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>>>>>   +void blk_mq_complete_request_sync(struct request *rq)
>>>>> +{
>>>>> +    WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
>>>>> +    rq->q->mq_ops->complete(rq);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
>>>>> +
>>>>>   /**
>>>>>    * blk_mq_complete_request - end I/O on a request
>>>>>    * @rq:        the request being processed
>>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>>> index 9e378d0a0c01..45fc605349da 100644
>>>>> --- a/drivers/nvme/host/rdma.c
>>>>> +++ b/drivers/nvme/host/rdma.c
>>>>> @@ -1975,7 +1975,7 @@ static void 
>>>>> nvme_rdma_complete_timed_out(struct request *rq)
>>>>>       nvme_rdma_stop_queue(queue);
>>>>>       if (!blk_mq_request_completed(rq)) {
>>>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>> -        blk_mq_complete_request(rq);
>>>>> +        blk_mq_complete_request_sync(rq);
>>>>>       }
>>>>>       mutex_unlock(&ctrl->teardown_lock);
>>>>>   }
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index 8f4f29f18b8c..629b025685d1 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -2177,7 +2177,7 @@ static void 
>>>>> nvme_tcp_complete_timed_out(struct request *rq)
>>>>>       nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>>>       if (!blk_mq_request_completed(rq)) {
>>>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>> -        blk_mq_complete_request(rq);
>>>>> +        blk_mq_complete_request_sync(rq);
>>>>
>>>> Or complete the request in the following way? Then one block layer API
>>>> can be saved:
>>>>
>>>>     blk_mq_complete_request_remote(rq);
>>>>     nvme_complete_rq(rq);
>>>>
>>>>
>>>> thanks,
>>>> Ming
>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>
>>>
>>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-10  6:08           ` Yi Zhang
@ 2020-10-12  3:59             ` Chao Leng
  2020-10-12  8:13               ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Chao Leng @ 2020-10-12  3:59 UTC (permalink / raw)
  To: Yi Zhang, Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, linux-block, Keith Busch, linux-nvme, Christoph Hellwig



On 2020/10/10 14:08, Yi Zhang wrote:
> 
> 
> On 10/10/20 2:29 AM, Sagi Grimberg wrote:
>>
>>
>> On 10/9/20 6:55 AM, Yi Zhang wrote:
>>> Hi Sagi
>>>
>>> On 10/9/20 4:09 PM, Sagi Grimberg wrote:
>>>>> Hi Sagi
>>>>>
>>>>> I applied this patch on block origin/for-next and still can reproduce it.
>>>>
>>>> That's unexpected, can you try this patch?
>>>> -- 
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 629b025685d1..46428ff0b0fc 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>>>         /* fence other contexts that may complete the command */
>>>>         mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>         nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>> -       if (!blk_mq_request_completed(rq)) {
>>>> +       if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq)) {
>>>>                 nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>                 blk_mq_complete_request_sync(rq);
>>>>         }
This may just reduce the probability. Concurrency between timeout and teardown can still cause the same request
to be handled twice, which is not what we expect.
In the teardown process, deleting the queue timer and cancelling the timeout work after the queues are quiesced may be a better option.
The teardown process cancels all outstanding I/O, so the timer and the timeout work should be cancelled as well.
After the queue is unquiesced, the queue timer will be re-armed when a request is started.
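
A rough, untested sketch of the ordering I have in mind for the teardown
path (the placement inside nvme-tcp and the behaviour of blk_sync_queue()
are from memory, so details may differ):

        blk_mq_quiesce_queue(q);        /* no new requests are started */
        blk_sync_queue(q);              /* del_timer_sync(&q->timeout) and
                                         * cancel_work_sync() on the
                                         * timeout work */
        blk_mq_tagset_busy_iter(set, nvme_cancel_request, ctrl);
        blk_mq_tagset_wait_completed_request(set);
        /* only after this point may the queue be unquiesced again */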
>>> No WARNING with this patch, but the test still hangs and never finishes; here is the kernel log:
>>
>> Thanks Yi,
>>
>> Good to know no warning, that confirms the issue analysis.
>>
> Cool, feel free to add Tested-by: Yi Zhang <yi.zhang@redhat.com>
> 
>> As for the test, I'm able to successfully pass this on my VM, and
>> this test is not designed to generate errors, just fio to
>> a file-backed namespace. I'm wondering what is causing these
>> I/O delays that lead to the timeouts...
>>
> I just found that all the failures occurred on an AMD machine with 4G of memory; I'm not sure whether it's related to the low memory.
> Let me know if you need any tests for further investigation. :)
>>>
>>> [ 3327.011067] run blktests nvme/012 at 2020-10-09 06:21:59
>>> [ 3327.278169] loop: module loaded
>>> [ 3327.314287] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [ 3327.346276] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>> [ 3327.396746] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>> [ 3327.411000] nvme nvme0: creating 12 I/O queues.
>>> [ 3327.416963] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>> [ 3327.428836] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
>>> [ 3329.403658] XFS (nvme0n1): Mounting V5 Filesystem
>>> [ 3329.417213] XFS (nvme0n1): Ending clean mount
>>> [ 3329.423044] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
>>> [ 3390.069126] nvme nvme0: queue 6: timeout request 0x21 type 4
>>> [ 3390.074797] nvme nvme0: starting error recovery
>>> [ 3390.079362] nvme nvme0: queue 6: timeout request 0x22 type 4
>>> [ 3390.079932] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.081999] nvme nvme0: Reconnecting in 10 seconds...
>>> [ 3390.095467] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.100879] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.106279] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.111691] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.117098] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.122507] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.127920] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.133333] block nvme0n1: no usable path - requeuing I/O
>>> [ 3390.138734] block nvme0n1: no usable path - requeuing I/O
>>> [ 3415.669659] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
>>> [ 3415.675682] nvmet: ctrl 2 fatal error occurred!
>>> [ 3419.765738] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>>> [ 3419.771762] nvmet: ctrl 1 fatal error occurred!
>>> [ 3440.000352] nvmet: creating controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>> [ 3463.286700] nvme nvme0: queue 0: timeout request 0x0 type 4
>>> [ 3463.292357] nvme nvme0: Connect command failed, error wo/DNR bit: 881
>>> [ 3463.298830] nvme nvme0: failed to connect queue: 0 ret=881
>>> [ 3463.304352] nvme nvme0: Failed reconnect attempt 1
>>> [ 3463.309156] nvme nvme0: Reconnecting in 10 seconds...
>>> [ 3473.527574] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>> [ 3473.541438] nvme_ns_head_submit_bio: 47 callbacks suppressed
>>> [ 3473.541445] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.552521] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.557923] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.563319] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.568739] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.574678] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.580068] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.585461] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.590861] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.596249] block nvme0n1: no usable path - requeuing I/O
>>> [ 3473.601880] nvme nvme0: creating 12 I/O queues.
>>> [ 3473.613955] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>> [ 3473.622106] nvme nvme0: Successfully reconnected (2 attempt)
>>> [ 3531.384170] nvme nvme0: queue 1: timeout request 0x62 type 4
>>> [ 3531.389837] nvme nvme0: starting error recovery
>>> [ 3531.394396] nvme nvme0: queue 7: timeout request 0x61 type 4
>>> [ 3531.394954] nvme_ns_head_submit_bio: 14 callbacks suppressed
>>> [ 3531.394964] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.411147] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.416548] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.421954] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.427361] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.432763] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.438287] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.443701] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.449103] block nvme0n1: no usable path - requeuing I/O
>>> [ 3531.454497] block nvme0n1: no usable path - requeuing I/O
>>> [ 3560.568832] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>>> [ 3560.574839] nvmet: ctrl 1 fatal error occurred!
>>>
>>>
>>>> -- 
>>>>
>>>>>
>>>>> [  724.267865] run blktests nvme/012 at 2020-10-09 00:48:52
>>>>> [  724.518498] loop: module loaded
>>>>> [  724.552880] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>>> [  724.586609] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>>> [  724.648898] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>>>> [  724.663300] nvme nvme0: creating 12 I/O queues.
>>>>> [  724.669826] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>>>> [  724.681350] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
>>>>> [  726.641605] XFS (nvme0n1): Mounting V5 Filesystem
>>>>> [  726.654637] XFS (nvme0n1): Ending clean mount
>>>>> [  726.660302] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
>>>>> [  787.042066] nvme nvme0: queue 2: timeout request 0xe type 4
>>>>> [  787.047649] nvme nvme0: starting error recovery
>>>>> [  787.052211] nvme nvme0: queue 4: timeout request 0x61 type 4
>>>>> [  787.053407] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.054543] nvme nvme0: Reconnecting in 10 seconds...
>>>>> [  787.057877] ------------[ cut here ]------------
>>>>> [  787.063274] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.068313] refcount_t: underflow; use-after-free.
>>>>> [  787.068370] WARNING: CPU: 3 PID: 30 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
>>>>> [  787.072941] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.072945] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.078333] Modules linked in: loop nvme_tcp nvme_fabrics nvme_core nvmet_tcp nvmet rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer amd64_edac_mod snd edac_mce_amd soundcore kvm_amd ccp kvm pcspkr irqbypass hp_wmi sparse_keymap rfkill k10temp wmi_bmof nv_tco i2c_nforce2 acpi_cpufreq ip_tables xfs nouveau video mxm_wmi i2c_algo_bit drm_kms_helper mptsas cec scsi_transport_sas ttm mptscsih ata_generic firewire_ohci pata_acpi firewire_core drm serio_raw mptbase forcedeth crc_itu_t sata_nv pata_amd wmi floppy
>>>>> [  787.083142] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.091304] CPU: 3 PID: 30 Comm: kworker/3:0H Not tainted 5.9.0-rc8.update+ #3
>>>>> [  787.096673] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.096674] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.096680] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.096681] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.096683] block nvme0n1: no usable path - requeuing I/O
>>>>> [  787.201242] Hardware name: Hewlett-Packard HP xw9400 Workstation/0A1Ch, BIOS 786D6 v04.02 04/21/2009
>>>>> [  787.210366] Workqueue: kblockd blk_mq_timeout_work
>>>>> [  787.215161] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
>>>>> [  787.220295] Code: 05 aa 98 22 01 01 e8 9f cf ad ff 0f 0b c3 80 3d 98 98 22 01 00 75 95 48 c7 c7 48 54 3e a2 c6 05 88 98 22 01 01 e8 80 cf ad ff <0f> 0b c3 80 3d 77 98 22 01 00 0f 85 72 ff ff ff 48 c7 c7 a0 54 3e
>>>>> [  787.239021] RSP: 0018:ffffb860407b3dc8 EFLAGS: 00010292
>>>>> [  787.244240] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff9d93ddc58d08
>>>>> [  787.251361] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d93ddc58d00
>>>>> [  787.258480] RBP: ffff9d93dbd6c200 R08: 000000b740ed60e8 R09: ffffffffa2fffbc4
>>>>> [  787.265612] R10: 0000000000000477 R11: 0000000000028320 R12: ffff9d93dbd6c2e8
>>>>> [  787.272741] R13: ffff9d93dbea0000 R14: ffffb860407b3e70 R15: ffff9d93dbea0000
>>>>> [  787.279864] FS:  0000000000000000(0000) GS:ffff9d93ddc40000(0000) knlGS:0000000000000000
>>>>> [  787.287948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  787.293695] CR2: 00007f45fd8fa024 CR3: 00000000792dc000 CR4: 00000000000006e0
>>>>> [  787.300820] Call Trace:
>>>>> [  787.303277]  blk_mq_check_expired+0x109/0x1b0
>>>>> [  787.307638]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>>>> [  787.312524]  ? blk_poll+0x300/0x300
>>>>> [  787.316022]  blk_mq_timeout_work+0x44/0xe0
>>>>> [  787.320128]  process_one_work+0x1b4/0x370
>>>>> [  787.324137]  worker_thread+0x53/0x3e0
>>>>> [  787.327809]  ? process_one_work+0x370/0x370
>>>>> [  787.332015]  kthread+0x11b/0x140
>>>>> [  787.335245]  ? __kthread_bind_mask+0x60/0x60
>>>>> [  787.339515]  ret_from_fork+0x22/0x30
>>>>> [  787.343094] ---[ end trace 606cf9189379fcfc ]---
>>>>>
>>>>>
>>>>> Best Regards,
>>>>>    Yi Zhang
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Ming Lei" <ming.lei@redhat.com>
>>>>> To: "Sagi Grimberg" <sagi@grimberg.me>
>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "Yi Zhang" <yi.zhang@redhat.com>, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, "Keith Busch" <kbusch@kernel.org>, "Christoph Hellwig" <hch@lst.de>
>>>>> Sent: Friday, October 9, 2020 12:39:38 PM
>>>>> Subject: Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
>>>>>
>>>>> On Thu, Oct 08, 2020 at 02:37:50PM -0700, Sagi Grimberg wrote:
>>>>>> In nvme-tcp and nvme-rdma, the timeout handler in certain conditions
>>>>>> needs to complete an aborted request. The timeout handler serializes
>>>>>> against an error recovery sequence that iterates over the inflight
>>>>>> requests and cancels them.
>>>>>>
>>>>>> However, the fact that blk_mq_complete_request may defer to a different
>>>>>> core that serialization breaks.
>>>>>>
>>>>>> Hence re-introduce blk_mq_complete_request_sync and use that in the
>>>>>> timeout handler to make sure that we don't get a use-after-free
>>>>>> condition.
>>>>>>
>>>>>> This was uncovered by the blktests:
>>>>>> -- 
>>>>>> $ nvme_trtype=tcp ./check nvme/012
>>>>>>
>>>>>> [ 2152.546343] run blktests nvme/012 at 2020-10-08 05:59:03
>>>>>> [ 2152.799640] loop: module loaded
>>>>>> [ 2152.835222] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>>>> [ 2152.860697] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>>>> [ 2152.937889] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:e544149126a24145abeeb6129b0ec52c.
>>>>>> [ 2152.952085] nvme nvme0: creating 12 I/O queues.
>>>>>> [ 2152.958042] nvme nvme0: mapped 12/0/0 default/read/poll queues.
>>>>>> [ 2152.969948] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
>>>>>> [ 2154.927953] XFS (nvme0n1): Mounting V5 Filesystem
>>>>>> [ 2154.942432] XFS (nvme0n1): Ending clean mount
>>>>>> [ 2154.948183] xfs filesystem being mounted at /mnt/blktests supports timestamps until 2038 (0x7fffffff)
>>>>>> [ 2215.193645] nvme nvme0: queue 7: timeout request 0x11 type 4
>>>>>> [ 2215.199331] nvme nvme0: starting error recovery
>>>>>> [ 2215.203890] nvme nvme0: queue 7: timeout request 0x12 type 4
>>>>>> [ 2215.204483] block nvme0n1: no usable path - requeuing I/O
>>>>>> [ 2215.214976] block nvme0n1: no usable path - requeuing I/O
>>>>>> [ 2215.215495] nvme nvme0: Reconnecting in 10 seconds...
>>>>>> [ 2215.215502] ------------[ cut here ]------------
>>>>>> [ 2215.215505] refcount_t: underflow; use-after-free.
>>>>>> [ 2215.215617] WARNING: CPU: 6 PID: 45 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
>>>>>> [ 2215.215745] RSP: 0018:ffffb71b80837dc8 EFLAGS: 00010292
>>>>>> [ 2215.215750] RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffff93f37dcd8d08
>>>>>> [ 2215.215753] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff93f37dcd8d00
>>>>>> [ 2215.215755] RBP: ffff93f37a812400 R08: 00000203c5221fce R09: ffffffffa7fffbc4
>>>>>> [ 2215.215758] R10: 0000000000000477 R11: 000000000002835c R12: ffff93f37a8124e8
>>>>>> [ 2215.215761] R13: ffff93f37a2b0000 R14: ffffb71b80837e70 R15: ffff93f37a2b0000
>>>>>> [ 2215.215765] FS:  0000000000000000(0000) GS:ffff93f37dcc0000(0000) knlGS:0000000000000000
>>>>>> [ 2215.215768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [ 2215.215771] CR2: 00005637b4137028 CR3: 000000007c1be000 CR4: 00000000000006e0
>>>>>> [ 2215.215773] Call Trace:
>>>>>> [ 2215.215784]  blk_mq_check_expired+0x109/0x1b0
>>>>>> [ 2215.215797]  blk_mq_queue_tag_busy_iter+0x1b8/0x330
>>>>>> [ 2215.215801]  ? blk_poll+0x300/0x300
>>>>>> [ 2215.215806]  blk_mq_timeout_work+0x44/0xe0
>>>>>> [ 2215.215814]  process_one_work+0x1b4/0x370
>>>>>> [ 2215.215820]  worker_thread+0x53/0x3e0
>>>>>> [ 2215.215825]  ? process_one_work+0x370/0x370
>>>>>> [ 2215.215829]  kthread+0x11b/0x140
>>>>>> [ 2215.215834]  ? __kthread_bind_mask+0x60/0x60
>>>>>> [ 2215.215841]  ret_from_fork+0x22/0x30
>>>>>> [ 2215.215847] ---[ end trace 7d137e36e23c0d19 ]---
>>>>>> -- 
>>>>>>
>>>>>> Reported-by: Yi Zhang <yi.zhang@redhat.com>
>>>>>> Fixes: 236187c4ed19 ("nvme-tcp: fix timeout handler")
>>>>>> Fixes: 0475a8dcbcee ("nvme-rdma: fix timeout handler")
>>>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>>>> ---
>>>>>> Yi, would appreciate your Tested-by tag
>>>>>>
>>>>>>   block/blk-mq.c           | 7 +++++++
>>>>>>   drivers/nvme/host/rdma.c | 2 +-
>>>>>>   drivers/nvme/host/tcp.c  | 2 +-
>>>>>>   include/linux/blk-mq.h   | 1 +
>>>>>>   4 files changed, 10 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index f8e1e759c905..2d154722ef39 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -729,6 +729,13 @@ bool blk_mq_complete_request_remote(struct request *rq)
>>>>>>   }
>>>>>>   EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
>>>>>>   +void blk_mq_complete_request_sync(struct request *rq)
>>>>>> +{
>>>>>> +    WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
>>>>>> +    rq->q->mq_ops->complete(rq);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync);
>>>>>> +
>>>>>>   /**
>>>>>>    * blk_mq_complete_request - end I/O on a request
>>>>>>    * @rq:        the request being processed
>>>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>>>> index 9e378d0a0c01..45fc605349da 100644
>>>>>> --- a/drivers/nvme/host/rdma.c
>>>>>> +++ b/drivers/nvme/host/rdma.c
>>>>>> @@ -1975,7 +1975,7 @@ static void nvme_rdma_complete_timed_out(struct request *rq)
>>>>>>       nvme_rdma_stop_queue(queue);
>>>>>>       if (!blk_mq_request_completed(rq)) {
>>>>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>>> -        blk_mq_complete_request(rq);
>>>>>> +        blk_mq_complete_request_sync(rq);
>>>>>>       }
>>>>>>       mutex_unlock(&ctrl->teardown_lock);
>>>>>>   }
>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>> index 8f4f29f18b8c..629b025685d1 100644
>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>> @@ -2177,7 +2177,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>>>>>       nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>>>>       if (!blk_mq_request_completed(rq)) {
>>>>>>           nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>>> -        blk_mq_complete_request(rq);
>>>>>> +        blk_mq_complete_request_sync(rq);
>>>>>
>>>>> Or complete the request in the following way? Then one block layer API
>>>>> can be saved:
>>>>>
>>>>>     blk_mq_complete_request_remote(rq);
>>>>>     nvme_complete_rq(rq);
>>>>>
>>>>>
>>>>> thanks,
>>>>> Ming
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Linux-nvme mailing list
>>>>> Linux-nvme@lists.infradead.org
>>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>>
>>>>
>>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-12  3:59             ` Chao Leng
@ 2020-10-12  8:13               ` Ming Lei
  2020-10-12  9:06                 ` Chao Leng
  2020-10-13 22:31                 ` Sagi Grimberg
  0 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2020-10-12  8:13 UTC (permalink / raw)
  To: Chao Leng
  Cc: Yi Zhang, Sagi Grimberg, Jens Axboe, linux-block, Keith Busch,
	linux-nvme, Christoph Hellwig

On Mon, Oct 12, 2020 at 11:59:21AM +0800, Chao Leng wrote:
> 
> 
> On 2020/10/10 14:08, Yi Zhang wrote:
> > 
> > 
> > On 10/10/20 2:29 AM, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 10/9/20 6:55 AM, Yi Zhang wrote:
> > > > Hi Sagi
> > > > 
> > > > On 10/9/20 4:09 PM, Sagi Grimberg wrote:
> > > > > > Hi Sagi
> > > > > > 
> > > > > > I applied this patch on block origin/for-next and still can reproduce it.
> > > > > 
> > > > > That's unexpected, can you try this patch?
> > > > > -- 
> > > > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > > > > index 629b025685d1..46428ff0b0fc 100644
> > > > > --- a/drivers/nvme/host/tcp.c
> > > > > +++ b/drivers/nvme/host/tcp.c
> > > > > @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
> > > > >         /* fence other contexts that may complete the command */
> > > > >         mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
> > > > >         nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
> > > > > -       if (!blk_mq_request_completed(rq)) {
> > > > > +       if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq)) {
> > > > >                 nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> > > > >                 blk_mq_complete_request_sync(rq);
> > > > >         }
> This may just reduce the probability. Concurrency between timeout and teardown can still cause the same request
> to be handled twice, which is not what we expect.

That is right. Unlike SCSI, NVMe does not use atomic request completion, so
a request may be completed/freed from both the timeout path and nvme_cancel_request().

.teardown_lock may still cover the race with Sagi's patch, because the teardown
actually cancels requests synchronously.
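
To make that serialization explicit (my sketch of the two paths, not code
taken from the thread):

        /*
         * timeout handler                     error recovery / teardown
         * ---------------                     -------------------------
         * mutex_lock(&teardown_lock);         mutex_lock(&teardown_lock);
         * stop the queue                      quiesce + stop the queues
         * complete the timed-out rq           cancel all requests and wait
         *   synchronously                       for the completions
         * mutex_unlock(&teardown_lock);       mutex_unlock(&teardown_lock);
         *
         * Both sides finish their completions before dropping the lock, so
         * neither can observe the other's request half-completed.
         */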

> In the teardown process, deleting the queue timer and cancelling the timeout work after the queues are quiesced may be a better option.

That seems like a better solution, given it is aligned with NVMe PCI's reset
handling. nvme_sync_queues() could be called in nvme_tcp_teardown_io_queues() to
avoid this race.
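
For reference, my recollection of what that helper boils down to (paraphrased
from memory, details may differ from the current tree):

        void nvme_sync_queues(struct nvme_ctrl *ctrl)
        {
                struct nvme_ns *ns;

                down_read(&ctrl->namespaces_rwsem);
                list_for_each_entry(ns, &ctrl->namespaces, list)
                        blk_sync_queue(ns->queue);      /* del_timer_sync() +
                                                         * cancel_work_sync()
                                                         * on the timeout work */
                up_read(&ctrl->namespaces_rwsem);
        }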


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-12  8:13               ` Ming Lei
@ 2020-10-12  9:06                 ` Chao Leng
  2020-10-13 22:36                   ` Sagi Grimberg
  2020-10-13 22:31                 ` Sagi Grimberg
  1 sibling, 1 reply; 26+ messages in thread
From: Chao Leng @ 2020-10-12  9:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Yi Zhang, Sagi Grimberg, Jens Axboe, linux-block, Keith Busch,
	linux-nvme, Christoph Hellwig



On 2020/10/12 16:13, Ming Lei wrote:
> On Mon, Oct 12, 2020 at 11:59:21AM +0800, Chao Leng wrote:
>>
>>
>> On 2020/10/10 14:08, Yi Zhang wrote:
>>>
>>>
>>> On 10/10/20 2:29 AM, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 10/9/20 6:55 AM, Yi Zhang wrote:
>>>>> Hi Sagi
>>>>>
>>>>> On 10/9/20 4:09 PM, Sagi Grimberg wrote:
>>>>>>> Hi Sagi
>>>>>>>
>>>>>>> I applied this patch on block origin/for-next and still can reproduce it.
>>>>>>
>>>>>> That's unexpected, can you try this patch?
>>>>>> -- 
>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>> index 629b025685d1..46428ff0b0fc 100644
>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>>>>>          /* fence other contexts that may complete the command */
>>>>>>          mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>>          nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>>>> -       if (!blk_mq_request_completed(rq)) {
>>>>>> +       if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq)) {
>>>>>>                  nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>>>                  blk_mq_complete_request_sync(rq);
>>>>>>          }
>> This may just reduce the probability. The concurrency of timeout and teardown will cause the same request
>> be treated repeatly, this is not we expected.
> 
> That is right, not like SCSI, NVME doesn't apply atomic request completion, so
> request may be completed/freed from both timeout & nvme_cancel_request().
> 
> .teardown_lock still may cover the race with Sagi's patch because teardown
> actually cancels requests in sync style.
In extreme scenarios, the request may be already retry success(rq state change to inflight).
Timeout processing may wrongly stop the queue and abort the request.
teardown_lock serialize the process of timeout and teardown, but do not avoid the race.
It might not be safe.
> 
>> In the teardown process, after quiesced queues delete the timer and cancel the timeout work maybe a better option.
> 
> Seems better solution, given it is aligned with NVME PCI's reset
> handling. nvme_sync_queues() may be called in nvme_tcp_teardown_io_queues() to
> avoid this race.
> 
> 
> Thanks,
> Ming
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-12  8:13               ` Ming Lei
  2020-10-12  9:06                 ` Chao Leng
@ 2020-10-13 22:31                 ` Sagi Grimberg
  2020-10-14  1:25                   ` Chao Leng
  1 sibling, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2020-10-13 22:31 UTC (permalink / raw)
  To: Ming Lei, Chao Leng
  Cc: Yi Zhang, Jens Axboe, linux-block, Keith Busch, linux-nvme,
	Christoph Hellwig


>>>>>> -- 
>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>> index 629b025685d1..46428ff0b0fc 100644
>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>>>>>          /* fence other contexts that may complete the command */
>>>>>>          mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>>          nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>>>> -       if (!blk_mq_request_completed(rq)) {
>>>>>> +       if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq)) {
>>>>>>                  nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>>>                  blk_mq_complete_request_sync(rq);
>>>>>>          }
>> This may just reduce the probability. The concurrency of timeout and teardown will cause the same request
>> be treated repeatly, this is not we expected.
> 
> That is right, not like SCSI, NVME doesn't apply atomic request completion, so
> request may be completed/freed from both timeout & nvme_cancel_request().
> 
> .teardown_lock still may cover the race with Sagi's patch because teardown
> actually cancels requests in sync style.
> 
>> In the teardown process, after quiesced queues delete the timer and cancel the timeout work maybe a better option.
> 
> Seems better solution, given it is aligned with NVME PCI's reset
> handling. nvme_sync_queues() may be called in nvme_tcp_teardown_io_queues() to
> avoid this race.

We can't call nvme_sync_queues, that flushes the timeout work that is
serializing with the teardown, it will deadlock.
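
Roughly, the cycle would be (just a sketch, assuming nvme_sync_queues() is
called from the teardown path that already holds teardown_lock):

  teardown (holds teardown_lock)          timeout work (already running)
  ------------------------------          ------------------------------
  nvme_sync_queues(ctrl)                  blk_mq_timeout_work()
    blk_sync_queue(ns->queue)               nvme_tcp_timeout(rq)
      cancel_work_sync(&q->timeout_work)      nvme_tcp_complete_timed_out(rq)
        /* waits for the timeout work */        mutex_lock(teardown_lock)
                                                /* waits for teardown -> deadlock */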

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-12  9:06                 ` Chao Leng
@ 2020-10-13 22:36                   ` Sagi Grimberg
  2020-10-14  1:08                     ` Ming Lei
  2020-10-14  1:32                     ` Chao Leng
  0 siblings, 2 replies; 26+ messages in thread
From: Sagi Grimberg @ 2020-10-13 22:36 UTC (permalink / raw)
  To: Chao Leng, Ming Lei
  Cc: Yi Zhang, Jens Axboe, linux-block, Keith Busch, linux-nvme,
	Christoph Hellwig


>>> This may just reduce the probability. The concurrency of timeout and 
>>> teardown will cause the same request
>>> be treated repeatly, this is not we expected.
>>
>> That is right, not like SCSI, NVME doesn't apply atomic request 
>> completion, so
>> request may be completed/freed from both timeout & nvme_cancel_request().
>>
>> .teardown_lock still may cover the race with Sagi's patch because 
>> teardown
>> actually cancels requests in sync style.
> In extreme scenarios, the request may be already retry success(rq state 
> change to inflight).
> Timeout processing may wrongly stop the queue and abort the request.
> teardown_lock serialize the process of timeout and teardown, but do not 
> avoid the race.
> It might not be safe.

Not sure I understand the scenario you are describing.

what do you mean by "In extreme scenarios, the request may be already 
retry success(rq state change to inflight)"?

What will retry the request? only when the host will reconnect
the request will be retried.

We can call nvme_sync_queues in the last part of the teardown, but
I still don't understand the race here.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-13 22:36                   ` Sagi Grimberg
@ 2020-10-14  1:08                     ` Ming Lei
  2020-10-14  1:37                       ` Chao Leng
  2020-10-14  3:34                       ` Ming Lei
  2020-10-14  1:32                     ` Chao Leng
  1 sibling, 2 replies; 26+ messages in thread
From: Ming Lei @ 2020-10-14  1:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chao Leng, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
> 
> > > > This may just reduce the probability. The concurrency of timeout
> > > > and teardown will cause the same request
> > > > be treated repeatly, this is not we expected.
> > > 
> > > That is right, not like SCSI, NVME doesn't apply atomic request
> > > completion, so
> > > request may be completed/freed from both timeout & nvme_cancel_request().
> > > 
> > > .teardown_lock still may cover the race with Sagi's patch because
> > > teardown
> > > actually cancels requests in sync style.
> > In extreme scenarios, the request may be already retry success(rq state
> > change to inflight).
> > Timeout processing may wrongly stop the queue and abort the request.
> > teardown_lock serialize the process of timeout and teardown, but do not
> > avoid the race.
> > It might not be safe.
> 
> Not sure I understand the scenario you are describing.
> 
> what do you mean by "In extreme scenarios, the request may be already retry
> success(rq state change to inflight)"?
> 
> What will retry the request? only when the host will reconnect
> the request will be retried.
> 
> We can call nvme_sync_queues in the last part of the teardown, but
> I still don't understand the race here.

Not like SCSI, NVME doesn't complete request atomically, so double
completion/free can be done from both timeout & nvme_cancel_request()(via teardown).

Given request is completed remotely or asynchronously in the two code paths,
the teardown_lock can't protect the case.

One solution is to apply the introduced blk_mq_complete_request_sync()
in both two code paths.
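
For the first option, the cancel side would roughly look like this
(illustrative sketch only, assuming nvme_cancel_request() keeps its current
shape):

	static bool nvme_cancel_request(struct request *req, void *data, bool reserved)
	{
		/* don't abort one completed request */
		if (blk_mq_request_completed(req))
			return true;

		nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
		/* complete in the caller's context instead of deferring to a remote CPU */
		blk_mq_complete_request_sync(req);
		return true;
	}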

Another candidate is to use nvme_sync_queues() before teardown, such as
the following change:

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d6a3e1487354..dc3561ca0074 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1909,6 +1909,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_start_freeze(ctrl);
 	nvme_stop_queues(ctrl);
+	nvme_sync_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
 	if (ctrl->tagset) {
 		blk_mq_tagset_busy_iter(ctrl->tagset,
@@ -2171,14 +2172,11 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
 
-	/* fence other contexts that may complete the command */
-	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
 	if (!blk_mq_request_completed(rq)) {
 		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
 		blk_mq_complete_request(rq);
 	}
-	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static enum blk_eh_timer_return


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-13 22:31                 ` Sagi Grimberg
@ 2020-10-14  1:25                   ` Chao Leng
  0 siblings, 0 replies; 26+ messages in thread
From: Chao Leng @ 2020-10-14  1:25 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Yi Zhang, Jens Axboe, linux-block, Keith Busch, linux-nvme,
	Christoph Hellwig



On 2020/10/14 6:31, Sagi Grimberg wrote:
> 
>>>>>>> -- 
>>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>>> index 629b025685d1..46428ff0b0fc 100644
>>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>>> @@ -2175,7 +2175,7 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>>>>>>>          /* fence other contexts that may complete the command */
>>>>>>>          mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>>>          nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>>>>>>> -       if (!blk_mq_request_completed(rq)) {
>>>>>>> +       if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq)) {
>>>>>>>                  nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>>>>>>>                  blk_mq_complete_request_sync(rq);
>>>>>>>          }
>>> This may just reduce the probability. The concurrency of timeout and teardown will cause the same request
>>> be treated repeatly, this is not we expected.
>>
>> That is right, not like SCSI, NVME doesn't apply atomic request completion, so
>> request may be completed/freed from both timeout & nvme_cancel_request().
>>
>> .teardown_lock still may cover the race with Sagi's patch because teardown
>> actually cancels requests in sync style.
>>
>>> In the teardown process, after quiesced queues delete the timer and cancel the timeout work maybe a better option.
>>
>> Seems better solution, given it is aligned with NVME PCI's reset
>> handling. nvme_sync_queues() may be called in nvme_tcp_teardown_io_queues() to
>> avoid this race.
> 
> We can't call nvme_sync_queues, that flushes the timeout work that is
> serializing with the teardown, it will deadlock.
Yes, the teardown_lock is not needed if we sync the queues. We are testing the patch with nvme over roce.
> .

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-13 22:36                   ` Sagi Grimberg
  2020-10-14  1:08                     ` Ming Lei
@ 2020-10-14  1:32                     ` Chao Leng
  1 sibling, 0 replies; 26+ messages in thread
From: Chao Leng @ 2020-10-14  1:32 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Yi Zhang, Jens Axboe, linux-block, Keith Busch, linux-nvme,
	Christoph Hellwig



On 2020/10/14 6:36, Sagi Grimberg wrote:
> 
>>>> This may just reduce the probability. The concurrency of timeout and teardown will cause the same request
>>>> be treated repeatly, this is not we expected.
>>>
>>> That is right, not like SCSI, NVME doesn't apply atomic request completion, so
>>> request may be completed/freed from both timeout & nvme_cancel_request().
>>>
>>> .teardown_lock still may cover the race with Sagi's patch because teardown
>>> actually cancels requests in sync style.
>> In extreme scenarios, the request may be already retry success(rq state change to inflight).
>> Timeout processing may wrongly stop the queue and abort the request.
>> teardown_lock serialize the process of timeout and teardown, but do not avoid the race.
>> It might not be safe.
> 
> Not sure I understand the scenario you are describing.
> 
> what do you mean by "In extreme scenarios, the request may be already retry success(rq state change to inflight)"?
> 
> What will retry the request? only when the host will reconnect
> the request will be retried.
If an irq interrupts the timeout work and pauses it for a long time (more than 100ms),
the reconnect may already have succeeded and started requests. The timeout work then
continues to run and wrongly stops the queue and cancels the request.
The probability of this happening is very low.
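
Roughly the interleaving would be (hypothetical timing, just to illustrate):

  timeout work                            error recovery / reconnect
  ------------                            --------------------------
  nvme_tcp_timeout(rq) starts
  ... interrupted, paused >100ms ...
                                          teardown cancels rq
                                          reconnect succeeds
                                          rq is retried: state back to in-flight
  nvme_tcp_complete_timed_out(rq)
    nvme_tcp_stop_queue(...)              /* wrongly stops a live queue */
    blk_mq_complete_request_sync(rq)      /* aborts the retried request */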
> 
> We can call nvme_sync_queues in the last part of the teardown, but
> I still don't understand the race here.
> .

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  1:08                     ` Ming Lei
@ 2020-10-14  1:37                       ` Chao Leng
  2020-10-14  2:02                         ` Ming Lei
  2020-10-14  3:34                       ` Ming Lei
  1 sibling, 1 reply; 26+ messages in thread
From: Chao Leng @ 2020-10-14  1:37 UTC (permalink / raw)
  To: Ming Lei, Sagi Grimberg
  Cc: Jens Axboe, Yi Zhang, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig

rdma also needs this patch.
We have been testing this patch with nvme over roce for a few days, and it works well so far.

On 2020/10/14 9:08, Ming Lei wrote:
> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
>>
>>>>> This may just reduce the probability. The concurrency of timeout
>>>>> and teardown will cause the same request
>>>>> be treated repeatly, this is not we expected.
>>>>
>>>> That is right, not like SCSI, NVME doesn't apply atomic request
>>>> completion, so
>>>> request may be completed/freed from both timeout & nvme_cancel_request().
>>>>
>>>> .teardown_lock still may cover the race with Sagi's patch because
>>>> teardown
>>>> actually cancels requests in sync style.
>>> In extreme scenarios, the request may be already retry success(rq state
>>> change to inflight).
>>> Timeout processing may wrongly stop the queue and abort the request.
>>> teardown_lock serialize the process of timeout and teardown, but do not
>>> avoid the race.
>>> It might not be safe.
>>
>> Not sure I understand the scenario you are describing.
>>
>> what do you mean by "In extreme scenarios, the request may be already retry
>> success(rq state change to inflight)"?
>>
>> What will retry the request? only when the host will reconnect
>> the request will be retried.
>>
>> We can call nvme_sync_queues in the last part of the teardown, but
>> I still don't understand the race here.
> 
> Not like SCSI, NVME doesn't complete request atomically, so double
> completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
> 
> Given request is completed remotely or asynchronously in the two code paths,
> the teardown_lock can't protect the case.
> 
> One solution is to apply the introduced blk_mq_complete_request_sync()
> in both two code paths.
> 
> Another candidate is to use nvme_sync_queues() before teardown, such as
> the following change:
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index d6a3e1487354..dc3561ca0074 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1909,6 +1909,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>   	blk_mq_quiesce_queue(ctrl->admin_q);
>   	nvme_start_freeze(ctrl);
>   	nvme_stop_queues(ctrl);
> +	nvme_sync_queues(ctrl);
nvme_sync_queues now syncs both the io queues and the admin queue, so we may need something like this:

/* drain any running timeout work/timers on the namespace queues */
void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_sync_queue(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}
EXPORT_SYMBOL_GPL(nvme_sync_io_queues);

/* sync the io queues and the admin queue */
void nvme_sync_queues(struct nvme_ctrl *ctrl)
{
	nvme_sync_io_queues(ctrl);

	if (ctrl->admin_q)
		blk_sync_queue(ctrl->admin_q);
}
EXPORT_SYMBOL_GPL(nvme_sync_queues);
>   	nvme_tcp_stop_io_queues(ctrl);
>   	if (ctrl->tagset) {
>   		blk_mq_tagset_busy_iter(ctrl->tagset,
> @@ -2171,14 +2172,11 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>   	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
>   	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
>   
> -	/* fence other contexts that may complete the command */
> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
>   	if (!blk_mq_request_completed(rq)) {
>   		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>   		blk_mq_complete_request(rq);
>   	}
> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   }
>   
>   static enum blk_eh_timer_return
> 
> 
> Thanks,
> Ming
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  1:37                       ` Chao Leng
@ 2020-10-14  2:02                         ` Ming Lei
  2020-10-14  2:32                           ` Chao Leng
  2020-10-14  2:41                           ` Chao Leng
  0 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2020-10-14  2:02 UTC (permalink / raw)
  To: Chao Leng
  Cc: Sagi Grimberg, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On Wed, Oct 14, 2020 at 09:37:07AM +0800, Chao Leng wrote:
> rdma also need do this patch.
> We do test this patch with nvme over roce a few days, now work well.
> 
> On 2020/10/14 9:08, Ming Lei wrote:
> > On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
> > > 
> > > > > > This may just reduce the probability. The concurrency of timeout
> > > > > > and teardown will cause the same request
> > > > > > be treated repeatly, this is not we expected.
> > > > > 
> > > > > That is right, not like SCSI, NVME doesn't apply atomic request
> > > > > completion, so
> > > > > request may be completed/freed from both timeout & nvme_cancel_request().
> > > > > 
> > > > > .teardown_lock still may cover the race with Sagi's patch because
> > > > > teardown
> > > > > actually cancels requests in sync style.
> > > > In extreme scenarios, the request may be already retry success(rq state
> > > > change to inflight).
> > > > Timeout processing may wrongly stop the queue and abort the request.
> > > > teardown_lock serialize the process of timeout and teardown, but do not
> > > > avoid the race.
> > > > It might not be safe.
> > > 
> > > Not sure I understand the scenario you are describing.
> > > 
> > > what do you mean by "In extreme scenarios, the request may be already retry
> > > success(rq state change to inflight)"?
> > > 
> > > What will retry the request? only when the host will reconnect
> > > the request will be retried.
> > > 
> > > We can call nvme_sync_queues in the last part of the teardown, but
> > > I still don't understand the race here.
> > 
> > Not like SCSI, NVME doesn't complete request atomically, so double
> > completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
> > 
> > Given request is completed remotely or asynchronously in the two code paths,
> > the teardown_lock can't protect the case.
> > 
> > One solution is to apply the introduced blk_mq_complete_request_sync()
> > in both two code paths.
> > 
> > Another candidate is to use nvme_sync_queues() before teardown, such as
> > the following change:
> > 
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index d6a3e1487354..dc3561ca0074 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -1909,6 +1909,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
> >   	blk_mq_quiesce_queue(ctrl->admin_q);
> >   	nvme_start_freeze(ctrl);
> >   	nvme_stop_queues(ctrl);
> > +	nvme_sync_queues(ctrl);
> nvme_sync_queues now sync io queues and admin queues, so we may need like this:
> nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> {
> 	down_read(&ctrl->namespaces_rwsem);
> 	list_for_each_entry(ns, &ctrl->namespaces, list)
> 		blk_sync_queue(ns->queue);
> 	up_read(&ctrl->namespaces_rwsem);
> }

Looks not necessary to do that, because admin queue is quiesced in
nvme_tcp_teardown_io_queues(), so it is safe to sync admin queue here too.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  2:02                         ` Ming Lei
@ 2020-10-14  2:32                           ` Chao Leng
  2020-10-14  2:41                           ` Chao Leng
  1 sibling, 0 replies; 26+ messages in thread
From: Chao Leng @ 2020-10-14  2:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig



On 2020/10/14 10:02, Ming Lei wrote:
> On Wed, Oct 14, 2020 at 09:37:07AM +0800, Chao Leng wrote:
>> rdma also need do this patch.
>> We do test this patch with nvme over roce a few days, now work well.
>>
>> On 2020/10/14 9:08, Ming Lei wrote:
>>> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
>>>>
>>>>>>> This may just reduce the probability. The concurrency of timeout
>>>>>>> and teardown will cause the same request
>>>>>>> be treated repeatly, this is not we expected.
>>>>>>
>>>>>> That is right, not like SCSI, NVME doesn't apply atomic request
>>>>>> completion, so
>>>>>> request may be completed/freed from both timeout & nvme_cancel_request().
>>>>>>
>>>>>> .teardown_lock still may cover the race with Sagi's patch because
>>>>>> teardown
>>>>>> actually cancels requests in sync style.
>>>>> In extreme scenarios, the request may be already retry success(rq state
>>>>> change to inflight).
>>>>> Timeout processing may wrongly stop the queue and abort the request.
>>>>> teardown_lock serialize the process of timeout and teardown, but do not
>>>>> avoid the race.
>>>>> It might not be safe.
>>>>
>>>> Not sure I understand the scenario you are describing.
>>>>
>>>> what do you mean by "In extreme scenarios, the request may be already retry
>>>> success(rq state change to inflight)"?
>>>>
>>>> What will retry the request? only when the host will reconnect
>>>> the request will be retried.
>>>>
>>>> We can call nvme_sync_queues in the last part of the teardown, but
>>>> I still don't understand the race here.
>>>
>>> Not like SCSI, NVME doesn't complete request atomically, so double
>>> completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
>>>
>>> Given request is completed remotely or asynchronously in the two code paths,
>>> the teardown_lock can't protect the case.
>>>
>>> One solution is to apply the introduced blk_mq_complete_request_sync()
>>> in both two code paths.
>>>
>>> Another candidate is to use nvme_sync_queues() before teardown, such as
>>> the following change:
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index d6a3e1487354..dc3561ca0074 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1909,6 +1909,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>    	blk_mq_quiesce_queue(ctrl->admin_q);
>>>    	nvme_start_freeze(ctrl);
>>>    	nvme_stop_queues(ctrl);
>>> +	nvme_sync_queues(ctrl);
>> nvme_sync_queues now sync io queues and admin queues, so we may need like this:
>> nvme_sync_io_queues(struct nvme_ctrl *ctrl)
>> {
>> 	down_read(&ctrl->namespaces_rwsem);
>> 	list_for_each_entry(ns, &ctrl->namespaces, list)
>> 		blk_sync_queue(ns->queue);
>> 	up_read(&ctrl->namespaces_rwsem);
>> }
> 
> Looks not necessary to do that, because admin queue is quiesced in
> nvme_tcp_teardown_io_queues(), so it is safe to sync admin queue here too.
Syncing the admin queue is not necessary here; it should be done when tearing down the admin queue.
And rdma does not quiesce the admin queue when tearing down the io queues,
so a separate nvme_sync_io_queues() may be a better choice.
> 
> 
> Thanks,
> Ming
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  2:02                         ` Ming Lei
  2020-10-14  2:32                           ` Chao Leng
@ 2020-10-14  2:41                           ` Chao Leng
  1 sibling, 0 replies; 26+ messages in thread
From: Chao Leng @ 2020-10-14  2:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig



On 2020/10/14 10:02, Ming Lei wrote:
> On Wed, Oct 14, 2020 at 09:37:07AM +0800, Chao Leng wrote:
>> rdma also need do this patch.
>> We do test this patch with nvme over roce a few days, now work well.
>>
>> On 2020/10/14 9:08, Ming Lei wrote:
>>> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
>>>>
>>>>>>> This may just reduce the probability. The concurrency of timeout
>>>>>>> and teardown will cause the same request
>>>>>>> be treated repeatly, this is not we expected.
>>>>>>
>>>>>> That is right, not like SCSI, NVME doesn't apply atomic request
>>>>>> completion, so
>>>>>> request may be completed/freed from both timeout & nvme_cancel_request().
>>>>>>
>>>>>> .teardown_lock still may cover the race with Sagi's patch because
>>>>>> teardown
>>>>>> actually cancels requests in sync style.
>>>>> In extreme scenarios, the request may be already retry success(rq state
>>>>> change to inflight).
>>>>> Timeout processing may wrongly stop the queue and abort the request.
>>>>> teardown_lock serialize the process of timeout and teardown, but do not
>>>>> avoid the race.
>>>>> It might not be safe.
>>>>
>>>> Not sure I understand the scenario you are describing.
>>>>
>>>> what do you mean by "In extreme scenarios, the request may be already retry
>>>> success(rq state change to inflight)"?
>>>>
>>>> What will retry the request? only when the host will reconnect
>>>> the request will be retried.
>>>>
>>>> We can call nvme_sync_queues in the last part of the teardown, but
>>>> I still don't understand the race here.
>>>
>>> Not like SCSI, NVME doesn't complete request atomically, so double
>>> completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
>>>
>>> Given request is completed remotely or asynchronously in the two code paths,
>>> the teardown_lock can't protect the case.
>>>
>>> One solution is to apply the introduced blk_mq_complete_request_sync()
>>> in both two code paths.
>>>
>>> Another candidate is to use nvme_sync_queues() before teardown, such as
>>> the following change:
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index d6a3e1487354..dc3561ca0074 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1909,6 +1909,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>    	blk_mq_quiesce_queue(ctrl->admin_q);
>>>    	nvme_start_freeze(ctrl);
>>>    	nvme_stop_queues(ctrl);
>>> +	nvme_sync_queues(ctrl);
>> nvme_sync_queues now sync io queues and admin queues, so we may need like this:
>> nvme_sync_io_queues(struct nvme_ctrl *ctrl)
>> {
>> 	down_read(&ctrl->namespaces_rwsem);
>> 	list_for_each_entry(ns, &ctrl->namespaces, list)
>> 		blk_sync_queue(ns->queue);
>> 	up_read(&ctrl->namespaces_rwsem);
>> }
> 
> Looks not necessary to do that, because admin queue is quiesced in
> nvme_tcp_teardown_io_queues(), so it is safe to sync admin queue here too.
Syncing the admin queue is not necessary here; it should be done when tearing down the admin queue.
And rdma does not quiesce the admin queue when tearing down the io queues,
so a separate nvme_sync_io_queues() may be a better choice.
> 
> 
> Thanks,
> Ming
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  1:08                     ` Ming Lei
  2020-10-14  1:37                       ` Chao Leng
@ 2020-10-14  3:34                       ` Ming Lei
  2020-10-14  9:39                         ` Chao Leng
  1 sibling, 1 reply; 26+ messages in thread
From: Ming Lei @ 2020-10-14  3:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chao Leng, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On Wed, Oct 14, 2020 at 09:08:28AM +0800, Ming Lei wrote:
> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
> > 
> > > > > This may just reduce the probability. The concurrency of timeout
> > > > > and teardown will cause the same request
> > > > > be treated repeatly, this is not we expected.
> > > > 
> > > > That is right, not like SCSI, NVME doesn't apply atomic request
> > > > completion, so
> > > > request may be completed/freed from both timeout & nvme_cancel_request().
> > > > 
> > > > .teardown_lock still may cover the race with Sagi's patch because
> > > > teardown
> > > > actually cancels requests in sync style.
> > > In extreme scenarios, the request may be already retry success(rq state
> > > change to inflight).
> > > Timeout processing may wrongly stop the queue and abort the request.
> > > teardown_lock serialize the process of timeout and teardown, but do not
> > > avoid the race.
> > > It might not be safe.
> > 
> > Not sure I understand the scenario you are describing.
> > 
> > what do you mean by "In extreme scenarios, the request may be already retry
> > success(rq state change to inflight)"?
> > 
> > What will retry the request? only when the host will reconnect
> > the request will be retried.
> > 
> > We can call nvme_sync_queues in the last part of the teardown, but
> > I still don't understand the race here.
> 
> Not like SCSI, NVME doesn't complete request atomically, so double
> completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
> 
> Given request is completed remotely or asynchronously in the two code paths,
> the teardown_lock can't protect the case.

Thinking of the issue further, the race shouldn't be between timeout and
teardown.

Both nvme_cancel_request() and nvme_tcp_complete_timed_out() are called
with .teardown_lock, and both check if the request is completed before
calling blk_mq_complete_request() which marks the request as COMPLETE state.
So the request shouldn't be double-freed in the two code paths.
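
Condensed, the pattern both paths follow is (simplified sketch; in the cancel
path the lock is held by the teardown around blk_mq_tagset_busy_iter()):

	/* caller holds to_tcp_ctrl(ctrl)->teardown_lock */
	if (!blk_mq_request_completed(rq)) {	/* rq not yet MQ_RQ_COMPLETE */
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		blk_mq_complete_request(rq);	/* marks the request COMPLETE */
	}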

Another possible reason is a race between timeout and normal completion (we
fail fast pending requests after the ctrl state is updated to CONNECTING).

Yi, can you try the following patch and see if the issue is fixed?

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d6a3e1487354..fab9220196bd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1886,7 +1886,6 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 		bool remove)
 {
-	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_tcp_stop_queue(ctrl, 0);
 	if (ctrl->admin_tagset) {
@@ -1897,15 +1896,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 	if (remove)
 		blk_mq_unquiesce_queue(ctrl->admin_q);
 	nvme_tcp_destroy_admin_queue(ctrl, remove);
-	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 		bool remove)
 {
-	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	if (ctrl->queue_count <= 1)
-		goto out;
+		return;
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_start_freeze(ctrl);
 	nvme_stop_queues(ctrl);
@@ -1918,8 +1915,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	if (remove)
 		nvme_start_queues(ctrl);
 	nvme_tcp_destroy_io_queues(ctrl, remove);
-out:
-	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
@@ -2030,11 +2025,11 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
 
 	nvme_stop_keep_alive(ctrl);
+
+	mutex_lock(&tcp_ctrl->teardown_lock);
 	nvme_tcp_teardown_io_queues(ctrl, false);
-	/* unquiesce to fail fast pending requests */
-	nvme_start_queues(ctrl);
 	nvme_tcp_teardown_admin_queue(ctrl, false);
-	blk_mq_unquiesce_queue(ctrl->admin_q);
+	mutex_unlock(&tcp_ctrl->teardown_lock);
 
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
 		/* state change failure is ok if we started ctrl delete */
@@ -2043,6 +2038,9 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	/* drain pending timeout for avoiding race with normal completion */
+	nvme_sync_queues(ctrl);
+
 	nvme_tcp_reconnect_or_remove(ctrl);
 }
 
@@ -2051,6 +2049,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
 	cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
 	cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
 
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	nvme_tcp_teardown_io_queues(ctrl, shutdown);
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	if (shutdown)
@@ -2058,6 +2057,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
 	else
 		nvme_disable_ctrl(ctrl);
 	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
@@ -2080,6 +2080,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
 		return;
 	}
 
+	/* drain pending timeout for avoiding race with normal completion */
+	nvme_sync_queues(ctrl);
+
 	if (nvme_tcp_setup_ctrl(ctrl, false))
 		goto out_fail;
 

Thanks,
Ming


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  3:34                       ` Ming Lei
@ 2020-10-14  9:39                         ` Chao Leng
  2020-10-14  9:56                           ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Chao Leng @ 2020-10-14  9:39 UTC (permalink / raw)
  To: Ming Lei, Sagi Grimberg
  Cc: Jens Axboe, Yi Zhang, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig



On 2020/10/14 11:34, Ming Lei wrote:
> On Wed, Oct 14, 2020 at 09:08:28AM +0800, Ming Lei wrote:
>> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
>>>
>>>>>> This may just reduce the probability. The concurrency of timeout
>>>>>> and teardown will cause the same request
>>>>>> be treated repeatly, this is not we expected.
>>>>>
>>>>> That is right, not like SCSI, NVME doesn't apply atomic request
>>>>> completion, so
>>>>> request may be completed/freed from both timeout & nvme_cancel_request().
>>>>>
>>>>> .teardown_lock still may cover the race with Sagi's patch because
>>>>> teardown
>>>>> actually cancels requests in sync style.
>>>> In extreme scenarios, the request may be already retry success(rq state
>>>> change to inflight).
>>>> Timeout processing may wrongly stop the queue and abort the request.
>>>> teardown_lock serialize the process of timeout and teardown, but do not
>>>> avoid the race.
>>>> It might not be safe.
>>>
>>> Not sure I understand the scenario you are describing.
>>>
>>> what do you mean by "In extreme scenarios, the request may be already retry
>>> success(rq state change to inflight)"?
>>>
>>> What will retry the request? only when the host will reconnect
>>> the request will be retried.
>>>
>>> We can call nvme_sync_queues in the last part of the teardown, but
>>> I still don't understand the race here.
>>
>> Not like SCSI, NVME doesn't complete request atomically, so double
>> completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
>>
>> Given request is completed remotely or asynchronously in the two code paths,
>> the teardown_lock can't protect the case.
> 
> Thinking of the issue further, the race shouldn't be between timeout and
> teardown.
> 
> Both nvme_cancel_request() and nvme_tcp_complete_timed_out() are called
> with .teardown_lock, and both check if the request is completed before
> calling blk_mq_complete_request() which marks the request as COMPLETE state.
> So the request shouldn't be double-freed in the two code paths.
> 
> Another possible reason is that between timeout and normal completion(fail
> fast pending requests after ctrl state is updated to CONNECTING).
> 
> Yi, can you try the following patch and see if the issue is fixed?
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index d6a3e1487354..fab9220196bd 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1886,7 +1886,6 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
>   static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
>   		bool remove)
>   {
> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   	blk_mq_quiesce_queue(ctrl->admin_q);
>   	nvme_tcp_stop_queue(ctrl, 0);
>   	if (ctrl->admin_tagset) {
> @@ -1897,15 +1896,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
>   	if (remove)
>   		blk_mq_unquiesce_queue(ctrl->admin_q);
>   	nvme_tcp_destroy_admin_queue(ctrl, remove);
> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   }
>   
>   static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>   		bool remove)
>   {
> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   	if (ctrl->queue_count <= 1)
> -		goto out;
> +		return;
>   	blk_mq_quiesce_queue(ctrl->admin_q);
>   	nvme_start_freeze(ctrl);
>   	nvme_stop_queues(ctrl);
> @@ -1918,8 +1915,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>   	if (remove)
>   		nvme_start_queues(ctrl);
>   	nvme_tcp_destroy_io_queues(ctrl, remove);
> -out:
> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   }
>   
>   static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
> @@ -2030,11 +2025,11 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>   
>   	nvme_stop_keep_alive(ctrl);
> +
> +	mutex_lock(&tcp_ctrl->teardown_lock);
>   	nvme_tcp_teardown_io_queues(ctrl, false);
> -	/* unquiesce to fail fast pending requests */
> -	nvme_start_queues(ctrl);
>   	nvme_tcp_teardown_admin_queue(ctrl, false);
> -	blk_mq_unquiesce_queue(ctrl->admin_q);
Deleting blk_mq_unquiesce_queue will cause a bug that may make reconnect fail.
Deleting nvme_start_queues may cause another bug.
> +	mutex_unlock(&tcp_ctrl->teardown_lock);
>   
>   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
>   		/* state change failure is ok if we started ctrl delete */
> @@ -2043,6 +2038,9 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   		return;
>   	}
>   
> +	/* drain pending timeout for avoiding race with normal completion */
> +	nvme_sync_queues(ctrl);
> +
>   	nvme_tcp_reconnect_or_remove(ctrl);
>   }
>   
> @@ -2051,6 +2049,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
>   	cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
>   	cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
>   
> +	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   	nvme_tcp_teardown_io_queues(ctrl, shutdown);
>   	blk_mq_quiesce_queue(ctrl->admin_q);
>   	if (shutdown)
> @@ -2058,6 +2057,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
>   	else
>   		nvme_disable_ctrl(ctrl);
>   	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
> +	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>   }
>   
>   static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
> @@ -2080,6 +2080,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>   		return;
>   	}
>   
> +	/* drain pending timeout for avoiding race with normal completion */
> +	nvme_sync_queues(ctrl);
> +
>   	if (nvme_tcp_setup_ctrl(ctrl, false))
>   		goto out_fail;
>   
> 
> Thanks,
> Ming
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  9:39                         ` Chao Leng
@ 2020-10-14  9:56                           ` Ming Lei
  2020-10-15  6:05                             ` Chao Leng
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2020-10-14  9:56 UTC (permalink / raw)
  To: Chao Leng
  Cc: Sagi Grimberg, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On Wed, Oct 14, 2020 at 05:39:12PM +0800, Chao Leng wrote:
> 
> 
> On 2020/10/14 11:34, Ming Lei wrote:
> > On Wed, Oct 14, 2020 at 09:08:28AM +0800, Ming Lei wrote:
> > > On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
> > > > 
> > > > > > > This may just reduce the probability. The concurrency of timeout
> > > > > > > and teardown will cause the same request
> > > > > > > be treated repeatly, this is not we expected.
> > > > > > 
> > > > > > That is right, not like SCSI, NVME doesn't apply atomic request
> > > > > > completion, so
> > > > > > request may be completed/freed from both timeout & nvme_cancel_request().
> > > > > > 
> > > > > > .teardown_lock still may cover the race with Sagi's patch because
> > > > > > teardown
> > > > > > actually cancels requests in sync style.
> > > > > In extreme scenarios, the request may be already retry success(rq state
> > > > > change to inflight).
> > > > > Timeout processing may wrongly stop the queue and abort the request.
> > > > > teardown_lock serialize the process of timeout and teardown, but do not
> > > > > avoid the race.
> > > > > It might not be safe.
> > > > 
> > > > Not sure I understand the scenario you are describing.
> > > > 
> > > > what do you mean by "In extreme scenarios, the request may be already retry
> > > > success(rq state change to inflight)"?
> > > > 
> > > > What will retry the request? only when the host will reconnect
> > > > the request will be retried.
> > > > 
> > > > We can call nvme_sync_queues in the last part of the teardown, but
> > > > I still don't understand the race here.
> > > 
> > > Not like SCSI, NVME doesn't complete request atomically, so double
> > > completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
> > > 
> > > Given request is completed remotely or asynchronously in the two code paths,
> > > the teardown_lock can't protect the case.
> > 
> > Thinking of the issue further, the race shouldn't be between timeout and
> > teardown.
> > 
> > Both nvme_cancel_request() and nvme_tcp_complete_timed_out() are called
> > with .teardown_lock, and both check if the request is completed before
> > calling blk_mq_complete_request() which marks the request as COMPLETE state.
> > So the request shouldn't be double-freed in the two code paths.
> > 
> > Another possible reason is that between timeout and normal completion(fail
> > fast pending requests after ctrl state is updated to CONNECTING).
> > 
> > Yi, can you try the following patch and see if the issue is fixed?
> > 
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index d6a3e1487354..fab9220196bd 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -1886,7 +1886,6 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
> >   static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
> >   		bool remove)
> >   {
> > -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
> >   	blk_mq_quiesce_queue(ctrl->admin_q);
> >   	nvme_tcp_stop_queue(ctrl, 0);
> >   	if (ctrl->admin_tagset) {
> > @@ -1897,15 +1896,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
> >   	if (remove)
> >   		blk_mq_unquiesce_queue(ctrl->admin_q);
> >   	nvme_tcp_destroy_admin_queue(ctrl, remove);
> > -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
> >   }
> >   static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
> >   		bool remove)
> >   {
> > -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
> >   	if (ctrl->queue_count <= 1)
> > -		goto out;
> > +		return;
> >   	blk_mq_quiesce_queue(ctrl->admin_q);
> >   	nvme_start_freeze(ctrl);
> >   	nvme_stop_queues(ctrl);
> > @@ -1918,8 +1915,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
> >   	if (remove)
> >   		nvme_start_queues(ctrl);
> >   	nvme_tcp_destroy_io_queues(ctrl, remove);
> > -out:
> > -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
> >   }
> >   static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
> > @@ -2030,11 +2025,11 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >   	nvme_stop_keep_alive(ctrl);
> > +
> > +	mutex_lock(&tcp_ctrl->teardown_lock);
> >   	nvme_tcp_teardown_io_queues(ctrl, false);
> > -	/* unquiesce to fail fast pending requests */
> > -	nvme_start_queues(ctrl);
> >   	nvme_tcp_teardown_admin_queue(ctrl, false);
> > -	blk_mq_unquiesce_queue(ctrl->admin_q);
> Delete blk_mq_unquiesce_queue will cause a bug which may cause reconnect failed.
> Delete nvme_start_queues may cause another bug.

nvme_tcp_setup_ctrl() will re-start io and admin queue, and only .connect_q
and .fabrics_q are required during reconnect.

So can you explain in detail about the bug?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-14  9:56                           ` Ming Lei
@ 2020-10-15  6:05                             ` Chao Leng
  2020-10-15  7:50                               ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Chao Leng @ 2020-10-15  6:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig



On 2020/10/14 17:56, Ming Lei wrote:
> On Wed, Oct 14, 2020 at 05:39:12PM +0800, Chao Leng wrote:
>>
>>
>> On 2020/10/14 11:34, Ming Lei wrote:
>>> On Wed, Oct 14, 2020 at 09:08:28AM +0800, Ming Lei wrote:
>>>> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
>>>>>
>>>>>>>> This may just reduce the probability. The concurrency of timeout
>>>>>>>> and teardown will cause the same request
>>>>>>>> be treated repeatly, this is not we expected.
>>>>>>>
>>>>>>> That is right, not like SCSI, NVME doesn't apply atomic request
>>>>>>> completion, so
>>>>>>> request may be completed/freed from both timeout & nvme_cancel_request().
>>>>>>>
>>>>>>> .teardown_lock still may cover the race with Sagi's patch because
>>>>>>> teardown
>>>>>>> actually cancels requests in sync style.
>>>>>> In extreme scenarios, the request may be already retry success(rq state
>>>>>> change to inflight).
>>>>>> Timeout processing may wrongly stop the queue and abort the request.
>>>>>> teardown_lock serialize the process of timeout and teardown, but do not
>>>>>> avoid the race.
>>>>>> It might not be safe.
>>>>>
>>>>> Not sure I understand the scenario you are describing.
>>>>>
>>>>> what do you mean by "In extreme scenarios, the request may be already retry
>>>>> success(rq state change to inflight)"?
>>>>>
>>>>> What will retry the request? only when the host will reconnect
>>>>> the request will be retried.
>>>>>
>>>>> We can call nvme_sync_queues in the last part of the teardown, but
>>>>> I still don't understand the race here.
>>>>
>>>> Not like SCSI, NVME doesn't complete request atomically, so double
>>>> completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
>>>>
>>>> Given request is completed remotely or asynchronously in the two code paths,
>>>> the teardown_lock can't protect the case.
>>>
>>> Thinking of the issue further, the race shouldn't be between timeout and
>>> teardown.
>>>
>>> Both nvme_cancel_request() and nvme_tcp_complete_timed_out() are called
>>> with .teardown_lock, and both check if the request is completed before
>>> calling blk_mq_complete_request() which marks the request as COMPLETE state.
>>> So the request shouldn't be double-freed in the two code paths.
>>>
>>> Another possible reason is that between timeout and normal completion(fail
>>> fast pending requests after ctrl state is updated to CONNECTING).
>>>
>>> Yi, can you try the following patch and see if the issue is fixed?
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index d6a3e1487354..fab9220196bd 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1886,7 +1886,6 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
>>>    static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
>>>    		bool remove)
>>>    {
>>> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>    	blk_mq_quiesce_queue(ctrl->admin_q);
>>>    	nvme_tcp_stop_queue(ctrl, 0);
>>>    	if (ctrl->admin_tagset) {
>>> @@ -1897,15 +1896,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
>>>    	if (remove)
>>>    		blk_mq_unquiesce_queue(ctrl->admin_q);
>>>    	nvme_tcp_destroy_admin_queue(ctrl, remove);
>>> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>    }
>>>    static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>    		bool remove)
>>>    {
>>> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>    	if (ctrl->queue_count <= 1)
>>> -		goto out;
>>> +		return;
>>>    	blk_mq_quiesce_queue(ctrl->admin_q);
>>>    	nvme_start_freeze(ctrl);
>>>    	nvme_stop_queues(ctrl);
>>> @@ -1918,8 +1915,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>    	if (remove)
>>>    		nvme_start_queues(ctrl);
>>>    	nvme_tcp_destroy_io_queues(ctrl, remove);
>>> -out:
>>> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>    }
>>>    static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
>>> @@ -2030,11 +2025,11 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>>>    	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>>>    	nvme_stop_keep_alive(ctrl);
>>> +
>>> +	mutex_lock(&tcp_ctrl->teardown_lock);
>>>    	nvme_tcp_teardown_io_queues(ctrl, false);
>>> -	/* unquiesce to fail fast pending requests */
>>> -	nvme_start_queues(ctrl);
>>>    	nvme_tcp_teardown_admin_queue(ctrl, false);
>>> -	blk_mq_unquiesce_queue(ctrl->admin_q);
>> Delete blk_mq_unquiesce_queue will cause a bug which may cause reconnect failed.
>> Delete nvme_start_queues may cause another bug.
> 
> nvme_tcp_setup_ctrl() will re-start io and admin queue, and only .connect_q
> > and .fabrics_q are required during reconnect.
I checked the code: the admin queue is unquiesced in nvme_tcp_configure_admin_queue, so reconnect can work well.
> 
> So can you explain in detail about the bug?
First, if reconnect fails, keeping the io and admin queues quiesced will pause IO for a long time.
Second, if reconnect fails more than max_reconnects times, deleting the ctrl will hang.
About the second point I'm not sure; you can check it.
The function that hangs is nvme_remove_namespaces in nvme_do_delete_ctrl.
> 
> Thanks,
> Ming
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-15  6:05                             ` Chao Leng
@ 2020-10-15  7:50                               ` Ming Lei
  2020-10-15 10:05                                 ` Chao Leng
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2020-10-15  7:50 UTC (permalink / raw)
  To: Chao Leng
  Cc: Sagi Grimberg, Jens Axboe, Yi Zhang, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On Thu, Oct 15, 2020 at 02:05:01PM +0800, Chao Leng wrote:
> 
> 
> On 2020/10/14 17:56, Ming Lei wrote:
> > On Wed, Oct 14, 2020 at 05:39:12PM +0800, Chao Leng wrote:
> > > 
> > > 
> > > On 2020/10/14 11:34, Ming Lei wrote:
> > > > On Wed, Oct 14, 2020 at 09:08:28AM +0800, Ming Lei wrote:
> > > > > On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
> > > > > > 
> > > > > > > > > This may just reduce the probability. The concurrency of timeout
> > > > > > > > > and teardown will cause the same request
> > > > > > > > > be treated repeatly, this is not we expected.
> > > > > > > > 
> > > > > > > > That is right, not like SCSI, NVME doesn't apply atomic request
> > > > > > > > completion, so
> > > > > > > > request may be completed/freed from both timeout & nvme_cancel_request().
> > > > > > > > 
> > > > > > > > .teardown_lock still may cover the race with Sagi's patch because
> > > > > > > > teardown
> > > > > > > > actually cancels requests in sync style.
> > > > > > > In extreme scenarios, the request may be already retry success(rq state
> > > > > > > change to inflight).
> > > > > > > Timeout processing may wrongly stop the queue and abort the request.
> > > > > > > teardown_lock serialize the process of timeout and teardown, but do not
> > > > > > > avoid the race.
> > > > > > > It might not be safe.
> > > > > > 
> > > > > > Not sure I understand the scenario you are describing.
> > > > > > 
> > > > > > what do you mean by "In extreme scenarios, the request may be already retry
> > > > > > success(rq state change to inflight)"?
> > > > > > 
> > > > > > What will retry the request? only when the host will reconnect
> > > > > > the request will be retried.
> > > > > > 
> > > > > > We can call nvme_sync_queues in the last part of the teardown, but
> > > > > > I still don't understand the race here.
> > > > > 
> > > > > Not like SCSI, NVME doesn't complete request atomically, so double
> > > > > completion/free can be done from both timeout & nvme_cancel_request()(via teardown).
> > > > > 
> > > > > Given request is completed remotely or asynchronously in the two code paths,
> > > > > the teardown_lock can't protect the case.
> > > > 
> > > > Thinking of the issue further, the race shouldn't be between timeout and
> > > > teardown.
> > > > 
> > > > Both nvme_cancel_request() and nvme_tcp_complete_timed_out() are called
> > > > with .teardown_lock, and both check if the request is completed before
> > > > calling blk_mq_complete_request() which marks the request as COMPLETE state.
> > > > So the request shouldn't be double-freed in the two code paths.
> > > > 
> > > > Another possible reason is that between timeout and normal completion(fail
> > > > fast pending requests after ctrl state is updated to CONNECTING).
> > > > 
> > > > Yi, can you try the following patch and see if the issue is fixed?
> > > > 
> > > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > > > index d6a3e1487354..fab9220196bd 100644
> > > > --- a/drivers/nvme/host/tcp.c
> > > > +++ b/drivers/nvme/host/tcp.c
> > > > @@ -1886,7 +1886,6 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
> > > >    static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
> > > >    		bool remove)
> > > >    {
> > > > -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
> > > >    	blk_mq_quiesce_queue(ctrl->admin_q);
> > > >    	nvme_tcp_stop_queue(ctrl, 0);
> > > >    	if (ctrl->admin_tagset) {
> > > > @@ -1897,15 +1896,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
> > > >    	if (remove)
> > > >    		blk_mq_unquiesce_queue(ctrl->admin_q);
> > > >    	nvme_tcp_destroy_admin_queue(ctrl, remove);
> > > > -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
> > > >    }
> > > >    static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
> > > >    		bool remove)
> > > >    {
> > > > -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
> > > >    	if (ctrl->queue_count <= 1)
> > > > -		goto out;
> > > > +		return;
> > > >    	blk_mq_quiesce_queue(ctrl->admin_q);
> > > >    	nvme_start_freeze(ctrl);
> > > >    	nvme_stop_queues(ctrl);
> > > > @@ -1918,8 +1915,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
> > > >    	if (remove)
> > > >    		nvme_start_queues(ctrl);
> > > >    	nvme_tcp_destroy_io_queues(ctrl, remove);
> > > > -out:
> > > > -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
> > > >    }
> > > >    static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
> > > > @@ -2030,11 +2025,11 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > > >    	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> > > >    	nvme_stop_keep_alive(ctrl);
> > > > +
> > > > +	mutex_lock(&tcp_ctrl->teardown_lock);
> > > >    	nvme_tcp_teardown_io_queues(ctrl, false);
> > > > -	/* unquiesce to fail fast pending requests */
> > > > -	nvme_start_queues(ctrl);
> > > >    	nvme_tcp_teardown_admin_queue(ctrl, false);
> > > > -	blk_mq_unquiesce_queue(ctrl->admin_q);
> > > Delete blk_mq_unquiesce_queue will cause a bug which may cause reconnect failed.
> > > Delete nvme_start_queues may cause another bug.
> > 
> > nvme_tcp_setup_ctrl() will re-start io and admin queue, and only .connect_q
> > and .fabrics_q are required during reconnect.
> I checked the code: the admin queue is unquiesced in nvme_tcp_configure_admin_queue, so reconnect can work well.
> > 
> > So can you explain in detail about the bug?
> First if reconnect failed, quiesce the io queue and admin queue will cause IO pause long time.

Normal IO can't make progress until reconnect succeeds anyway, so this
change won't increase the IO pause. This is exactly the approach NVMe PCI takes;
see nvme_start_queues() called from nvme_reset_work().

> Second if reconnect failed more than max_reconnects, delete ctrl will hang.

No, deleting the ctrl won't hang, because the 'shutdown' parameter is true
when the ctrl is being deleted, which makes nvme_tcp_teardown_io_queues() and
nvme_tcp_teardown_admin_queue() unquiesce both the admin_q and the io queues.
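
The delete path goes through something like the following -- a paraphrased
sketch of the nvme-tcp teardown helper with the work-cancellation details
omitted, so the exact contents may differ from the tree:

static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
{
	/* shutdown is true on the ctrl delete path */
	nvme_tcp_teardown_io_queues(ctrl, shutdown);
	blk_mq_quiesce_queue(ctrl->admin_q);
	if (shutdown)
		nvme_shutdown_ctrl(ctrl);
	else
		nvme_disable_ctrl(ctrl);
	/* remove/shutdown == true makes the helpers unquiesce the queues again */
	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
}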


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] block: re-introduce blk_mq_complete_request_sync
  2020-10-15  7:50                               ` Ming Lei
@ 2020-10-15 10:05                                 ` Chao Leng
  0 siblings, 0 replies; 26+ messages in thread
From: Chao Leng @ 2020-10-15 10:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Yi Zhang, Sagi Grimberg, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig



On 2020/10/15 15:50, Ming Lei wrote:
> On Thu, Oct 15, 2020 at 02:05:01PM +0800, Chao Leng wrote:
>>
>>
>> On 2020/10/14 17:56, Ming Lei wrote:
>>> On Wed, Oct 14, 2020 at 05:39:12PM +0800, Chao Leng wrote:
>>>>
>>>>
>>>> On 2020/10/14 11:34, Ming Lei wrote:
>>>>> On Wed, Oct 14, 2020 at 09:08:28AM +0800, Ming Lei wrote:
>>>>>> On Tue, Oct 13, 2020 at 03:36:08PM -0700, Sagi Grimberg wrote:
>>>>>>>
>>>>>>>>>> This may just reduce the probability. The concurrency of timeout
>>>>>>>>>> and teardown can cause the same request to
>>>>>>>>>> be handled repeatedly, which is not what we expect.
>>>>>>>>>
>>>>>>>>> That is right: unlike SCSI, NVMe doesn't apply atomic request completion,
>>>>>>>>> so a request may be completed/freed from both timeout & nvme_cancel_request().
>>>>>>>>>
>>>>>>>>> .teardown_lock may still cover the race with Sagi's patch because
>>>>>>>>> teardown actually cancels requests synchronously.
>>>>>>>> In extreme scenarios, the request may already have been retried
>>>>>>>> successfully (rq state changed back to in-flight).
>>>>>>>> Timeout processing may then wrongly stop the queue and abort the request.
>>>>>>>> teardown_lock serializes the timeout and teardown paths, but does not
>>>>>>>> avoid this race.
>>>>>>>> It might not be safe.
>>>>>>>
>>>>>>> Not sure I understand the scenario you are describing.
>>>>>>>
>>>>>>> What do you mean by "In extreme scenarios, the request may already have
>>>>>>> been retried successfully (rq state changed back to in-flight)"?
>>>>>>>
>>>>>>> What will retry the request? The request will only be retried
>>>>>>> once the host reconnects.
>>>>>>>
>>>>>>> We can call nvme_sync_queues in the last part of the teardown, but
>>>>>>> I still don't understand the race here.
>>>>>>
>>>>>> Unlike SCSI, NVMe doesn't complete requests atomically, so a double
>>>>>> completion/free can happen from both timeout & nvme_cancel_request() (via teardown).
>>>>>>
>>>>>> Given that the request is completed remotely or asynchronously in the two code
>>>>>> paths, the teardown_lock can't protect against this case.
>>>>>
>>>>> Thinking of the issue further, the race shouldn't be between timeout and
>>>>> teardown.
>>>>>
>>>>> Both nvme_cancel_request() and nvme_tcp_complete_timed_out() are called
>>>>> with .teardown_lock held, and both check whether the request is already
>>>>> completed before calling blk_mq_complete_request(), which moves the request
>>>>> to the COMPLETE state. So the request shouldn't be double-freed in the two
>>>>> code paths.
>>>>>
>>>>> Another possible reason is a race between the timeout path and normal
>>>>> completion (the fast-failing of pending requests after the ctrl state is
>>>>> updated to CONNECTING).
>>>>>
>>>>> Yi, can you try the following patch and see if the issue is fixed?
>>>>>
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index d6a3e1487354..fab9220196bd 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -1886,7 +1886,6 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
>>>>>     static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
>>>>>     		bool remove)
>>>>>     {
>>>>> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>     	blk_mq_quiesce_queue(ctrl->admin_q);
>>>>>     	nvme_tcp_stop_queue(ctrl, 0);
>>>>>     	if (ctrl->admin_tagset) {
>>>>> @@ -1897,15 +1896,13 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
>>>>>     	if (remove)
>>>>>     		blk_mq_unquiesce_queue(ctrl->admin_q);
>>>>>     	nvme_tcp_destroy_admin_queue(ctrl, remove);
>>>>> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>     }
>>>>>     static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>>>     		bool remove)
>>>>>     {
>>>>> -	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>     	if (ctrl->queue_count <= 1)
>>>>> -		goto out;
>>>>> +		return;
>>>>>     	blk_mq_quiesce_queue(ctrl->admin_q);
>>>>>     	nvme_start_freeze(ctrl);
>>>>>     	nvme_stop_queues(ctrl);
>>>>> @@ -1918,8 +1915,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>>>     	if (remove)
>>>>>     		nvme_start_queues(ctrl);
>>>>>     	nvme_tcp_destroy_io_queues(ctrl, remove);
>>>>> -out:
>>>>> -	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
>>>>>     }
>>>>>     static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
>>>>> @@ -2030,11 +2025,11 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>>>>>     	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>>>>>     	nvme_stop_keep_alive(ctrl);
>>>>> +
>>>>> +	mutex_lock(&tcp_ctrl->teardown_lock);
>>>>>     	nvme_tcp_teardown_io_queues(ctrl, false);
>>>>> -	/* unquiesce to fail fast pending requests */
>>>>> -	nvme_start_queues(ctrl);
>>>>>     	nvme_tcp_teardown_admin_queue(ctrl, false);
>>>>> -	blk_mq_unquiesce_queue(ctrl->admin_q);
>>>> Deleting blk_mq_unquiesce_queue will introduce a bug that may make reconnect fail.
>>>> Deleting nvme_start_queues may cause another bug.
>>>
>>> nvme_tcp_setup_ctrl() will re-start the io and admin queues, and only .connect_q
>>> and .fabrics_q are required during reconnect. I checked the code: the admin queue
>>> is unquiesced in nvme_tcp_configure_admin_queue, so reconnect can work well.
>>>
>>> So can you explain in detail what the bug is?
>> First, if the reconnect fails, keeping the io and admin queues quiesced will pause IO for a long time.
> 
> No normal IO can make progress until the reconnect succeeds, so this
> change won't increase the IO pause. This is exactly the approach NVMe PCI
> takes; see nvme_start_queues() called from nvme_reset_work().
For now that is ok. The patch that fixes the long IO pause time is currently under discussion.
> 
>> Second, if the reconnect fails more than max_reconnects times, deleting the ctrl will hang.
> 
> No, deleting the ctrl won't hang, because the 'shutdown' parameter is true
> when the ctrl is being deleted, which makes nvme_tcp_teardown_io_queues() and
> nvme_tcp_teardown_admin_queue() unquiesce both the admin_q and the io queues.
No: nvme_remove_namespaces() now runs before the queues are torn down; the
queue teardown happens in ctrl->ops->delete_ctrl(). So namespace removal can
hang on the still-quiesced queues before ->delete_ctrl() ever gets a chance to
unquiesce them.
static void nvme_do_delete_ctrl(struct nvme_ctrl *ctrl)
{
	dev_info(ctrl->device,
		 "Removing ctrl: NQN \"%s\"\n", ctrl->opts->subsysnqn);

	flush_work(&ctrl->reset_work);
	nvme_stop_ctrl(ctrl);
	nvme_remove_namespaces(ctrl);
	ctrl->ops->delete_ctrl(ctrl);
	nvme_uninit_ctrl(ctrl);
}
> 
> 
> Thanks,
> Ming
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-10-15 10:32 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08 21:37 [PATCH] block: re-introduce blk_mq_complete_request_sync Sagi Grimberg
2020-10-09  4:39 ` Ming Lei
2020-10-09  5:03   ` Yi Zhang
2020-10-09  8:09     ` Sagi Grimberg
2020-10-09 13:55       ` Yi Zhang
2020-10-09 18:29         ` Sagi Grimberg
2020-10-10  6:08           ` Yi Zhang
2020-10-12  3:59             ` Chao Leng
2020-10-12  8:13               ` Ming Lei
2020-10-12  9:06                 ` Chao Leng
2020-10-13 22:36                   ` Sagi Grimberg
2020-10-14  1:08                     ` Ming Lei
2020-10-14  1:37                       ` Chao Leng
2020-10-14  2:02                         ` Ming Lei
2020-10-14  2:32                           ` Chao Leng
2020-10-14  2:41                           ` Chao Leng
2020-10-14  3:34                       ` Ming Lei
2020-10-14  9:39                         ` Chao Leng
2020-10-14  9:56                           ` Ming Lei
2020-10-15  6:05                             ` Chao Leng
2020-10-15  7:50                               ` Ming Lei
2020-10-15 10:05                                 ` Chao Leng
2020-10-14  1:32                     ` Chao Leng
2020-10-13 22:31                 ` Sagi Grimberg
2020-10-14  1:25                   ` Chao Leng
2020-10-09  8:11   ` Sagi Grimberg
