* nvme_tcp: nvme connect failed after execute stress-ng: unshare
[not found] <1594444669.711245.1599710442490.JavaMail.zimbra@redhat.com>
@ 2020-09-10 4:35 ` Yi Zhang
2020-09-14 23:50 ` Sagi Grimberg
0 siblings, 1 reply; 4+ messages in thread
From: Yi Zhang @ 2020-09-10 4:35 UTC (permalink / raw)
To: sagi, gscrivan, colin.king; +Cc: linux-nvme
Hello
Recently I found that nvme-tcp connect always fails[1] after running stress-ng:unshare[2]. By bisecting, I found the failure was introduced by commit[3]; connecting works well after reverting it.
I'm not sure whether this is a test-case issue or a kernel issue. Could anyone help check it?
[1]
# sh test.sh
+ ./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log
stress-ng: info: [355534] dispatching hogs: 32 unshare
stress-ng: info: [355534] successful run completed in 5.04s
+ modprobe null-blk nr-devices=1
+ modprobe nvmet-tcp
+ modprobe nvme-tcp
+ nvmetcli restore tcp.json
+ nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a 127.0.0.1 -s 4420
Failed to write to /dev/nvme-fabrics: Input/output error
# dmesg | tail -9
[ 700.012299] null_blk: module loaded
[ 700.073415] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 700.073923] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 715.291020] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[ 715.297031] nvmet: ctrl 1 fatal error occurred!
[ 749.939898] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:e405e6bb-8e28-4a73-b338-3fddb5746b8c.
[ 763.417376] nvme nvme0: queue 0: timeout request 0x0 type 4
[ 763.422979] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[ 763.429419] nvme nvme0: failed to connect queue: 0 ret=881
# uname -r
5.9.0-rc4
[2] stress-ng: unshare case
https://github.com/ColinIanKing/stress-ng.git
https://github.com/ColinIanKing/stress-ng/blob/master/stress-unshare.c
[3]
commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano <gscrivan@redhat.com>
Date: Sun Jun 7 21:40:10 2020 -0700
ipc/namespace.c: use a work queue to free_ipc
the reason is to avoid a delay caused by the synchronize_rcu() call in
kern_umount() when the mqueue mount is freed.
[4]
# cat tcp.json
{
  "hosts": [],
  "ports": [
    {
      "addr": {
        "adrfam": "ipv4",
        "traddr": "127.0.0.1",
        "treq": "not specified",
        "trsvcid": "4420",
        "trtype": "tcp"
      },
      "portid": 0,
      "referrals": [],
      "subsystems": [
        "blktests-subsystem-1"
      ]
    }
  ],
  "subsystems": [
    {
      "allowed_hosts": [],
      "attr": {
        "allow_any_host": "1",
        "cntlid_max": "65519",
        "cntlid_min": "1",
        "model": "Linux",
        "pi_enable": "0",
        "serial": "7d833f5501f6b240",
        "version": "1.3"
      },
      "namespaces": [
        {
          "device": {
            "nguid": "00000000-0000-0000-0000-000000000000",
            "path": "/dev/nullb0",
            "uuid": "b07c7eef-8428-47bf-8e79-26ec8c30f334"
          },
          "enable": 1,
          "nsid": 1
        }
      ],
      "nqn": "blktests-subsystem-1"
    }
  ]
}
Best Regards,
Yi Zhang
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
2020-09-10 4:35 ` nvme_tcp: nvme connect failed after execute stress-ng: unshare Yi Zhang
@ 2020-09-14 23:50 ` Sagi Grimberg
2020-09-15 1:40 ` Yi Zhang
0 siblings, 1 reply; 4+ messages in thread
From: Sagi Grimberg @ 2020-09-14 23:50 UTC (permalink / raw)
To: Yi Zhang, gscrivan, colin.king; +Cc: linux-nvme
> Hello
>
> Recently I found that nvme-tcp connect always fails[1] after running stress-ng:unshare[2]. By bisecting, I found the failure was introduced by commit[3]; connecting works well after reverting it.
> I'm not sure whether this is a test-case issue or a kernel issue. Could anyone help check it?
Is this failure persistent or transient?
>
> [1]
> # sh test.sh
> + ./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log
> stress-ng: info: [355534] dispatching hogs: 32 unshare
> stress-ng: info: [355534] successful run completed in 5.04s
> + modprobe null-blk nr-devices=1
> + modprobe nvmet-tcp
> + modprobe nvme-tcp
> + nvmetcli restore tcp.json
> + nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a 127.0.0.1 -s 4420
> Failed to write to /dev/nvme-fabrics: Input/output error
>
> # dmesg | tail -9
> [ 700.012299] null_blk: module loaded
> [ 700.073415] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [ 700.073923] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> [ 715.291020] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [ 715.297031] nvmet: ctrl 1 fatal error occurred!
> [ 749.939898] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:e405e6bb-8e28-4a73-b338-3fddb5746b8c.
> [ 763.417376] nvme nvme0: queue 0: timeout request 0x0 type 4
> [ 763.422979] nvme nvme0: Connect command failed, error wo/DNR bit: 881
> [ 763.429419] nvme nvme0: failed to connect queue: 0 ret=881
>
> # uname -r
> 5.9.0-rc4
>
>
> [2] stress-ng: unshare case
> https://github.com/ColinIanKing/stress-ng.git
> https://github.com/ColinIanKing/stress-ng/blob/master/stress-unshare.c
>
>
> [3]
> commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
> Author: Giuseppe Scrivano <gscrivan@redhat.com>
> Date: Sun Jun 7 21:40:10 2020 -0700
>
> ipc/namespace.c: use a work queue to free_ipc
>
> the reason is to avoid a delay caused by the synchronize_rcu() call in
> kern_umount() when the mqueue mount is freed.
>
>
> [4]
> # cat tcp.json
> {
>   "hosts": [],
>   "ports": [
>     {
>       "addr": {
>         "adrfam": "ipv4",
>         "traddr": "127.0.0.1",
>         "treq": "not specified",
>         "trsvcid": "4420",
>         "trtype": "tcp"
>       },
>       "portid": 0,
>       "referrals": [],
>       "subsystems": [
>         "blktests-subsystem-1"
>       ]
>     }
>   ],
>   "subsystems": [
>     {
>       "allowed_hosts": [],
>       "attr": {
>         "allow_any_host": "1",
>         "cntlid_max": "65519",
>         "cntlid_min": "1",
>         "model": "Linux",
>         "pi_enable": "0",
>         "serial": "7d833f5501f6b240",
>         "version": "1.3"
>       },
>       "namespaces": [
>         {
>           "device": {
>             "nguid": "00000000-0000-0000-0000-000000000000",
>             "path": "/dev/nullb0",
>             "uuid": "b07c7eef-8428-47bf-8e79-26ec8c30f334"
>           },
>           "enable": 1,
>           "nsid": 1
>         }
>       ],
>       "nqn": "blktests-subsystem-1"
>     }
>   ]
> }
>
>
> Best Regards,
> Yi Zhang
>
>
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
2020-09-14 23:50 ` Sagi Grimberg
@ 2020-09-15 1:40 ` Yi Zhang
2020-10-28 12:19 ` Yi Zhang
0 siblings, 1 reply; 4+ messages in thread
From: Yi Zhang @ 2020-09-15 1:40 UTC (permalink / raw)
To: Sagi Grimberg, gscrivan, colin.king; +Cc: linux-nvme
Hi Sagi
On 9/15/20 7:50 AM, Sagi Grimberg wrote:
>
>> Hello
>>
>> Recently I found that nvme-tcp connect always fails[1] after running
>> stress-ng:unshare[2]. By bisecting, I found the failure was introduced
>> by commit[3]; connecting works well after reverting it.
>> I'm not sure whether this is a test-case issue or a kernel issue.
>> Could anyone help check it?
>
> Is this failure persistent or transient?
>
It's persistent, and most recent CKI jobs with the 5.8 stable kernel also
showed this failure.
>>
>> [1]
>> # sh test.sh
>> + ./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log
>> stress-ng: info: [355534] dispatching hogs: 32 unshare
>> stress-ng: info: [355534] successful run completed in 5.04s
>> + modprobe null-blk nr-devices=1
>> + modprobe nvmet-tcp
>> + modprobe nvme-tcp
>> + nvmetcli restore tcp.json
>> + nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a
>> 127.0.0.1 -s 4420
>> Failed to write to /dev/nvme-fabrics: Input/output error
>>
>> # dmesg | tail -9
>> [ 700.012299] null_blk: module loaded
>> [ 700.073415] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [ 700.073923] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>> [ 715.291020] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>> [ 715.297031] nvmet: ctrl 1 fatal error occurred!
>> [ 749.939898] nvmet: creating controller 1 for subsystem
>> nqn.2014-08.org.nvmexpress.discovery for NQN
>> nqn.2014-08.org.nvmexpress:uuid:e405e6bb-8e28-4a73-b338-3fddb5746b8c.
>> [ 763.417376] nvme nvme0: queue 0: timeout request 0x0 type 4
>> [ 763.422979] nvme nvme0: Connect command failed, error wo/DNR bit: 881
>> [ 763.429419] nvme nvme0: failed to connect queue: 0 ret=881
>>
>> # uname -r
>> 5.9.0-rc4
>>
>>
>> [2] stress-ng: unshare case
>> https://github.com/ColinIanKing/stress-ng.git
>> https://github.com/ColinIanKing/stress-ng/blob/master/stress-unshare.c
>>
>>
>> [3]
>> commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
>> Author: Giuseppe Scrivano <gscrivan@redhat.com>
>> Date: Sun Jun 7 21:40:10 2020 -0700
>>
>> ipc/namespace.c: use a work queue to free_ipc
>> the reason is to avoid a delay caused by the
>> synchronize_rcu() call in
>> kern_umount() when the mqueue mount is freed.
>>
>>
>> [4]
>> # cat tcp.json
>> {
>>   "hosts": [],
>>   "ports": [
>>     {
>>       "addr": {
>>         "adrfam": "ipv4",
>>         "traddr": "127.0.0.1",
>>         "treq": "not specified",
>>         "trsvcid": "4420",
>>         "trtype": "tcp"
>>       },
>>       "portid": 0,
>>       "referrals": [],
>>       "subsystems": [
>>         "blktests-subsystem-1"
>>       ]
>>     }
>>   ],
>>   "subsystems": [
>>     {
>>       "allowed_hosts": [],
>>       "attr": {
>>         "allow_any_host": "1",
>>         "cntlid_max": "65519",
>>         "cntlid_min": "1",
>>         "model": "Linux",
>>         "pi_enable": "0",
>>         "serial": "7d833f5501f6b240",
>>         "version": "1.3"
>>       },
>>       "namespaces": [
>>         {
>>           "device": {
>>             "nguid": "00000000-0000-0000-0000-000000000000",
>>             "path": "/dev/nullb0",
>>             "uuid": "b07c7eef-8428-47bf-8e79-26ec8c30f334"
>>           },
>>           "enable": 1,
>>           "nsid": 1
>>         }
>>       ],
>>       "nqn": "blktests-subsystem-1"
>>     }
>>   ]
>> }
>>
>>
>> Best Regards,
>> Yi Zhang
>>
>>
>
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
2020-09-15 1:40 ` Yi Zhang
@ 2020-10-28 12:19 ` Yi Zhang
0 siblings, 0 replies; 4+ messages in thread
From: Yi Zhang @ 2020-10-28 12:19 UTC (permalink / raw)
To: Sagi Grimberg, gscrivan; +Cc: colin.king, linux-nvme
On 9/15/20 9:40 AM, Yi Zhang wrote:
> Hi Sagi
>
> On 9/15/20 7:50 AM, Sagi Grimberg wrote:
>>
>>> Hello
>>>
>>> Recently I found that nvme-tcp connect always fails[1] after running
>>> stress-ng:unshare[2]. By bisecting, I found the failure was introduced
>>> by commit[3]; connecting works well after reverting it.
>>> I'm not sure whether this is a test-case issue or a kernel issue.
>>> Could anyone help check it?
>>
>> Is this failure persistent or transient?
>>
> It's persistent, and most recent CKI jobs with the 5.8 stable kernel
> also showed this failure.
>
I hit a "WARNING: possible circular locking dependency" splat with the latest 5.10.0-rc1:
[ 230.353213] run blktests nvme/003 at 2020-10-28 07:41:08
[ 230.666092] loop: module loaded
[ 230.870124] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 230.902252] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 246.097201] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[ 246.104886] nvmet: ctrl 1 fatal error occurred!
[ 246.113796] ======================================================
[ 246.120042] WARNING: possible circular locking dependency detected
[ 246.126288] 5.10.0-rc1 #1 Not tainted
[ 246.129987] ------------------------------------------------------
[ 246.136231] kworker/0:2/251 is trying to acquire lock:
[ 246.141423] ffff00012ec0b830 ((work_completion)(&queue->io_work)){+.+.}-{0:0}, at: __flush_work+0x48/0x4f0
[ 246.151202]
but task is already holding lock:
[ 246.157095] ffff8000147e3dd8 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[ 246.167743]
which lock already depends on the new lock.
[ 246.176007]
the existing dependency chain (in reverse order) is:
[ 246.183569]
-> #2 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[ 246.191849] validate_chain+0x5f0/0xa00
[ 246.196251] __lock_acquire+0x3b0/0xb70
[ 246.200652] lock_acquire+0x104/0x428
[ 246.204878] process_one_work+0x21c/0x670
[ 246.209456] worker_thread+0x54/0x408
[ 246.213684] kthread+0x11c/0x148
[ 246.217473] ret_from_fork+0x10/0x18
[ 246.221611]
-> #1 ((wq_completion)events){+.+.}-{0:0}:
[ 246.228602] validate_chain+0x5f0/0xa00
[ 246.233005] __lock_acquire+0x3b0/0xb70
[ 246.237407] lock_acquire+0x104/0x428
[ 246.241634] flush_workqueue+0x8c/0x4a8
[ 246.246041] nvmet_tcp_install_queue+0x100/0x110 [nvmet_tcp]
[ 246.252309] nvmet_install_queue+0xc4/0x150 [nvmet]
[ 246.257775] nvmet_execute_admin_connect+0xf4/0x1e0 [nvmet]
[ 246.263935] nvmet_tcp_io_work+0x8dc/0x970 [nvmet_tcp]
[ 246.269654] process_one_work+0x268/0x670
[ 246.274231] worker_thread+0x54/0x408
[ 246.278458] kthread+0x11c/0x148
[ 246.282246] ret_from_fork+0x10/0x18
[ 246.288493]
-> #0 ((work_completion)(&queue->io_work)){+.+.}-{0:0}:
[ 246.300559] check_noncircular+0x100/0x128
[ 246.307367] check_prev_add+0xa4/0x668
[ 246.313777] validate_chain+0x5f0/0xa00
[ 246.320230] __lock_acquire+0x3b0/0xb70
[ 246.326652] lock_acquire+0x104/0x428
[ 246.332879] __flush_work+0x70/0x4f0
[ 246.338994] flush_work+0x1c/0x28
[ 246.344833] nvmet_tcp_release_queue_work+0xd0/0x230 [nvmet_tcp]
[ 246.353405] process_one_work+0x268/0x670
[ 246.359955] worker_thread+0x54/0x408
[ 246.366131] kthread+0x11c/0x148
[ 246.371843] ret_from_fork+0x10/0x18
[ 246.377876]
other info that might help us debug this:
[ 246.391416] Chain exists of:
                (work_completion)(&queue->io_work) --> (wq_completion)events --> (work_completion)(&queue->release_work)
[ 246.411895] Possible unsafe locking scenario:
[ 246.421363]        CPU0                    CPU1
[ 246.427687]        ----                    ----
[ 246.433978]   lock((work_completion)(&queue->release_work));
[ 246.441413]                                lock((wq_completion)events);
[ 246.449816]                                lock((work_completion)(&queue->release_work));
[ 246.459780]   lock((work_completion)(&queue->io_work));
[ 246.466808]
*** DEADLOCK ***
[ 246.478008] 2 locks held by kworker/0:2/251:
[ 246.484053] #0: ffff000100051b38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[ 246.495355] #1: ffff8000147e3dd8 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[ 246.508301]
stack backtrace:
[ 246.516512] CPU: 0 PID: 251 Comm: kworker/0:2 Not tainted 5.10.0-rc1 #1
[ 246.525121] Hardware name: GIGABYTE R120-T32-00/MT30-GS1-00, BIOS F02 08/06/2019
[ 246.534560] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
[ 246.543156] Call trace:
[ 246.547634] dump_backtrace+0x0/0x1c8
[ 246.553325] show_stack+0x20/0x70
[ 246.558653] dump_stack+0xf8/0x164
[ 246.564043] print_circular_bug.isra.0+0x234/0x290
[ 246.570834] check_noncircular+0x100/0x128
[ 246.576931] check_prev_add+0xa4/0x668
[ 246.582676] validate_chain+0x5f0/0xa00
[ 246.588512] __lock_acquire+0x3b0/0xb70
[ 246.594346] lock_acquire+0x104/0x428
[ 246.599986] __flush_work+0x70/0x4f0
[ 246.605506] flush_work+0x1c/0x28
[ 246.610765] nvmet_tcp_release_queue_work+0xd0/0x230 [nvmet_tcp]
[ 246.618768] process_one_work+0x268/0x670
[ 246.624777] worker_thread+0x54/0x408
[ 246.630438] kthread+0x11c/0x148
[ 246.635661] ret_from_fork+0x10/0x18
[ 295.375201] nvme nvme0: queue 0: timeout request 0x0 type 4
[ 295.383074] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[ 295.391912] nvme nvme0: failed to connect queue: 0 ret=881
[ 479.792299] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0e657e35bc5f402480543a94c7a292ed.
>>>
>>> [1]
>>> # sh test.sh
>>> + ./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log
>>> stress-ng: info: [355534] dispatching hogs: 32 unshare
>>> stress-ng: info: [355534] successful run completed in 5.04s
>>> + modprobe null-blk nr-devices=1
>>> + modprobe nvmet-tcp
>>> + modprobe nvme-tcp
>>> + nvmetcli restore tcp.json
>>> + nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a
>>> 127.0.0.1 -s 4420
>>> Failed to write to /dev/nvme-fabrics: Input/output error
>>>
>>> # dmesg | tail -9
>>> [ 700.012299] null_blk: module loaded
>>> [ 700.073415] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [ 700.073923] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>> [ 715.291020] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>>> [ 715.297031] nvmet: ctrl 1 fatal error occurred!
>>> [ 749.939898] nvmet: creating controller 1 for subsystem
>>> nqn.2014-08.org.nvmexpress.discovery for NQN
>>> nqn.2014-08.org.nvmexpress:uuid:e405e6bb-8e28-4a73-b338-3fddb5746b8c.
>>> [ 763.417376] nvme nvme0: queue 0: timeout request 0x0 type 4
>>> [ 763.422979] nvme nvme0: Connect command failed, error wo/DNR bit:
>>> 881
>>> [ 763.429419] nvme nvme0: failed to connect queue: 0 ret=881
>>>
>>> # uname -r
>>> 5.9.0-rc4
>>>
>>>
>>> [2] stress-ng: unshare case
>>> https://github.com/ColinIanKing/stress-ng.git
>>> https://github.com/ColinIanKing/stress-ng/blob/master/stress-unshare.c
>>>
>>>
>>> [3]
>>> commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
>>> Author: Giuseppe Scrivano <gscrivan@redhat.com>
>>> Date: Sun Jun 7 21:40:10 2020 -0700
>>>
>>> ipc/namespace.c: use a work queue to free_ipc
>>> the reason is to avoid a delay caused by the
>>> synchronize_rcu() call in
>>> kern_umount() when the mqueue mount is freed.
Thread overview: 4+ messages
[not found] <1594444669.711245.1599710442490.JavaMail.zimbra@redhat.com>
2020-09-10 4:35 ` nvme_tcp: nvme connect failed after execute stress-ng: unshare Yi Zhang
2020-09-14 23:50 ` Sagi Grimberg
2020-09-15 1:40 ` Yi Zhang
2020-10-28 12:19 ` Yi Zhang