Re: NVMe Over Fabrics Disconnect Kernel error

From: Max Gurtovoy <maxg@mellanox.com>
To: Anton Brekhov <anton.brekhov@rsc-tech.ru>
Cc: Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org,
	Konstantin Ponomarev <k.ponomarev@rsc-tech.ru>
Subject: Re: NVMe Over Fabrics Disconnect Kernel error
Date: Sun, 29 Mar 2020 14:56:10 +0300
Message-ID: <9024d7bc-d55d-06c1-65b3-61027f81fda6@mellanox.com>
In-Reply-To: <CABY-YC4jSOZJW2zEx5dS9BRj8+ipNF5aF_0cgkuDo9oaLbhvew@mail.gmail.com>

On 3/29/2020 2:38 PM, Anton Brekhov wrote:
> Max,
> We hit this error while using the latest release of nvme-cli:
> [root@s02p005 ~]# nvme version
> nvme version 1.10.1
>
> Or were there major changes after the latest release?

I was referring to the kernel version, not the nvme-cli version.

Can you check your scenario with the tree at git://git.infradead.org/nvme.git
(branch nvme-5.7 or nvme-5.7-rc1)?
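
A minimal sketch of fetching one of those branches, assuming git is installed and the git:// port is reachable from the test node. The helper name fetch_nvme_branch, the DRY_RUN switch, and the default destination directory are hypothetical and only for illustration; building and booting the resulting kernel is left out.

```shell
#!/bin/sh
# Tree URL as given in this thread.
NVME_TREE_URL="git://git.infradead.org/nvme.git"

# Hypothetical helper: shallow-clone a single branch of the nvme tree.
#   $1 = branch name (defaults to nvme-5.7)
#   $2 = destination directory (defaults to ./nvme-test, an arbitrary choice)
# With DRY_RUN=1 the git command is only printed, not executed.
fetch_nvme_branch() {
    branch="${1:-nvme-5.7}"
    dest="${2:-nvme-test}"
    # Reuse the function's positional parameters to hold the command line.
    set -- git clone --depth 1 --branch "$branch" "$NVME_TREE_URL" "$dest"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1; fetch_nvme_branch nvme-5.7-rc1` prints the clone command for the second branch without touching the network.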

-Max.

> Thanks.
>
> On Sun, 29 Mar 2020 at 11:51, Max Gurtovoy <maxg@mellanox.com> wrote:
>>
>> On 3/29/2020 7:14 AM, Sagi Grimberg wrote:
>>>> Greetings!
>>>>
>>>> We're using nvme-cli with ZFS and the Lustre filesystem on top of
>>>> it.
>>>> But we constantly hit a kernel error while disconnecting remote
>>>> disks from powered-off nodes:
>>>> ```
>>>> [  +0,000089] INFO: task kworker/u593:0:82293 blocked for more than 120 seconds.
>>>> [  +0,001959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [  +0,001941] kworker/u593:0  D ffff90e8493fe2a0     0 82293      2 0x00000080
>>>> [  +0,000031] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
>>>> [  +0,000003] Call Trace:
>>>> [  +0,000008]  [<ffffffff8177f229>] schedule+0x29/0x70
>>>> [  +0,000010]  [<ffffffff81358e85>] blk_mq_freeze_queue_wait+0x75/0xe0
>>>> [  +0,000007]  [<ffffffff810c61c0>] ? wake_up_atomic_t+0x30/0x30
>>>> [  +0,000006]  [<ffffffff81359cb4>] blk_freeze_queue+0x24/0x50
>>>> [  +0,000009]  [<ffffffff8134e0ef>] blk_cleanup_queue+0x7f/0x1b0
>>>> [  +0,000012]  [<ffffffffc031158e>] nvme_ns_remove+0x8e/0xb0 [nvme_core]
>>>> [  +0,000011]  [<ffffffffc031174b>] nvme_remove_namespaces+0xab/0xf0 [nvme_core]
>>>> [  +0,000012]  [<ffffffffc03117e2>] nvme_delete_ctrl_work+0x52/0x80 [nvme_core]
>>>> [  +0,000008]  [<ffffffff810bd0ff>] process_one_work+0x17f/0x440
>>>> [  +0,000006]  [<ffffffff810be368>] worker_thread+0x278/0x3c0
>>>> [  +0,000006]  [<ffffffff810be0f0>] ? manage_workers.isra.26+0x2a0/0x2a0
>>>> [  +0,000005]  [<ffffffff810c50d1>] kthread+0xd1/0xe0
>>>> [  +0,000006]  [<ffffffff810c5000>] ? insert_kthread_work+0x40/0x40
>>>> [  +0,000006]  [<ffffffff8178cd1d>] ret_from_fork_nospec_begin+0x7/0x21
>>>> [  +0,000006]  [<ffffffff810c5000>] ? insert_kthread_work+0x40/0x40
>>>> ```
>>>> Node characteristics:
>>>> [root@s02p005 ~]# uname -srm
>>>> Linux 3.10.0-1062.1.1.el7.x86_64 x86_64
>>>> [root@s02p005 ~]# cat /etc/redhat-release
>>>> CentOS Linux release 7.7.1908 (Core)
>>>>
>>>> We're using nvmet_rdma.
>>>> Is there any workaround for this error?
>>> It seems like the queue freeze is stuck. Can you share more of the
>>> trace so we can see what else is blocking? If not, when it
>>> reproduces, run echo t > /proc/sysrq-trigger and share the
>>> log.
>> Anton,
>>
>> Can you reproduce this with the latest nvme branch, or only with the
>> inbox CentOS 7.7 kernel?
>>
>>
>>> Thanks.
>>>
>>> _______________________________________________
>>> linux-nvme mailing list
>>> linux-nvme@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>
