* nvme-tcp crashes the system when overloading the backend device.
@ 2021-08-31 13:30 Mark Ruijter
2021-09-01 12:49 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Mark Ruijter @ 2021-08-31 13:30 UTC (permalink / raw)
To: linux-nvme
Hi all,
I can consistently crash a system when I sufficiently overload the nvme-tcp target.
The easiest way to reproduce the problem is by creating a raid5.
While this R5 is resyncing, export it with the nvmet-tcp target driver and start a high queue-depth 4K random fio workload from the initiator.
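For reference, the setup amounts to something like the following sketch (the disk names, the saved nvmetcli config file, and the initiator-side device are assumptions for illustration; with DO_IT unset the commands are only printed, not run):

```shell
# Reproduction sketch. The disks (sdb-sde), the saved nvmetcli config file,
# and the initiator-side device (/dev/nvme1n1) are assumptions -- adjust to
# your setup. With DO_IT unset the commands are only echoed, not executed.
run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }

# Target: create a raid5; the initial resync starts immediately.
run mdadm --create /dev/md0 --level=5 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Target: export /dev/md0 through the nvmet-tcp driver.
run nvmetcli restore r5-tcp-export.json    # hypothetical saved config

# Initiator: high queue-depth 4K random read workload during the resync.
run fio --name=overload --filename=/dev/nvme1n1 --rw=randread --bs=4k \
    --iodepth=128 --numjobs=16 --ioengine=libaio --direct=1 \
    --time_based --runtime=600
```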
At some point the target system will start logging these messages:
[ 2865.725069] nvmet: ctrl 238 keep-alive timer (15 seconds) expired!
[ 2865.725072] nvmet: ctrl 236 keep-alive timer (15 seconds) expired!
[ 2865.725075] nvmet: ctrl 238 fatal error occurred!
[ 2865.725076] nvmet: ctrl 236 fatal error occurred!
[ 2865.725080] nvmet: ctrl 237 keep-alive timer (15 seconds) expired!
[ 2865.725083] nvmet: ctrl 237 fatal error occurred!
[ 2865.725087] nvmet: ctrl 235 keep-alive timer (15 seconds) expired!
[ 2865.725094] nvmet: ctrl 235 fatal error occurred!
Even when you stop all IO from the initiator, some of the nvmet_tcp_wq workers keep running forever.
The workload shown with "top" never returns to the normal idle level.
root 5669 1.1 0.0 0 0 ? D< 03:39 0:09 [kworker/22:2H+nvmet_tcp_wq]
root 5670 0.8 0.0 0 0 ? D< 03:39 0:06 [kworker/55:2H+nvmet_tcp_wq]
root 5676 0.2 0.0 0 0 ? D< 03:39 0:01 [kworker/29:2H+nvmet_tcp_wq]
root 5677 12.2 0.0 0 0 ? D< 03:39 1:35 [kworker/59:2H+nvmet_tcp_wq]
root 5679 5.7 0.0 0 0 ? D< 03:39 0:44 [kworker/27:2H+nvmet_tcp_wq]
root 5680 2.9 0.0 0 0 ? I< 03:39 0:23 [kworker/57:2H-nvmet_tcp_wq]
root 5681 1.0 0.0 0 0 ? D< 03:39 0:08 [kworker/60:2H+nvmet_tcp_wq]
root 5682 0.5 0.0 0 0 ? D< 03:39 0:04 [kworker/18:2H+nvmet_tcp_wq]
root 5683 5.8 0.0 0 0 ? D< 03:39 0:45 [kworker/54:2H+nvmet_tcp_wq]
The number of running nvmet_tcp_wq workers keeps increasing once you hit the problem:
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | tail -3
41114 ? D< 0:00 [kworker/25:21H+nvmet_tcp_wq]
41152 ? D< 0:00 [kworker/54:25H+nvmet_tcp_wq]
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvme | grep wq | wc -l
500
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvme | grep wq | wc -l
502
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
503
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
505
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
506
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
511
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
661
Eventually the system runs out of resources.
At some point the load reported by top will reach 2000+ and the system crashes.
So far, I have been unable to determine why the number of nvmet_tcp_wq workers keeps increasing.
It must be that each stalled worker gets replaced by a new one without the old one ever terminating.
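To keep an eye on the growth, the worker count can be sampled with a small helper like this (a sketch; it just greps `ps ax`-style output, so the counting itself can be checked offline):

```shell
# Count nvmet_tcp_wq kworker threads in `ps ax`-style output read from stdin.
count_tcp_workers() { grep -c 'nvmet_tcp_wq'; }

# Live use on the target, sampling every 5 seconds:
#   while sleep 5; do ps ax | count_tcp_workers; done
```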
Thanks,
Mark Ruijter
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: nvme-tcp crashes the system when overloading the backend device.
2021-08-31 13:30 nvme-tcp crashes the system when overloading the backend device Mark Ruijter
@ 2021-09-01 12:49 ` Sagi Grimberg
2021-09-01 14:36 ` Mark Ruijter
0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2021-09-01 12:49 UTC (permalink / raw)
To: Mark Ruijter, linux-nvme
> Hi all,
>
> I can consistently crash a system when I sufficiently overload the nvme-tcp target.
> The easiest way to reproduce the problem is by creating a raid5.
>
> While this R5 is resyncing export it with the nvmet-tcp target driver and start a high queue-depth 4K random fio workload from the initiator.
> At some point the target system will start logging these messages:
> [ 2865.725069] nvmet: ctrl 238 keep-alive timer (15 seconds) expired!
> [ 2865.725072] nvmet: ctrl 236 keep-alive timer (15 seconds) expired!
> [ 2865.725075] nvmet: ctrl 238 fatal error occurred!
> [ 2865.725076] nvmet: ctrl 236 fatal error occurred!
> [ 2865.725080] nvmet: ctrl 237 keep-alive timer (15 seconds) expired!
> [ 2865.725083] nvmet: ctrl 237 fatal error occurred!
> [ 2865.725087] nvmet: ctrl 235 keep-alive timer (15 seconds) expired!
> [ 2865.725094] nvmet: ctrl 235 fatal error occurred!
This is the target not getting keep-alive commands from the initiator,
hence it triggers a controller removal. Btw, what kernel version is this?
While tearing down the controller queues, the target waits for all the
inflight commands to complete and for the final reference on each queue
to drop. I'm wondering if something is preventing these inflight
commands from getting completed by the backend R5 device.
>
> Even when you stop all IO from the initiator some of the nvmet_tcp_wq workers will keep running forever.
That is because the target is waiting for the commands to complete, but
they don't.
> Eventually the system runs out of resources.
> At some point the system will reach a workload of 2000+ and crash.
>
> So far, I have been unable to determine why the number of nvmet_tcp_wq keeps increasing.
> It must be because the current failed worker gets replaced by a new worker without the old being terminated.
Is it possible to check if the R5 device has inflight commands? if not
there is some race condition or misaccounting that prevents an orderly
shutdown of the queues.
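For the inflight check: recent kernels expose per-device counters in /sys/block/<dev>/inflight (one line, "<reads> <writes>"). A small helper, with the parsing split out so it can be exercised on a canned file (the md0 name below is just an example):

```shell
# Sum the read and write in-flight counts from an inflight-format file
# ("<reads> <writes>" on one line, as in /sys/block/<dev>/inflight).
sum_inflight_file() { awk '{ print $1 + $2 }' "$1"; }

# Live use on the target (device name is an example):
#   sum_inflight_file /sys/block/md0/inflight
```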
* Re: nvme-tcp crashes the system when overloading the backend device.
2021-09-01 12:49 ` Sagi Grimberg
@ 2021-09-01 14:36 ` Mark Ruijter
2021-09-01 14:47 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Mark Ruijter @ 2021-09-01 14:36 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Hi Sagi,
I can reproduce this problem with any recent kernel.
At least all these kernels I tested suffer from the problem: 5.10.40, 5.10.57, 5.14-rc4 as well as SuSE SLES15-SP2 with kernel 5.3.18-24.37-default.
On the initiator I use Ubuntu 20.04 LTS with kernel 5.10.0-1019.
>> Is it possible to check if the R5 device has inflight commands? if not
>> there is some race condition or misaccounting that prevents an orderly
>> shutdown of the queues.
I will double check; however, I don't think that the underlying device is the problem.
The exact same test passes with the nvmet-rdma target.
It only fails with the nvmet-tcp target driver.
As far as I can tell, I exhaust the budget in nvmet_tcp_io_work and requeue:
1293 } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
1294
1295 /*
1296 * Requeue the worker if idle deadline period is in progress or any
1297 * ops activity was recorded during the do-while loop above.
1298 */
1299 if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
1300 queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
I added pr_info statements in the code to determine what is going on:
2021-09-01T07:15:26.944067-06:00 gold kernel: [ 5502.786914] nvmet_tcp: MARK exhausted budget: ret = 0, ops = 71
2021-09-01T07:15:26.944070-06:00 gold kernel: [ 5502.787455] nvmet: ctrl 49 keep-alive timer (15 seconds) expired!
2021-09-01T07:15:26.944072-06:00 gold kernel: [ 5502.787461] nvmet: ctrl 49 fatal error occurred!
Shortly after the routine nvmet_fatal_error_handler gets triggered:
static void nvmet_fatal_error_handler(struct work_struct *work)
{
struct nvmet_ctrl *ctrl =
container_of(work, struct nvmet_ctrl, fatal_err_work);
pr_err("ctrl %d fatal error occurred!\n", ctrl->cntlid);
ctrl->ops->delete_ctrl(ctrl);
}
Some of the nvmet_tcp_wq workers now keep running and the number of workers keeps increasing.
root 3686 3.3 0.0 0 0 ? I< 07:31 0:29 [kworker/11:0H-nvmet_tcp_wq]
root 3689 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/25:0H-nvmet_tcp_wq]
root 3695 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/55:3H-nvmet_tcp_wq]
root 3699 5.0 0.0 0 0 ? I< 07:31 0:43 [kworker/38:1H-nvmet_tcp_wq]
root 3704 11.5 0.0 0 0 ? I< 07:31 1:39 [kworker/21:0H-nvmet_tcp_wq]
root 3708 12.1 0.0 0 0 ? I< 07:31 1:44 [kworker/31:0H-nvmet_tcp_wq]
"nvmetcli clear" will no longer return after this, and if you keep the initiators running the system eventually crashes.
--Mark
On 01/09/2021, 14:49, "Sagi Grimberg" <sagi@grimberg.me> wrote:
> Hi all,
>
> I can consistently crash a system when I sufficiently overload the nvme-tcp target.
> The easiest way to reproduce the problem is by creating a raid5.
>
> While this R5 is resyncing export it with the nvmet-tcp target driver and start a high queue-depth 4K random fio workload from the initiator.
> At some point the target system will start logging these messages:
> [ 2865.725069] nvmet: ctrl 238 keep-alive timer (15 seconds) expired!
> [ 2865.725072] nvmet: ctrl 236 keep-alive timer (15 seconds) expired!
> [ 2865.725075] nvmet: ctrl 238 fatal error occurred!
> [ 2865.725076] nvmet: ctrl 236 fatal error occurred!
> [ 2865.725080] nvmet: ctrl 237 keep-alive timer (15 seconds) expired!
> [ 2865.725083] nvmet: ctrl 237 fatal error occurred!
> [ 2865.725087] nvmet: ctrl 235 keep-alive timer (15 seconds) expired!
> [ 2865.725094] nvmet: ctrl 235 fatal error occurred!
This is the target not getting keep-alive commands from the initiator,
hence it triggers a controller removal. Btw, what kernel version is this?
While tearing down the controller queues, the target waits for all the
inflight commands to complete and for the final reference on each queue
to drop. I'm wondering if something is preventing these inflight
commands from getting completed by the backend R5 device.
>
> Even when you stop all IO from the initiator some of the nvmet_tcp_wq workers will keep running forever.
That is because the target is waiting for the commands to complete, but
they don't.
> Eventually the system runs out of resources.
> At some point the system will reach a workload of 2000+ and crash.
>
> So far, I have been unable to determine why the number of nvmet_tcp_wq keeps increasing.
> It must be because the current failed worker gets replaced by a new worker without the old being terminated.
Is it possible to check if the R5 device has inflight commands? if not
there is some race condition or misaccounting that prevents an orderly
shutdown of the queues.
* Re: nvme-tcp crashes the system when overloading the backend device.
2021-09-01 14:36 ` Mark Ruijter
@ 2021-09-01 14:47 ` Sagi Grimberg
2021-09-02 11:31 ` Mark Ruijter
[not found] ` <27377057-5001-4D53-B8D7-889972376F29@primelogic.nl>
0 siblings, 2 replies; 7+ messages in thread
From: Sagi Grimberg @ 2021-09-01 14:47 UTC (permalink / raw)
To: Mark Ruijter, linux-nvme
> Hi Sagi,
>
> I can reproduce this problem with any recent kernel.
> At least all these kernels I tested suffer from the problem: 5.10.40, 5.10.57, 5.14-rc4 as well as SuSE SLES15-SP2 with kernel 5.3.18-24.37-default.
> On the initiator I use Ubuntu 20.04 LTS with kernel 5.10.0-1019.
Thanks.
>>> Is it possible to check if the R5 device has inflight commands? if not
>>> there is some race condition or misaccounting that prevents an orderly
>>> shutdown of the queues.
>
> I will double check; however, I don't think that the underlying device is the problem.
> The exact same test passes with the nvmet-rdma target.
> It only fails with the nvmet-tcp target driver.
OK, that is useful information.
>
> As far as I can tell, I exhaust the budget in nvmet_tcp_io_work and requeue:
>
> 1293 } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
> 1294
> 1295 /*
> 1296 * Requeue the worker if idle deadline period is in progress or any
> 1297 * ops activity was recorded during the do-while loop above.
> 1298 */
> 1299 if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
> 1300 queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>
> I added pr_info statements in the code to determine what is going on:
> 2021-09-01T07:15:26.944067-06:00 gold kernel: [ 5502.786914] nvmet_tcp: MARK exhausted budget: ret = 0, ops = 71
> 2021-09-01T07:15:26.944070-06:00 gold kernel: [ 5502.787455] nvmet: ctrl 49 keep-alive timer (15 seconds) expired!
> 2021-09-01T07:15:26.944072-06:00 gold kernel: [ 5502.787461] nvmet: ctrl 49 fatal error occurred!
>
> Shortly after the routine nvmet_fatal_error_handler gets triggered:
> static void nvmet_fatal_error_handler(struct work_struct *work)
> {
> struct nvmet_ctrl *ctrl =
> container_of(work, struct nvmet_ctrl, fatal_err_work);
>
> pr_err("ctrl %d fatal error occurred!\n", ctrl->cntlid);
> ctrl->ops->delete_ctrl(ctrl);
> }
>
> Some of the nvmet_tcp_wq workers now keep running and the number of workers keeps increasing.
> root 3686 3.3 0.0 0 0 ? I< 07:31 0:29 [kworker/11:0H-nvmet_tcp_wq]
> root 3689 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/25:0H-nvmet_tcp_wq]
> root 3695 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/55:3H-nvmet_tcp_wq]
> root 3699 5.0 0.0 0 0 ? I< 07:31 0:43 [kworker/38:1H-nvmet_tcp_wq]
> root 3704 11.5 0.0 0 0 ? I< 07:31 1:39 [kworker/21:0H-nvmet_tcp_wq]
> root 3708 12.1 0.0 0 0 ? I< 07:31 1:44 [kworker/31:0H-nvmet_tcp_wq]
>
> "nvmetcli clear" will no longer return after this and when you keep the initiators running the system eventually crashes.
>
OK, so maybe some information can help. When you reproduce this for the
first time I would dump all the threads in the system to dmesg.
So if you can do the following:
1. reproduce the hang
2. nvmetcli clear
3. echo t > /proc/sysrq-trigger
And share the dmesg output with us?
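Also, before triggering the sysrq dump it may help to list which nvmet_tcp_wq workers are in uninterruptible sleep (state D). A sketch that parses `ps aux`-style output on stdin (the field positions assume the standard `ps aux` layout):

```shell
# List nvmet_tcp_wq kworkers stuck in uninterruptible sleep (state D) from
# `ps aux`-style input on stdin: state is field 8, command is field 11.
blocked_tcp_workers() { awk '$8 ~ /^D/ && $11 ~ /nvmet_tcp_wq/ { print $2, $11 }'; }

# Live use: ps aux | blocked_tcp_workers
```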
* Re: nvme-tcp crashes the system when overloading the backend device.
2021-09-01 14:47 ` Sagi Grimberg
@ 2021-09-02 11:31 ` Mark Ruijter
[not found] ` <27377057-5001-4D53-B8D7-889972376F29@primelogic.nl>
1 sibling, 0 replies; 7+ messages in thread
From: Mark Ruijter @ 2021-09-02 11:31 UTC (permalink / raw)
To: linux-nvme
[-- Attachment #1: Type: text/plain, Size: 7396 bytes --]
Hi Sagi,
I installed a recent kernel on the system and restarted the test.
The kernel is: 5.10.57
Before the kernel would crash, I stopped the test by disconnecting the initiators.
"nvmetcli clear" did not hang and in this case, it still managed to remove the configuration.
# ls -l /sys/kernel/config/nvmet/ports/
total 0
# ls -l /sys/kernel/config/nvmet/subsystems/
total 0
However, after this I still see nvmet_tcp_wq workers that are actively running:
# ps aux | grep nvmet
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 163 0.4 0.0 0 0 ? I< 02:28 0:21 [kworker/24:0H-nvmet_tcp_wq]
root 199 0.0 0.0 0 0 ? I< 02:28 0:02 [kworker/30:0H-nvmet_tcp_wq]
root 355 0.0 0.0 0 0 ? I< 02:28 0:04 [kworker/56:0H-nvmet_tcp_wq]
root 361 0.0 0.0 0 0 ? I< 02:28 0:03 [kworker/57:0H-nvmet_tcp_wq]
root 683 1.2 0.0 0 0 ? D< 02:30 0:56 [kworker/53:1H+nvmet_tcp_wq]
root 785 1.0 0.0 0 0 ? D< 02:30 0:44 [kworker/59:1H+nvmet_tcp_wq]
root 1200 0.1 0.0 0 0 ? D< 02:30 0:08 [kworker/27:1H+nvmet_tcp_wq]
root 29212 2.0 0.0 0 0 ? I< 03:28 0:18 [kworker/31:2H-nvmet_tcp_wq]
root 32691 0.0 0.0 0 0 ? I< 02:31 0:00 [nvmet-buffered-]
root 39437 5.3 0.0 0 0 ? D< 03:32 0:35 [kworker/51:3H+nvmet_tcp_wq]
root 39440 1.8 0.0 0 0 ? I< 03:32 0:12 [kworker/59:3H-nvmet_tcp_wq]
root 39458 13.3 0.0 0 0 ? I< 03:32 1:28 [kworker/18:3H-nvmet_tcp_wq]
root 39508 7.1 0.0 0 0 ? D< 03:32 0:47 [kworker/53:4H+nvmet_tcp_wq]
root 39511 2.7 0.0 0 0 ? D< 03:32 0:17 [kworker/28:5H+nvmet_tcp_wq]
root 39520 7.7 0.0 0 0 ? D< 03:32 0:51 [kworker/52:3H+nvmet_tcp_wq]
root 39855 4.3 0.0 0 0 ? I< 03:32 0:28 [kworker/48:4H-nvmet_tcp_wq]
root 39857 3.0 0.0 0 0 ? D< 03:32 0:20 [kworker/28:7H+nvmet_tcp_wq]
root 39902 6.3 0.0 0 0 ? D< 03:32 0:41 [kworker/27:6H+nvmet_tcp_wq]
root 39928 5.3 0.0 0 0 ? D< 03:32 0:35 [kworker/25:9H+nvmet_tcp_wq]
root 39963 8.8 0.0 0 0 ? D< 03:32 0:57 [kworker/24:6H+nvmet_tcp_wq]
root 40024 3.3 0.0 0 0 ? I< 03:32 0:21 [kworker/28:9H-nvmet_tcp_wq]
root 40087 6.3 0.0 0 0 ? I< 03:32 0:41 [kworker/53:6H-nvmet_tcp_wq]
root 40169 6.1 0.0 0 0 ? D< 03:32 0:40 [kworker/59:5H+nvmet_tcp_wq]
root 40201 3.5 0.0 0 0 ? D< 03:32 0:23 [kworker/54:8H+nvmet_tcp_wq]
root 40333 0.6 0.0 0 0 ? D< 03:32 0:04 [kworker/59:7H+nvmet_tcp_wq]
root 40371 0.4 0.0 0 0 ? I< 03:32 0:03 [kworker/49:5H-nvmet_tcp_wq]
root 40375 2.5 0.0 0 0 ? I< 03:32 0:16 [kworker/20:8H-nvmet_tcp_wq]
root 40517 0.4 0.0 0 0 ? I< 03:32 0:02 [kworker/58:6H-nvmet_tcp_wq]
root 40811 2.8 0.0 0 0 ? D< 03:33 0:17 [kworker/51:9H+nvmet_tcp_wq]
root 40864 1.5 0.0 0 0 ? I< 03:33 0:09 [kworker/29:5H-nvmet_tcp_wq]
root 40891 1.7 0.0 0 0 ? I< 03:33 0:10 [kworker/17:9H-nvmet_tcp_wq]
root 40902 4.3 0.0 0 0 ? D< 03:33 0:25 [kworker/59:8H+nvmet_tcp_wq]
root 41061 3.3 0.0 0 0 ? I< 03:34 0:18 [kworker/51:10H-nvmet_tcp_wq]
root 41145 2.6 0.0 0 0 ? D< 03:34 0:14 [kworker/56:7H+nvmet_tcp_wq]
root 41278 1.3 0.0 0 0 ? I< 03:34 0:07 [kworker/22:9H-nvmet_tcp_wq]
I've attached dmesg.txt as requested.
Thanks,
On 01/09/2021, 16:47, "Sagi Grimberg" <sagi@grimberg.me> wrote:
> Hi Sagi,
>
> I can reproduce this problem with any recent kernel.
> At least all these kernels I tested suffer from the problem: 5.10.40, 5.10.57, 5.14-rc4 as well as SuSE SLES15-SP2 with kernel 5.3.18-24.37-default.
> On the initiator I use Ubuntu 20.04 LTS with kernel 5.10.0-1019.
Thanks.
>>> Is it possible to check if the R5 device has inflight commands? if not
> there is some race condition or misaccounting that prevents an orderly
> shutdown of the queues.
>
> I will double check; however, I don't think that the underlying device is the problem.
> The exact same test passes with the nvmet-rdma target.
> It only fails with the nvmet-tcp target driver.
OK, that is useful information.
>
> As far as I can tell, I exhaust the budget in nvmet_tcp_io_work and requeue:
>
> 1293 } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
> 1294
> 1295 /*
> 1296 * Requeue the worker if idle deadline period is in progress or any
> 1297 * ops activity was recorded during the do-while loop above.
> 1298 */
> 1299 if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
> 1300 queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>
> I added pr_info statements in the code to determine what is going on:
> 2021-09-01T07:15:26.944067-06:00 gold kernel: [ 5502.786914] nvmet_tcp: MARK exhausted budget: ret = 0, ops = 71
> 2021-09-01T07:15:26.944070-06:00 gold kernel: [ 5502.787455] nvmet: ctrl 49 keep-alive timer (15 seconds) expired!
> 2021-09-01T07:15:26.944072-06:00 gold kernel: [ 5502.787461] nvmet: ctrl 49 fatal error occurred!
>
> Shortly after the routine nvmet_fatal_error_handler gets triggered:
> static void nvmet_fatal_error_handler(struct work_struct *work)
> {
> struct nvmet_ctrl *ctrl =
> container_of(work, struct nvmet_ctrl, fatal_err_work);
>
> pr_err("ctrl %d fatal error occurred!\n", ctrl->cntlid);
> ctrl->ops->delete_ctrl(ctrl);
> }
>
> Some of the nvmet_tcp_wq workers now keep running and the number of workers keeps increasing.
> root 3686 3.3 0.0 0 0 ? I< 07:31 0:29 [kworker/11:0H-nvmet_tcp_wq]
> root 3689 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/25:0H-nvmet_tcp_wq]
> root 3695 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/55:3H-nvmet_tcp_wq]
> root 3699 5.0 0.0 0 0 ? I< 07:31 0:43 [kworker/38:1H-nvmet_tcp_wq]
> root 3704 11.5 0.0 0 0 ? I< 07:31 1:39 [kworker/21:0H-nvmet_tcp_wq]
> root 3708 12.1 0.0 0 0 ? I< 07:31 1:44 [kworker/31:0H-nvmet_tcp_wq]
>
> "nvmetcli clear" will no longer return after this and when you keep the initiators running the system eventually crashes.
>
OK, so maybe some information can help. When you reproduce this for the
first time I would dump all the threads in the system to dmesg.
So if you can do the following:
1. reproduce the hang
2. nvmetcli clear
3. echo t > /proc/sysrq-trigger
And share the dmesg output with us?
[-- Attachment #2: dmesg.txt.xz --]
[-- Type: application/octet-stream, Size: 58372 bytes --]
[-- Attachment #3: Type: text/plain, Size: 158 bytes --]
* Re: nvme-tcp crashes the system when overloading the backend device.
[not found] ` <27377057-5001-4D53-B8D7-889972376F29@primelogic.nl>
@ 2021-09-06 11:12 ` Sagi Grimberg
2021-09-06 12:25 ` Mark Ruijter
0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2021-09-06 11:12 UTC (permalink / raw)
To: Mark Ruijter, linux-nvme
> Hi Sagi,
>
> I installed a recent kernel on the system and restarted the test.
> The kernel is: 5.10.57
>
> Before the kernel would crash, I stopped the test by disconnecting the initiators.
> "nvmetcli clear" did not hang and in this case, it still managed to remove the configuration.
>
> # ls -l /sys/kernel/config/nvmet/ports/
> total 0
> # ls -l /sys/kernel/config/nvmet/subsystems/
> total 0
>
> However, after this I still see nvmet_tcp_wq workers that are actively running:
> # ps aux | grep nvmet
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 163 0.4 0.0 0 0 ? I< 02:28 0:21 [kworker/24:0H-nvmet_tcp_wq]
> root 199 0.0 0.0 0 0 ? I< 02:28 0:02 [kworker/30:0H-nvmet_tcp_wq]
> root 355 0.0 0.0 0 0 ? I< 02:28 0:04 [kworker/56:0H-nvmet_tcp_wq]
> root 361 0.0 0.0 0 0 ? I< 02:28 0:03 [kworker/57:0H-nvmet_tcp_wq]
> root 683 1.2 0.0 0 0 ? D< 02:30 0:56 [kworker/53:1H+nvmet_tcp_wq]
> root 785 1.0 0.0 0 0 ? D< 02:30 0:44 [kworker/59:1H+nvmet_tcp_wq]
> root 1200 0.1 0.0 0 0 ? D< 02:30 0:08 [kworker/27:1H+nvmet_tcp_wq]
> root 29212 2.0 0.0 0 0 ? I< 03:28 0:18 [kworker/31:2H-nvmet_tcp_wq]
> root 32691 0.0 0.0 0 0 ? I< 02:31 0:00 [nvmet-buffered-]
> root 39437 5.3 0.0 0 0 ? D< 03:32 0:35 [kworker/51:3H+nvmet_tcp_wq]
> root 39440 1.8 0.0 0 0 ? I< 03:32 0:12 [kworker/59:3H-nvmet_tcp_wq]
> root 39458 13.3 0.0 0 0 ? I< 03:32 1:28 [kworker/18:3H-nvmet_tcp_wq]
> root 39508 7.1 0.0 0 0 ? D< 03:32 0:47 [kworker/53:4H+nvmet_tcp_wq]
> root 39511 2.7 0.0 0 0 ? D< 03:32 0:17 [kworker/28:5H+nvmet_tcp_wq]
> root 39520 7.7 0.0 0 0 ? D< 03:32 0:51 [kworker/52:3H+nvmet_tcp_wq]
> root 39855 4.3 0.0 0 0 ? I< 03:32 0:28 [kworker/48:4H-nvmet_tcp_wq]
> root 39857 3.0 0.0 0 0 ? D< 03:32 0:20 [kworker/28:7H+nvmet_tcp_wq]
> root 39902 6.3 0.0 0 0 ? D< 03:32 0:41 [kworker/27:6H+nvmet_tcp_wq]
> root 39928 5.3 0.0 0 0 ? D< 03:32 0:35 [kworker/25:9H+nvmet_tcp_wq]
> root 39963 8.8 0.0 0 0 ? D< 03:32 0:57 [kworker/24:6H+nvmet_tcp_wq]
> root 40024 3.3 0.0 0 0 ? I< 03:32 0:21 [kworker/28:9H-nvmet_tcp_wq]
> root 40087 6.3 0.0 0 0 ? I< 03:32 0:41 [kworker/53:6H-nvmet_tcp_wq]
> root 40169 6.1 0.0 0 0 ? D< 03:32 0:40 [kworker/59:5H+nvmet_tcp_wq]
> root 40201 3.5 0.0 0 0 ? D< 03:32 0:23 [kworker/54:8H+nvmet_tcp_wq]
> root 40333 0.6 0.0 0 0 ? D< 03:32 0:04 [kworker/59:7H+nvmet_tcp_wq]
> root 40371 0.4 0.0 0 0 ? I< 03:32 0:03 [kworker/49:5H-nvmet_tcp_wq]
> root 40375 2.5 0.0 0 0 ? I< 03:32 0:16 [kworker/20:8H-nvmet_tcp_wq]
> root 40517 0.4 0.0 0 0 ? I< 03:32 0:02 [kworker/58:6H-nvmet_tcp_wq]
> root 40811 2.8 0.0 0 0 ? D< 03:33 0:17 [kworker/51:9H+nvmet_tcp_wq]
> root 40864 1.5 0.0 0 0 ? I< 03:33 0:09 [kworker/29:5H-nvmet_tcp_wq]
> root 40891 1.7 0.0 0 0 ? I< 03:33 0:10 [kworker/17:9H-nvmet_tcp_wq]
> root 40902 4.3 0.0 0 0 ? D< 03:33 0:25 [kworker/59:8H+nvmet_tcp_wq]
> root 41061 3.3 0.0 0 0 ? I< 03:34 0:18 [kworker/51:10H-nvmet_tcp_wq]
> root 41145 2.6 0.0 0 0 ? D< 03:34 0:14 [kworker/56:7H+nvmet_tcp_wq]
> root 41278 1.3 0.0 0 0 ? I< 03:34 0:07 [kworker/22:9H-nvmet_tcp_wq]
>
> I've attached dmesg.txt as requested.
The dmesg output seems incomplete; I'm missing the nvmet-tcp threads
that are blocked. Do you have this output in the journal that you can share?
* Re: nvme-tcp crashes the system when overloading the backend device.
2021-09-06 11:12 ` Sagi Grimberg
@ 2021-09-06 12:25 ` Mark Ruijter
0 siblings, 0 replies; 7+ messages in thread
From: Mark Ruijter @ 2021-09-06 12:25 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
[-- Attachment #1: Type: text/plain, Size: 4642 bytes --]
This is all the output I could still find from the test.
If you are still missing info then let me know. I will need to re-run the test in that case.
--Mark
On 06/09/2021, 13:12, "Sagi Grimberg" <sagi@grimberg.me> wrote:
> Hi Sagi,
>
> I installed a recent kernel on the system and restarted the test.
> The kernel is: 5.10.57
>
> Before the kernel would crash, I stopped the test by disconnecting the initiators.
> "nvmetcli clear" did not hang and in this case, it still managed to remove the configuration.
>
> # ls -l /sys/kernel/config/nvmet/ports/
> total 0
> # ls -l /sys/kernel/config/nvmet/subsystems/
> total 0
>
> However, after this I still see nvmet_tcp_wq workers that are actively running:
> # ps aux | grep nvmet
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 163 0.4 0.0 0 0 ? I< 02:28 0:21 [kworker/24:0H-nvmet_tcp_wq]
> root 199 0.0 0.0 0 0 ? I< 02:28 0:02 [kworker/30:0H-nvmet_tcp_wq]
> root 355 0.0 0.0 0 0 ? I< 02:28 0:04 [kworker/56:0H-nvmet_tcp_wq]
> root 361 0.0 0.0 0 0 ? I< 02:28 0:03 [kworker/57:0H-nvmet_tcp_wq]
> root 683 1.2 0.0 0 0 ? D< 02:30 0:56 [kworker/53:1H+nvmet_tcp_wq]
> root 785 1.0 0.0 0 0 ? D< 02:30 0:44 [kworker/59:1H+nvmet_tcp_wq]
> root 1200 0.1 0.0 0 0 ? D< 02:30 0:08 [kworker/27:1H+nvmet_tcp_wq]
> root 29212 2.0 0.0 0 0 ? I< 03:28 0:18 [kworker/31:2H-nvmet_tcp_wq]
> root 32691 0.0 0.0 0 0 ? I< 02:31 0:00 [nvmet-buffered-]
> root 39437 5.3 0.0 0 0 ? D< 03:32 0:35 [kworker/51:3H+nvmet_tcp_wq]
> root 39440 1.8 0.0 0 0 ? I< 03:32 0:12 [kworker/59:3H-nvmet_tcp_wq]
> root 39458 13.3 0.0 0 0 ? I< 03:32 1:28 [kworker/18:3H-nvmet_tcp_wq]
> root 39508 7.1 0.0 0 0 ? D< 03:32 0:47 [kworker/53:4H+nvmet_tcp_wq]
> root 39511 2.7 0.0 0 0 ? D< 03:32 0:17 [kworker/28:5H+nvmet_tcp_wq]
> root 39520 7.7 0.0 0 0 ? D< 03:32 0:51 [kworker/52:3H+nvmet_tcp_wq]
> root 39855 4.3 0.0 0 0 ? I< 03:32 0:28 [kworker/48:4H-nvmet_tcp_wq]
> root 39857 3.0 0.0 0 0 ? D< 03:32 0:20 [kworker/28:7H+nvmet_tcp_wq]
> root 39902 6.3 0.0 0 0 ? D< 03:32 0:41 [kworker/27:6H+nvmet_tcp_wq]
> root 39928 5.3 0.0 0 0 ? D< 03:32 0:35 [kworker/25:9H+nvmet_tcp_wq]
> root 39963 8.8 0.0 0 0 ? D< 03:32 0:57 [kworker/24:6H+nvmet_tcp_wq]
> root 40024 3.3 0.0 0 0 ? I< 03:32 0:21 [kworker/28:9H-nvmet_tcp_wq]
> root 40087 6.3 0.0 0 0 ? I< 03:32 0:41 [kworker/53:6H-nvmet_tcp_wq]
> root 40169 6.1 0.0 0 0 ? D< 03:32 0:40 [kworker/59:5H+nvmet_tcp_wq]
> root 40201 3.5 0.0 0 0 ? D< 03:32 0:23 [kworker/54:8H+nvmet_tcp_wq]
> root 40333 0.6 0.0 0 0 ? D< 03:32 0:04 [kworker/59:7H+nvmet_tcp_wq]
> root 40371 0.4 0.0 0 0 ? I< 03:32 0:03 [kworker/49:5H-nvmet_tcp_wq]
> root 40375 2.5 0.0 0 0 ? I< 03:32 0:16 [kworker/20:8H-nvmet_tcp_wq]
> root 40517 0.4 0.0 0 0 ? I< 03:32 0:02 [kworker/58:6H-nvmet_tcp_wq]
> root 40811 2.8 0.0 0 0 ? D< 03:33 0:17 [kworker/51:9H+nvmet_tcp_wq]
> root 40864 1.5 0.0 0 0 ? I< 03:33 0:09 [kworker/29:5H-nvmet_tcp_wq]
> root 40891 1.7 0.0 0 0 ? I< 03:33 0:10 [kworker/17:9H-nvmet_tcp_wq]
> root 40902 4.3 0.0 0 0 ? D< 03:33 0:25 [kworker/59:8H+nvmet_tcp_wq]
> root 41061 3.3 0.0 0 0 ? I< 03:34 0:18 [kworker/51:10H-nvmet_tcp_wq]
> root 41145 2.6 0.0 0 0 ? D< 03:34 0:14 [kworker/56:7H+nvmet_tcp_wq]
> root 41278 1.3 0.0 0 0 ? I< 03:34 0:07 [kworker/22:9H-nvmet_tcp_wq]
>
> I've attached dmesg.txt as requested.
The dmesg output seems incomplete; I'm missing the nvmet-tcp threads
that are blocked. Do you have this output in the journal that you can share?
[-- Attachment #2: messages.xz --]
[-- Type: application/octet-stream, Size: 196640 bytes --]
[-- Attachment #3: warn.xz --]
[-- Type: application/octet-stream, Size: 61552 bytes --]
[-- Attachment #4: Type: text/plain, Size: 158 bytes --]
end of thread, other threads:[~2021-09-06 12:25 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-31 13:30 nvme-tcp crashes the system when overloading the backend device Mark Ruijter
2021-09-01 12:49 ` Sagi Grimberg
2021-09-01 14:36 ` Mark Ruijter
2021-09-01 14:47 ` Sagi Grimberg
2021-09-02 11:31 ` Mark Ruijter
[not found] ` <27377057-5001-4D53-B8D7-889972376F29@primelogic.nl>
2021-09-06 11:12 ` Sagi Grimberg
2021-09-06 12:25 ` Mark Ruijter