From: Mark Ruijter <mruijter@primelogic.nl>
To: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: nvme-tcp crashes the system when overloading the backend device.
Date: Tue, 31 Aug 2021 13:30:51 +0000
Message-ID: <1A17C9D4-327C-45D6-B7BB-D69AEB169BBD@primelogic.nl>

Hi all,

I can consistently crash a system when I sufficiently overload the nvme-tcp target.
The easiest way to reproduce the problem is by creating a raid5 array.
While this raid5 is still resyncing, export it with the nvmet-tcp target driver and start a high queue-depth 4K random fio workload from the initiator.
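
For reference, the setup looks roughly like this; the device names, IP address, subsystem NQN and fio parameters are illustrative, not my exact configuration:

# on the target
modprobe nvmet-tcp
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkdir /sys/kernel/config/nvmet/subsystems/testnqn
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
echo -n /dev/md0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
mkdir /sys/kernel/config/nvmet/ports/1
echo tcp > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo 192.168.1.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/testnqn

# on the initiator
nvme connect -t tcp -a 192.168.1.10 -s 4420 -n testnqn
fio --name=randrw --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randrw --bs=4k --iodepth=128 --numjobs=16 --time_based --runtime=600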
At some point the target system will start logging these messages:
[ 2865.725069] nvmet: ctrl 238 keep-alive timer (15 seconds) expired!
[ 2865.725072] nvmet: ctrl 236 keep-alive timer (15 seconds) expired!
[ 2865.725075] nvmet: ctrl 238 fatal error occurred!
[ 2865.725076] nvmet: ctrl 236 fatal error occurred!
[ 2865.725080] nvmet: ctrl 237 keep-alive timer (15 seconds) expired!
[ 2865.725083] nvmet: ctrl 237 fatal error occurred!
[ 2865.725087] nvmet: ctrl 235 keep-alive timer (15 seconds) expired!
[ 2865.725094] nvmet: ctrl 235 fatal error occurred!
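
As far as I can tell, the 15 seconds in those messages is the keep-alive timeout that was negotiated at connect time. Raising it on the initiator side only delays the symptom, but it can make the build-up easier to observe; nvme-cli exposes it on connect, e.g. (same illustrative address/NQN as above):

nvme connect -t tcp -a 192.168.1.10 -s 4420 -n testnqn --keep-alive-tmo=60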

Even when you stop all I/O from the initiator, some of the nvmet_tcp_wq workers keep running forever.
The load shown by "top" never returns to the normal idle level.

root      5669  1.1  0.0      0     0 ?        D<   03:39   0:09 [kworker/22:2H+nvmet_tcp_wq]
root      5670  0.8  0.0      0     0 ?        D<   03:39   0:06 [kworker/55:2H+nvmet_tcp_wq]
root      5676  0.2  0.0      0     0 ?        D<   03:39   0:01 [kworker/29:2H+nvmet_tcp_wq]
root      5677 12.2  0.0      0     0 ?        D<   03:39   1:35 [kworker/59:2H+nvmet_tcp_wq]
root      5679  5.7  0.0      0     0 ?        D<   03:39   0:44 [kworker/27:2H+nvmet_tcp_wq]
root      5680  2.9  0.0      0     0 ?        I<   03:39   0:23 [kworker/57:2H-nvmet_tcp_wq]
root      5681  1.0  0.0      0     0 ?        D<   03:39   0:08 [kworker/60:2H+nvmet_tcp_wq]
root      5682  0.5  0.0      0     0 ?        D<   03:39   0:04 [kworker/18:2H+nvmet_tcp_wq]
root      5683  5.8  0.0      0     0 ?        D<   03:39   0:45 [kworker/54:2H+nvmet_tcp_wq]
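
In case it helps with debugging, the stacks of those D-state workers can be dumped to see where they are blocked (the pid is taken from the listing above; the sysrq output ends up in dmesg):

cat /proc/5669/stack          # stack of one stuck worker
echo w > /proc/sysrq-trigger  # dump all blocked (D state) tasks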

The number of running nvmet_tcp_wq workers keeps increasing once you hit the problem:

gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | tail -3
41114 ?        D<     0:00 [kworker/25:21H+nvmet_tcp_wq]
41152 ?        D<     0:00 [kworker/54:25H+nvmet_tcp_wq]

gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvme | grep wq | wc -l
500
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvme | grep wq | wc -l
502
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
503
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
505
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
506
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
511
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
661

Eventually the system runs out of resources.
At some point the load average reaches 2000+ and the system crashes.

So far, I have been unable to determine why the number of nvmet_tcp_wq workers keeps increasing.
Presumably each blocked worker gets replaced by a new one without the old one ever terminating.
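
As far as I understand concurrency-managed workqueues, that would be expected: when a running worker blocks on I/O, the pool spawns a replacement so other queued work can still make progress, so a backend that never completes its I/O keeps inflating the pool. If the kernel is built with CONFIG_WQ_WATCHDOG, the workqueue watchdog should at least report such stalls; something like:

workqueue.watchdog_thresh=30                                # on the kernel command line, or
echo 30 > /sys/module/workqueue/parameters/watchdog_thresh  # at runtime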

Thanks,

Mark Ruijter





Thread overview: 7+ messages
2021-08-31 13:30 Mark Ruijter [this message]
2021-09-01 12:49 ` nvme-tcp crashes the system when overloading the backend device Sagi Grimberg
2021-09-01 14:36   ` Mark Ruijter
2021-09-01 14:47     ` Sagi Grimberg
2021-09-02 11:31       ` Mark Ruijter
     [not found]       ` <27377057-5001-4D53-B8D7-889972376F29@primelogic.nl>
2021-09-06 11:12         ` Sagi Grimberg
2021-09-06 12:25           ` Mark Ruijter
