From: Sagi Grimberg <sagi@grimberg.me>
To: Mark Ruijter <mruijter@primelogic.nl>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: nvme-tcp crashes the system when overloading the backend device.
Date: Wed, 1 Sep 2021 17:47:12 +0300	[thread overview]
Message-ID: <06beadce-b135-f1ac-094a-8262ea37fe86@grimberg.me> (raw)
In-Reply-To: <2E0B97E4-8182-4EF5-8D90-DA4998C58295@primelogic.nl>


> Hi Sagi,
> 
> I can reproduce this problem with any recent kernel.
> At least all these kernels I tested suffer from the problem: 5.10.40, 5.10.57, 5.14-rc4 as well as SuSE SLES15-SP2 with kernel 5.3.18-24.37-default.
> On the initiator I use Ubuntu 20.04 LTS with kernel 5.10.0-1019.

Thanks.

>> Is it possible to check if the R5 device has inflight commands? if not
>> there is some race condition or misaccounting that prevents an orderly
>> shutdown of the queues.
> 
> I will double check; however, I don't think that the underlying device is the problem.
> The exact same test passes with the nvmet-rdma target.
> It only fails with the nvmet-tcp target driver.

OK, that is useful information.

> 
> As far as I can tell I exhaust the budget in nvmet_tcp_io_work and requeue:
> 
>         } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
> 
>         /*
>          * Requeue the worker if idle deadline period is in progress or any
>          * ops activity was recorded during the do-while loop above.
>          */
>         if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
>                 queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
> 
> I added pr_info statements in the code to determine what is going on:
> 2021-09-01T07:15:26.944067-06:00 gold kernel: [ 5502.786914] nvmet_tcp: MARK exhausted budget: ret = 0, ops = 71
> 2021-09-01T07:15:26.944070-06:00 gold kernel: [ 5502.787455] nvmet: ctrl 49 keep-alive timer (15 seconds) expired!
> 2021-09-01T07:15:26.944072-06:00 gold kernel: [ 5502.787461] nvmet: ctrl 49 fatal error occurred!
> 
> Shortly after, the routine nvmet_fatal_error_handler gets triggered:
> static void nvmet_fatal_error_handler(struct work_struct *work)
> {
>          struct nvmet_ctrl *ctrl =
>                          container_of(work, struct nvmet_ctrl, fatal_err_work);
> 
>          pr_err("ctrl %d fatal error occurred!\n", ctrl->cntlid);
>          ctrl->ops->delete_ctrl(ctrl);
> }
> 
> Some of the nvmet_tcp_wq workers now keep running, and their number keeps increasing:
> root      3686  3.3  0.0      0     0 ?        I<   07:31   0:29 [kworker/11:0H-nvmet_tcp_wq]
> root      3689 12.0  0.0      0     0 ?        I<   07:31   1:43 [kworker/25:0H-nvmet_tcp_wq]
> root      3695 12.0  0.0      0     0 ?        I<   07:31   1:43 [kworker/55:3H-nvmet_tcp_wq]
> root      3699  5.0  0.0      0     0 ?        I<   07:31   0:43 [kworker/38:1H-nvmet_tcp_wq]
> root      3704 11.5  0.0      0     0 ?        I<   07:31   1:39 [kworker/21:0H-nvmet_tcp_wq]
> root      3708 12.1  0.0      0     0 ?        I<   07:31   1:44 [kworker/31:0H-nvmet_tcp_wq]
> 
> "nvmetcli clear" no longer returns after this, and if you keep the initiators running, the system eventually crashes.
> 

OK, so maybe some more information can help. When you reproduce this for
the first time, I would dump all the threads in the system to dmesg.

So if you can do the following:
1. reproduce the hang
2. nvmetcli clear
3. echo t > /proc/sysrq-trigger

And share the dmesg output with us?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


Thread overview: 7+ messages
2021-08-31 13:30 nvme-tcp crashes the system when overloading the backend device Mark Ruijter
2021-09-01 12:49 ` Sagi Grimberg
2021-09-01 14:36   ` Mark Ruijter
2021-09-01 14:47     ` Sagi Grimberg [this message]
2021-09-02 11:31       ` Mark Ruijter
     [not found]       ` <27377057-5001-4D53-B8D7-889972376F29@primelogic.nl>
2021-09-06 11:12         ` Sagi Grimberg
2021-09-06 12:25           ` Mark Ruijter
