From: swise@opengridcomputing.com (Steve Wise)
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
Date: Fri, 9 Sep 2016 10:50:45 -0500
Message-ID: <00ab01d20ab1$ed212ff0$c7638fd0$@opengridcomputing.com>
In-Reply-To: <021501d20a19$327ba5b0$9772f110$@opengridcomputing.com>

> > So with this patch, the crash is a little different.  One thread is in the
> > usual place crashed in nvme_rdma_post_send() called by
> > nvme_rdma_queue_rq() because the qp and cm_id in the nvme_rdma_queue have
> > been freed.   Actually there are a handful of CPUs processing different
> > requests in the same type stack trace.  But perhaps that is expected given
> > the work load and number of controllers (10) and queues (32 per
> > controller)...
> >
> > I also see another worker thread here:
> >
> > PID: 3769   TASK: ffff880e18972f40  CPU: 3   COMMAND: "kworker/3:3"
> >  #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
> >  #1 [ffff880e2f793930] schedule at ffffffff816dff00
> >  #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
> >  #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
> >  #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
> >  #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
> >  #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
> >  #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b [nvme_rdma]
> >  #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083 [nvme_rdma]
> >  #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f [nvme_rdma]
> > #10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
> > #11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
> > #12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
> > #13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf
> >
> > I'm trying to identify if this reconnect is for the same controller that
> > crashed processing a request.  It probably is, but I need to search the
> > stack frame to try and find the controller pointer...
> >
> > Steve.
> 
> Both the thread that crashed and the thread doing reconnect are operating on
> ctrl "nvme2"...

I'm reanalyzing the crash dump for this particular crash, and I've found the
blk_mq_hw_ctx struct whose ->driver_data points to the nvme_rdma_queue that
caused the crash.  hctx->state, though, is 2, which is the BLK_MQ_S_TAG_ACTIVE
bit.  I.e., the BLK_MQ_S_STOPPED bit is _not_ set!
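
To make the decoding of hctx->state == 2 concrete, here is a minimal userspace
sketch.  It is not kernel code: the sketch_* names are hypothetical, and the
bit numbers (STOPPED = 0, TAG_ACTIVE = 1) are an assumption taken from the
BLK_MQ_S_* enum in include/linux/blk-mq.h of kernels from this period, so
check them against the tree the dump came from.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Assumed bit numbers, mirroring BLK_MQ_S_STOPPED and BLK_MQ_S_TAG_ACTIVE.
 * These are bit positions within hctx->state, not masks.
 */
enum sketch_hctx_state_bit {
	SKETCH_S_STOPPED    = 0,
	SKETCH_S_TAG_ACTIVE = 1,
};

/* Userspace stand-in for test_bit(): returns 1 if bit `bit` is set in `state`. */
static int sketch_test_bit(unsigned int bit, unsigned long state)
{
	assert(bit < sizeof(state) * CHAR_BIT);
	return (int)((state >> bit) & 1UL);
}

/* Print which of the two flags are set in a raw hctx->state value. */
static void sketch_print_hctx_state(unsigned long state)
{
	printf("state=%#lx STOPPED=%d TAG_ACTIVE=%d\n", state,
	       sketch_test_bit(SKETCH_S_STOPPED, state),
	       sketch_test_bit(SKETCH_S_TAG_ACTIVE, state));
}
```

Under those assumptions, sketch_print_hctx_state(2) reports STOPPED=0 and
TAG_ACTIVE=1, which matches what the dump shows: the hardware queue was not
marked stopped when nvme_rdma_queue_rq() ran.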

Attached are the blk_mq_hw_ctx, nvme_rdma_queue, and nvme_rdma_ctrl structs, as
well as the nvme_rdma_request, request, and request_queue structs if you want to
have a look...

Steve.


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crash_analysis.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20160909/6979423a/attachment-0001.txt>
