From: sagi@grimberg.me (Sagi Grimberg)
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
Date: Thu, 8 Sep 2016 10:45:35 +0300
Message-ID: <7f09e373-6316-26a3-ae81-dab1205d88ab@grimberg.me>
In-Reply-To: <039401d2094c$084d64e0$18e82ea0$@opengridcomputing.com>


>> Does this happen if you change the reconnect delay to be something
>> different than 10 seconds? (say 30?)
>>
>
> Yes.  But I noticed something when performing this experiment that is an
> important point, I think:  if I just bring the network interface down and leave
> it down, we don't crash.  During this state, I see the host continually
> reconnecting after the reconnect delay time, timing out trying to reconnect, and
> retrying after another reconnect_delay period.  I see this for all 10 targets of
> course.  The crash only happens when I bring the interface back up, and the
> targets begin to reconnect.   So the process of successfully reconnecting the
> RDMA QPs, and restarting the nvme queues is somehow triggering running an nvme
> request too soon (or perhaps on the wrong queue).

Interesting. Given this is easy to reproduce, can you record the
(request_tag, *queue, *qp) tuple for each request submitted?

I'd like to see that the *queue stays the same for each tag
but the *qp indeed changes.
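
Something along these lines could do the bookkeeping. This is only a
userspace sketch of the check, not driver code: record_submit(),
struct tag_record and MAX_TAGS are made-up names, and it assumes the tag
is unique within whatever scope is being traced (for example a single
hardware context), which may not hold if several queues share a tag space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_TAGS 128   /* hypothetical tag-space size for this sketch */

/* One record per tag: the queue and QP pointers seen on the last submit. */
struct tag_record {
	void *queue;
	void *qp;
	bool seen;
};

static struct tag_record records[MAX_TAGS];

/*
 * Log one submission and compare it with the previous one for this tag.
 * Returns true if the queue pointer is unchanged (or this is the first
 * submission for the tag), false if the tag moved to a different queue.
 * A changed qp is only reported, since that is expected after a reconnect.
 */
static bool record_submit(unsigned int tag, void *queue, void *qp)
{
	struct tag_record *r;
	bool queue_stable;

	assert(tag < MAX_TAGS);
	r = &records[tag];
	queue_stable = !r->seen || r->queue == queue;

	printf("tag=%u queue=%p qp=%p%s%s\n", tag, queue, qp,
	       (r->seen && r->qp != qp) ? " (qp changed)" : "",
	       queue_stable ? "" : " (QUEUE CHANGED)");

	r->queue = queue;
	r->qp = qp;
	r->seen = true;
	return queue_stable;
}
```

In the driver itself the equivalent would presumably be a print at the
point where the request is queued. The sketch just shows which of the
two pointers is expected to move across a reconnect and which is not.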

>> Can you also give patch [1] a try? It's not a solution, but I want
>> to see if it hides the problem...
>>
>
> hmm.  I ran the experiment once with [1] and it didn't crash.  I ran it a 2nd
> time and hit a new crash.  Maybe a problem with [1]?

Strange, I don't see how we can visit rdma_destroy_qp twice given
that we have NVME_RDMA_IB_QUEUE_ALLOCATED bit protecting it.
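
To spell out what "protecting" means here: the sketch below assumes the
guard is a test-and-clear style check, so only the caller that actually
clears the bit runs the teardown. It is a userspace illustration with C11
atomics. struct fake_queue, fake_queue_create() and fake_queue_destroy()
are invented for the example and are not the driver's functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for NVME_RDMA_IB_QUEUE_ALLOCATED; the value is arbitrary. */
enum { IB_QUEUE_ALLOCATED = 1u << 0 };

struct fake_queue {
	atomic_uint flags;
	int destroy_calls;   /* how many times the teardown body really ran */
};

/* Mark the IB resources as allocated (the set_bit() side). */
static void fake_queue_create(struct fake_queue *q)
{
	assert(q != NULL);
	atomic_fetch_or(&q->flags, IB_QUEUE_ALLOCATED);
}

/*
 * Atomically clear the bit and look at its previous value. Only the
 * caller that saw the bit set performs the teardown, so a second call
 * is a no-op. Returns true if this call did the teardown.
 */
static bool fake_queue_destroy(struct fake_queue *q)
{
	unsigned int old;

	assert(q != NULL);
	old = atomic_fetch_and(&q->flags, ~(unsigned int)IB_QUEUE_ALLOCATED);
	if (!(old & IB_QUEUE_ALLOCATED))
		return false;

	q->destroy_calls++;   /* the QP/CQ destruction would go here */
	return true;
}
```

With a guard of this shape, a double destroy would need the bit to be
set again in between, i.e. a create racing with the teardown.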

Not sure if it fixes anything, but we probably need it regardless. Can
you give it another go with this on top:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 43602cebf097..89023326f397 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -542,11 +542,12 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue,
                 goto out_destroy_qp;
         }
         set_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags);
+       clear_bit(NVME_RDMA_Q_DELETING, &queue->flags);

         return 0;

  out_destroy_qp:
-       ib_destroy_qp(queue->qp);
+       rdma_destroy_qp(queue->qp);
  out_destroy_ib_cq:
         ib_free_cq(queue->ib_cq);
  out:
@@ -591,15 +592,16 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
         if (ret) {
                 dev_info(ctrl->ctrl.device,
                         "rdma_resolve_addr wait failed (%d).\n", ret);
-               goto out_destroy_cm_id;
+               goto out_destroy_queue_ib;
         }

         set_bit(NVME_RDMA_Q_CONNECTED, &queue->flags);

         return 0;

-out_destroy_cm_id:
+out_destroy_queue_ib:
         nvme_rdma_destroy_queue_ib(queue);
+out_destroy_cm_id:
         rdma_destroy_id(queue->cm_id);
         return ret;
  }
--
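
For reference, the second hunk relies on the usual goto unwind ladder:
labels are ordered so that jumping to one tears down that stage and then
falls through to every earlier stage. Below is a minimal userspace sketch
of that pattern. init_queue(), its fail_at parameter and unwind_log are
hypothetical and only mimic the label layout, not the real setup steps.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char unwind_log[64];   /* teardown steps that ran, in order */

static void log_step(const char *step)
{
	strncat(unwind_log, step,
		sizeof(unwind_log) - strlen(unwind_log) - 1);
}

/*
 * fail_at: 0 = success, 1 = failure before anything is allocated,
 * 2 = failure with only the cm_id allocated, 3 = failure with the
 * cm_id and the IB queue resources both allocated.
 */
static int init_queue(int fail_at)
{
	int ret = -1;

	assert(fail_at >= 0 && fail_at <= 3);
	unwind_log[0] = '\0';

	if (fail_at == 1)
		goto out;
	if (fail_at == 2)
		goto out_destroy_cm_id;
	if (fail_at == 3)
		goto out_destroy_queue_ib;
	return 0;

out_destroy_queue_ib:
	log_step("queue_ib ");   /* newest resource is released first */
out_destroy_cm_id:
	log_step("cm_id ");      /* then fall through to the older one */
out:
	return ret;
}
```

A late failure releases the IB queue resources and then the cm_id, while
an earlier failure jumps past the first label and only drops the cm_id.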
