From mboxrd@z Thu Jan 1 00:00:00 1970
From: sagi@grimberg.me (Sagi Grimberg)
Date: Thu, 8 Sep 2016 10:45:35 +0300
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
In-Reply-To: <039401d2094c$084d64e0$18e82ea0$@opengridcomputing.com>
References: <018301d1e9e1$da3b2e40$8eb18ac0$@opengridcomputing.com>
 <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com>
 <013701d1f320$57b185d0$07149170$@opengridcomputing.com>
 <018401d1f32b$792cfdb0$6b86f910$@opengridcomputing.com>
 <01a301d1f339$55ba8e70$012fab50$@opengridcomputing.com>
 <2fb1129c-424d-8b2d-7101-b9471e897dc8@grimberg.me>
 <004701d1f3d8$760660b0$62132210$@opengridcomputing.com>
 <008101d1f3de$557d2850$007778f0$@opengridcomputing.com>
 <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
 <01c301d1f702$d28c7270$77a55750$@opengridcomputing.com>
 <6ef9b0d1-ce84-4598-74db-7adeed313bb6@grimberg.me>
 <045601d1f803$a9d73a20$fd85ae60$@opengridcomputing.com>
 <69c0e819-76d9-286b-c4fb-22f087f36ff1@grimberg.me>
 <08b701d1f8ba$a709ae10$f51d0a30$@opengridcomputing.com>
 <01c301d20485$0dfcd2c0$29f67840$@opengridcomputing.com>
 <0c159abb-24ee-21bf-09d2-9fe7d269a2eb@grimberg.me>
 <039401d2094c$084d64e0$18e82ea0$@opengridcomputing.com>
Message-ID: <7f09e373-6316-26a3-ae81-dab1205d88ab@grimberg.me>

>> Does this happen if you change the reconnect delay to be something
>> different than 10 seconds? (say 30?)
>>
>
> Yes. But I noticed something when performing this experiment that is an
> important point, I think: if I just bring the network interface down and
> leave it down, we don't crash. During this state, I see the host continually
> reconnecting after the reconnect delay time, timing out trying to reconnect,
> and retrying after another reconnect_delay period. I see this for all 10
> targets, of course. The crash only happens when I bring the interface back
> up and the targets begin to reconnect. So the process of successfully
> reconnecting the RDMA QPs and restarting the nvme queues is somehow causing
> an nvme request to run too soon (or perhaps on the wrong queue).

Interesting. Given this is easy to reproduce, can you record the
(request_tag, *queue, *qp) triple for each request submitted? I'd like to see
that the *queue stays the same for each tag but the *qp indeed changes. A
sketch of such instrumentation is below.

>> Can you also give patch [1] a try? It's not a solution, but I want
>> to see if it hides the problem...
>>
>
> hmm. I ran the experiment once with [1] and it didn't crash. I ran it a 2nd
> time and hit a new crash. Maybe a problem with [1]?

Strange, I don't see how we can reach rdma_destroy_qp twice given that we
have the NVME_RDMA_IB_QUEUE_ALLOCATED bit protecting it.
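
Something along these lines would do (untested, just a debug-only helper I
sketched; it assumes it is dropped into drivers/nvme/host/rdma.c and called
from nvme_rdma_queue_rq() once rq and queue are in hand, and the exact local
names in your tree may differ):

--
/*
 * Debug aid only: log (tag, queue, qp) for every submitted request so we
 * can check whether a given tag ever shows up with a different qp while
 * its queue pointer stays the same.
 */
static void nvme_rdma_log_submit(struct nvme_rdma_queue *queue,
		struct request *rq)
{
	pr_debug("nvme_rdma: submit tag=%d queue=%p qp=%p\n",
		 rq->tag, queue, queue->qp);
}
--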
Not sure if it fixes anything, but we probably need it regardless, can you
give another go with this on top:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 43602cebf097..89023326f397 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -542,11 +542,12 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue,
 		goto out_destroy_qp;
 	}
 	set_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags);
+	clear_bit(NVME_RDMA_Q_DELETING, &queue->flags);

 	return 0;

 out_destroy_qp:
-	ib_destroy_qp(queue->qp);
+	rdma_destroy_qp(queue->cm_id);
 out_destroy_ib_cq:
 	ib_free_cq(queue->ib_cq);
 out:
@@ -591,15 +592,16 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
 	if (ret) {
 		dev_info(ctrl->ctrl.device,
 			"rdma_resolve_addr wait failed (%d).\n", ret);
-		goto out_destroy_cm_id;
+		goto out_destroy_queue_ib;
 	}

 	set_bit(NVME_RDMA_Q_CONNECTED, &queue->flags);

 	return 0;

-out_destroy_cm_id:
+out_destroy_queue_ib:
 	nvme_rdma_destroy_queue_ib(queue);
+out_destroy_cm_id:
 	rdma_destroy_id(queue->cm_id);
 	return ret;
 }
--