From mboxrd@z Thu Jan 1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Wed, 10 Aug 2016 10:46:09 -0500
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
In-Reply-To: <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
References: <018301d1e9e1$da3b2e40$8eb18ac0$@opengridcomputing.com>
 <20160801110658.GF16141@lst.de>
 <008801d1ec00$a0bcfbf0$e236f3d0$@opengridcomputing.com>
 <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
Message-ID: <010e01d1f31e$50b80260$f2280720$@opengridcomputing.com>

Hey guys,

I've rebased the nvmf-4.8-rc branch on top of 4.8-rc2 so I have the
latest/greatest, and continued debugging this crash.  Here is what I see:

0) 10 ram disks are attached via nvmf/iw_cxgb4, and fio is started on all
10 disks.  This node has 8 cores, so that is 80 connections.

1) the cxgb4 interface is brought down a few seconds later

2) kato fires on all connections

3) the interface is brought back up 8 seconds after #1

4) 10 seconds after #2, all the qps are destroyed

5) reconnects start happening

6) a blk request is executed, and the nvme_rdma_request struct still has a
pointer to one of the qps destroyed in #4, and whamo...

I'm digging into the request cancel logic.  Any ideas/help is greatly
appreciated...

Thanks,

Steve.
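
P.S.  To make the window in #6 concrete, here is a rough sketch of the kind
of guard I'm playing with in the .queue_rq path.  This is just to show the
shape of the idea, not a patch: the helper name is made up, and I'm quoting
the queue->flags bit (NVME_RDMA_Q_CONNECTED) and the surrounding code from
memory, so it may not match the tree exactly.

	/*
	 * Hypothetical check: refuse to touch queue->qp once recovery
	 * has torn it down (step 4), until the reconnect in step 5 has
	 * re-created it.
	 */
	static bool nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue)
	{
		return test_bit(NVME_RDMA_Q_CONNECTED, &queue->flags);
	}

	static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
			const struct blk_mq_queue_data *bd)
	{
		struct nvme_rdma_queue *queue = hctx->driver_data;
		struct request *rq = bd->rq;
		struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

		/*
		 * req and queue were set up against the old connection;
		 * without a check like this the post below lands on a
		 * destroyed qp.
		 */
		if (!nvme_rdma_queue_is_ready(queue))
			return BLK_MQ_RQ_QUEUE_BUSY;	/* let blk-mq retry */

		/* ... normal mapping and nvme_rdma_post_send() follow ... */
		return BLK_MQ_RQ_QUEUE_OK;
	}

That obviously doesn't explain how the stale pointer survives the cancel
path in the first place, which is what I'm digging into now.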