Hi Seth,


> On Feb 7, 2019, at 12:18 PM, Howell, Seth <seth.howell(a)intel.com> wrote:
> 
> Hi Sasha, Valeriy,
> 
> With the help of Valeriy's logs I was able to get to the bottom of this. The root cause is that, for NVMe-oF requests that don't transfer any data (such as keep_alive), we were not properly resetting rdma_req->num_outstanding_data_wr between uses of that structure. All data-carrying operations properly reset this value in spdk_nvmf_rdma_req_parse_sgl.
> 
> My local repro steps look like this, for anyone interested:
> 
> 1. Start the SPDK target.
> 2. Submit a full queue depth's worth of SMART log requests (sequentially is fine). A smaller number also works, but takes much longer.
> 3. Wait for a while (this assumes you have keep alive enabled). Keep alive requests will reuse the rdma_req objects, slowly incrementing curr_send_depth on the admin qpair.
> 4. Eventually the admin qpair will be unable to submit I/O.
> 
> I was able to fix the issue locally with the following patch: https://review.gerrithub.io/#/c/spdk/spdk/+/443811/. Valeriy, please let me know if applying this patch also fixes it for you (I am pretty sure that it will).


Does this issue present only in 19.01, or can it also occur in 18.10.1?

--
Lance Hartmann