Hi Seth,

It seems that some problem is still present in the nvmf target built from
the recent SPDK trunk. It crashes with a core dump under our performance
test.

===========
# Starting SPDK v19.04-pre / DPDK 18.11.0 initialization...
[ DPDK EAL parameters: nvmf --no-shconf -c 0x1 --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid131804 ]
EAL: No free hugepages reported in hugepages-2048kB
EAL: 2 hugepages of size 1073741824 reserved, but no mounted hugetlbfs found for that size
app.c: 624:spdk_app_start: *NOTICE*: Total cores available: 1
reactor.c: 233:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002458d000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002464b000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245ae000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245a8000 length=4096
nvmf_tgt: rdma.c:2789: spdk_nvmf_rdma_poller_poll: Assertion `rdma_req->num_outstanding_data_wr > 0' failed.
^C
[1]+ Aborted (core dumped) ./nvmf_tgt -c ./nvmf.conf
=========

Do you need the dump file or some additional info?

--
Best regards,
Valeriy Glushkov
www.starwind.com
valeriy.glushkov(a)starwind.com

Howell, Seth wrote on Thu, 07 Feb 2019 20:18:28 +0200:

> Hi Sasha, Valeriy,
>
> With the help of Valeriy's logs I was able to get to the bottom of this.
> The root cause is that for NVMe-oF requests that don't transfer any
> data, such as keep_alive, we were not properly resetting the value of
> rdma_req->num_outstanding_data_wr between uses of that structure. All
> data-carrying operations properly reset this value in
> spdk_nvmf_rdma_req_parse_sgl.
>
> My local repro steps look like this for anyone interested:
>
> Start the SPDK target.
> Submit a full queue depth worth of Smart log requests (sequentially is
> fine). A smaller number also works, but takes much longer.
> Wait for a while (this assumes you have keep alive enabled). Keep-alive
> requests will reuse the rdma_req objects, slowly incrementing the
> curr_send_depth on the admin qpair.
> Eventually the admin qpair will be unable to submit I/O.
>
> I was able to fix the issue locally with the following patch:
> https://review.gerrithub.io/#/c/spdk/spdk/+/443811/. Valeriy, please let
> me know if applying this patch also fixes it for you (I am pretty sure
> that it will).
>
> Thank you for the bug report and for all of your help,
>
> Seth
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
> Kotchubievsky
> Sent: Thursday, February 7, 2019 11:06 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] A problem with SPDK 19.01 NVMeoF/RDMA target
>
> Hi,
>
> The RNR value shouldn't affect NVMF. I just want to check whether NVMF
> pre-posts enough receive requests. 19.01 introduced a new flow-control
> scheme that counts the number of send and receive work requests.
> Probably NVMF doesn't pre-post enough requests.
>
> Which network do you use: IB or RoCE? What are your HW and SW stacks on
> the host and target sides (OS, OFED/MOFED version, NIC type)?
>
> I'd suggest configuring NVMF with a large max queue depth and actually
> using half of that value in your test.
>
> On 2/7/2019 5:37 PM, Valeriy Glushkov wrote:
>> Hi Sasha,
>>
>> There is no IBV on the host side, it's Windows.
>> So we have no control over the RNR field.
>>
>> From an RDMA session's dump I can see that the initiator sets
>> infiniband.cm.req.rnrretrcount to 0x6.
>>
>> Could the RNR value be related to the problem we have with the SPDK
>> 19.01 NVMeoF target?
>>
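
Below is a minimal C sketch of the idea behind the fix Seth describes above:
num_outstanding_data_wr has to be reset every time an rdma_req slot is
reused, including for commands such as keep_alive that carry no data,
otherwise a stale count from an earlier data command survives into the next
use and the poller's accounting (the assertion in the log above) goes wrong.
This is an illustration only, not the actual GerritHub patch; the struct,
enum, and function names are simplified stand-ins for the SPDK rdma.c
internals.

=========
/*
 * Illustration only -- NOT the actual GerritHub patch. Names are
 * simplified stand-ins for the SPDK rdma.c internals.
 */
#include <assert.h>
#include <stdint.h>

enum data_xfer {
	DATA_NONE = 0,            /* e.g. keep_alive: no data transfer */
	DATA_HOST_TO_CONTROLLER,
	DATA_CONTROLLER_TO_HOST,
};

struct rdma_request {
	enum data_xfer xfer;
	uint32_t num_outstanding_data_wr; /* in-flight RDMA READ/WRITE WRs */
};

/*
 * Called every time a request slot is (re)used. Before the fix, only the
 * data-carrying path (the equivalent of spdk_nvmf_rdma_req_parse_sgl)
 * reset this counter, so a slot previously used for a data command and
 * then reused for a keep_alive kept its stale non-zero value.
 */
static void rdma_request_reset(struct rdma_request *rdma_req)
{
	rdma_req->num_outstanding_data_wr = 0;
}

/* Simplified completion path mirroring the failing assertion. */
static void on_data_wr_complete(struct rdma_request *rdma_req)
{
	assert(rdma_req->num_outstanding_data_wr > 0);
	rdma_req->num_outstanding_data_wr--;
}

int main(void)
{
	struct rdma_request req = {
		.xfer = DATA_CONTROLLER_TO_HOST,
		.num_outstanding_data_wr = 1,   /* one data WR was posted */
	};
	on_data_wr_complete(&req);              /* data command finishes */

	/* Reuse the same slot for a keep_alive: reset it unconditionally. */
	req.xfer = DATA_NONE;
	rdma_request_reset(&req);
	/* No data WRs are posted, so nothing decrements the counter. */
	return 0;
}
=========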
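
As a rough illustration of Sasha's queue-depth suggestion above (configure
the target with a large maximum queue depth and drive only about half of it
from the initiator), a legacy INI-style nvmf.conf used with
./nvmf_tgt -c ./nvmf.conf might contain something like the snippet below.
The section and key names are assumed from the 19.01-era legacy config
format, so please check etc/spdk/nvmf.conf.in in your tree for the exact
spelling and defaults.

=========
# Assumed 19.01-era legacy config keys; verify against etc/spdk/nvmf.conf.in.
[Transport]
  Type RDMA
  # Generous queue depth on the target side; the test tool on the
  # initiator side would then open its queues at roughly half this
  # value (e.g. 128) so the target never runs at its limit.
  MaxQueueDepth 256
=========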