Hi Valeriy,

This was a really stupid mistake on my part. In my original fix I moved a
decrement up without moving the assert that accompanied it. Please see this
one-liner, which should fix it. I will also backport it to 19.01.1. The
silver lining here is that there is nothing functionally wrong with the code;
the assert was simply erroneous and will only fire when SPDK is built in
debug mode.

https://review.gerrithub.io/c/spdk/spdk/+/446440/1/lib/nvmf/rdma.c

Thank you for replying and pointing this out,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Valeriy Glushkov
Sent: Tuesday, February 26, 2019 3:21 PM
To: Storage Performance Development Kit
Subject: Re: [SPDK] A problem with SPDK 19.01 NVMeoF/RDMA target

Hi Seth,

It seems that some problem is still present in the nvmf target built from
the recent SPDK trunk. It crashes with a core dump under our performance
test.

===========
# Starting SPDK v19.04-pre / DPDK 18.11.0 initialization...
[ DPDK EAL parameters: nvmf --no-shconf -c 0x1 --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid131804 ]
EAL: No free hugepages reported in hugepages-2048kB
EAL: 2 hugepages of size 1073741824 reserved, but no mounted hugetlbfs found for that size
app.c: 624:spdk_app_start: *NOTICE*: Total cores available: 1
reactor.c: 233:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002458d000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002464b000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245ae000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245a8000 length=4096
nvmf_tgt: rdma.c:2789: spdk_nvmf_rdma_poller_poll: Assertion `rdma_req->num_outstanding_data_wr > 0' failed.
^C
[1]+  Aborted                 (core dumped) ./nvmf_tgt -c ./nvmf.conf
=========

Do you need the dump file or some additional info?

--
Best regards,
Valeriy Glushkov
www.starwind.com
valeriy.glushkov(a)starwind.com

Howell, Seth wrote in his message of Thu, 07 Feb 2019 20:18:28 +0200:

> Hi Sasha, Valeriy,
>
> With the help of Valeriy's logs I was able to get to the bottom of this.
> The root cause is that for NVMe-oF requests that don't transfer any
> data, such as keep_alive, we were not properly resetting the value of
> rdma_req->num_outstanding_data_wr between uses of that structure. All
> data-carrying operations properly reset this value in
> spdk_nvmf_rdma_req_parse_sgl.
>
> My local repro steps look like this for anyone interested:
>
> Start the SPDK target.
> Submit a full queue depth's worth of Smart log requests (sequentially is
> fine). A smaller number also works, but takes much longer.
> Wait for a while (this assumes you have keep alive enabled). Keep alive
> requests will reuse the rdma_req objects, slowly incrementing the
> curr_send_depth on the admin qpair.
> Eventually the admin qpair will be unable to submit I/O.
>
> I was able to fix the issue locally with the following patch:
> https://review.gerrithub.io/#/c/spdk/spdk/+/443811/. Valeriy, please
> let me know if applying this patch also fixes it for you (I am pretty
> sure that it will).
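To make this concrete for anyone skimming the archive, here is a minimal
sketch of the two points being discussed. It is not the actual SPDK code
(fake_rdma_req, fake_req_reset and fake_data_wr_completed are made-up names),
but it shows the pattern: the per-request data-WR counter has to be reset
every time the request object is reused, including for commands that carry no
data such as keep_alive, and the debug assert has to run before the decrement
it guards, not after it.

/* Illustrative only; these are NOT the real SPDK structures or functions. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_rdma_req {
	int num_outstanding_data_wr;   /* data WRs still in flight for this request */
	bool has_data;                 /* does this command transfer data at all?   */
};

/* Called when a request object is taken for a new command.  Resetting the
 * counter here covers commands that never go through the SGL-parsing path
 * (keep_alive and other zero-data commands). */
static void fake_req_reset(struct fake_rdma_req *req, bool has_data)
{
	req->num_outstanding_data_wr = 0;
	req->has_data = has_data;
	if (has_data) {
		/* a data-carrying command sets this while building its WRs */
		req->num_outstanding_data_wr = 1;
	}
}

/* Completion path: check first, then decrement.  If the decrement is hoisted
 * above the assert, the assert checks the already-decremented value and fires
 * on the last completion even though nothing is functionally wrong. */
static void fake_data_wr_completed(struct fake_rdma_req *req)
{
	assert(req->num_outstanding_data_wr > 0);
	req->num_outstanding_data_wr--;
}

int main(void)
{
	struct fake_rdma_req req;

	fake_req_reset(&req, true);     /* data-carrying command */
	fake_data_wr_completed(&req);   /* counter goes 1 -> 0, assert holds */

	fake_req_reset(&req, false);    /* reuse the same object for keep_alive */
	printf("outstanding data WRs after reuse: %d\n",
	       req.num_outstanding_data_wr);
	return 0;
}

Built with a plain C compiler this runs cleanly in a debug build; moving the
decrement above the assert in fake_data_wr_completed() reproduces the kind of
spurious assertion failure shown in Valeriy's log.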
>
> Thank you for the bug report and for all of your help,
>
> Seth
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
> Sent: Thursday, February 7, 2019 11:06 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] A problem with SPDK 19.01 NVMeoF/RDMA target
>
> Hi,
>
> The RNR value shouldn't affect NVMF. I just want to check whether NVMF
> pre-posts enough receive requests. 19.01 introduced a new flow control
> mechanism that counts the number of send and receive work requests.
> Probably NVMF doesn't pre-post enough requests.
>
> Which network do you use: IB or RoCE? What is your HW and SW stack on
> the host and target sides (OS, OFED/MOFED version, NIC type)?
>
> I'd suggest configuring NVMF with a big max queue depth and, in your
> test, actually using half of that value.
>
> On 2/7/2019 5:37 PM, Valeriy Glushkov wrote:
>> Hi Sasha,
>>
>> There is no IBV on the host side, it's Windows.
>> So we have no control over the RNR field.
>>
>> From an RDMA session's dump I can see that the initiator sets
>> infiniband.cm.req.rnrretrcount to 0x6.
>>
>> Could the RNR value be related to the problem we have with the
>> SPDK 19.01 NVMeoF target?

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk
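For readers less familiar with the flow control change Sasha mentions, here
is a rough, purely illustrative sketch of the per-qpair send-depth accounting
being discussed. The names (fake_qpair, fake_post_send, fake_send_completed)
are invented and are not SPDK APIs. The idea is that submissions are gated on
the current depth and completions give the credit back; any command path that
forgets to do so slowly starves the queue, which matches the curr_send_depth
symptom Seth describes earlier in the thread.

/* Illustrative only; not the actual SPDK flow-control code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct fake_qpair {
	uint32_t curr_send_depth;  /* send WRs currently outstanding       */
	uint32_t max_send_depth;   /* negotiated limit for this queue pair */
};

static bool fake_post_send(struct fake_qpair *qp, uint32_t num_wrs)
{
	/* Flow control gate: refuse the submission if it would exceed the limit. */
	if (qp->curr_send_depth + num_wrs > qp->max_send_depth) {
		return false;  /* caller has to queue the request and retry later */
	}
	qp->curr_send_depth += num_wrs;
	return true;
}

static void fake_send_completed(struct fake_qpair *qp, uint32_t num_wrs)
{
	/* Every completed send must give its credit back; if any command type
	 * skips this, curr_send_depth creeps upward until fake_post_send()
	 * starts failing for good, i.e. the queue pair is starved. */
	assert(qp->curr_send_depth >= num_wrs);
	qp->curr_send_depth -= num_wrs;
}

int main(void)
{
	struct fake_qpair qp = { .curr_send_depth = 0, .max_send_depth = 4 };
	uint32_t i;

	/* Post and complete a few sends; the depth returns to zero each time. */
	for (i = 0; i < 8; i++) {
		if (fake_post_send(&qp, 1)) {
			fake_send_completed(&qp, 1);
		}
	}
	printf("final send depth: %u of %u\n", qp.curr_send_depth, qp.max_send_depth);
	return 0;
}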