Hi Seth,

It seems that some problem is still present in the nvmf target built from
the recent SPDK trunk. It crashes with a core dump under our performance
test.

===========
# Starting SPDK v19.04-pre / DPDK 18.11.0 initialization...
[ DPDK EAL parameters: nvmf --no-shconf -c 0x1 --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid131804 ]
EAL: No free hugepages reported in hugepages-2048kB
EAL: 2 hugepages of size 1073741824 reserved, but no mounted hugetlbfs found for that size
app.c: 624:spdk_app_start: *NOTICE*: Total cores available: 1
reactor.c: 233:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002458d000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002464b000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245ae000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245a8000 length=4096
nvmf_tgt: rdma.c:2789: spdk_nvmf_rdma_poller_poll: Assertion `rdma_req->num_outstanding_data_wr > 0' failed.
^C
[1]+ Aborted (core dumped) ./nvmf_tgt -c ./nvmf.conf
=========

Do you need the dump file or some additional info?

--
Best regards,
Valeriy Glushkov
www.starwind.com
valeriy.glushkov(a)starwind.com

Howell, Seth wrote on Thu, 07 Feb 2019 20:18:28 +0200:

> Hi Sasha, Valeriy,
>
> With the help of Valeriy's logs I was able to get to the bottom of this.
> The root cause is that for NVMe-oF requests that don't transfer any
> data, such as keep_alive, we were not properly resetting the value of
> rdma_req->num_outstanding_data_wr between uses of that structure. All
> data-carrying operations properly reset this value in
> spdk_nvmf_rdma_req_parse_sgl.
>
> My local repro steps look like this for anyone interested:
>
> Start the SPDK target.
> Submit a full queue depth worth of Smart log requests (sequentially is
> fine). A smaller number also works, but takes much longer.
> Wait for a while (this assumes you have keep alive enabled). Keep-alive
> requests will reuse the rdma_req objects, slowly incrementing the
> curr_send_depth on the admin qpair.
> Eventually the admin qpair will be unable to submit I/O.
>
> I was able to fix the issue locally with the following patch:
> https://review.gerrithub.io/#/c/spdk/spdk/+/443811/. Valeriy, please let
> me know if applying this patch also fixes it for you (I am pretty sure
> that it will).
>
> Thank you for the bug report and for all of your help,
>
> Seth
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
> Kotchubievsky
> Sent: Thursday, February 7, 2019 11:06 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] A problem with SPDK 19.01 NVMeoF/RDMA target
>
> Hi,
>
> The RNR value shouldn't affect NVMF. I just want to check whether NVMF
> pre-posts enough receive requests. 19.01 introduced a new flow-control
> scheme that counts the number of send and receive work requests.
> Probably NVMF doesn't pre-post enough requests.
>
> Which network do you use: IB or RoCE? What are your HW and SW stacks on
> the host and target sides (OS, OFED/MOFED version, NIC type)?
>
> I'd suggest configuring NVMF with a large max queue depth and actually
> using half of that value in your test.
>
> On 2/7/2019 5:37 PM, Valeriy Glushkov wrote:
>> Hi Sasha,
>>
>> There is no IBV on the host side, it's Windows.
>> So we have no control over the RNR field.
>>
>> From an RDMA session's dump I can see that the initiator sets
>> infiniband.cm.req.rnrretrcount to 0x6.
>>
>> Could the RNR value be related to the problem we have with the SPDK
>> 19.01 NVMeoF target?
>>
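
Below is a minimal C sketch of the idea behind the fix Seth describes above:
num_outstanding_data_wr has to be reset every time an rdma_req slot is
reused, including for commands such as keep_alive that carry no data,
otherwise a stale count from an earlier data command survives into the next
use and the poller's accounting (the assertion in the log above) goes wrong.
This is an illustration only, not the actual GerritHub patch; the struct,
enum, and function names are simplified stand-ins for the SPDK rdma.c
internals.

=========
/*
 * Illustration only -- NOT the actual GerritHub patch. Names are
 * simplified stand-ins for the SPDK rdma.c internals.
 */
#include <assert.h>
#include <stdint.h>

enum data_xfer {
	DATA_NONE = 0,            /* e.g. keep_alive: no data transfer */
	DATA_HOST_TO_CONTROLLER,
	DATA_CONTROLLER_TO_HOST,
};

struct rdma_request {
	enum data_xfer xfer;
	uint32_t num_outstanding_data_wr; /* in-flight RDMA READ/WRITE WRs */
};

/*
 * Called every time a request slot is (re)used. Before the fix, only the
 * data-carrying path (the equivalent of spdk_nvmf_rdma_req_parse_sgl)
 * reset this counter, so a slot previously used for a data command and
 * then reused for a keep_alive kept its stale non-zero value.
 */
static void rdma_request_reset(struct rdma_request *rdma_req)
{
	rdma_req->num_outstanding_data_wr = 0;
}

/* Simplified completion path mirroring the failing assertion. */
static void on_data_wr_complete(struct rdma_request *rdma_req)
{
	assert(rdma_req->num_outstanding_data_wr > 0);
	rdma_req->num_outstanding_data_wr--;
}

int main(void)
{
	struct rdma_request req = {
		.xfer = DATA_CONTROLLER_TO_HOST,
		.num_outstanding_data_wr = 1,   /* one data WR was posted */
	};
	on_data_wr_complete(&req);              /* data command finishes */

	/* Reuse the same slot for a keep_alive: reset it unconditionally. */
	req.xfer = DATA_NONE;
	rdma_request_reset(&req);
	/* No data WRs are posted, so nothing decrements the counter. */
	return 0;
}
=========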
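
As a rough illustration of Sasha's queue-depth suggestion above (configure
the target with a large maximum queue depth and drive only about half of it
from the initiator), a legacy INI-style nvmf.conf used with
./nvmf_tgt -c ./nvmf.conf might contain something like the snippet below.
The section and key names are assumed from the 19.01-era legacy config
format, so please check etc/spdk/nvmf.conf.in in your tree for the exact
spelling and defaults.

=========
# Assumed 19.01-era legacy config keys; verify against etc/spdk/nvmf.conf.in.
[Transport]
  Type RDMA
  # Generous queue depth on the target side; the test tool on the
  # initiator side would then open its queues at roughly half this
  # value (e.g. 128) so the target never runs at its limit.
  MaxQueueDepth 256
=========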