From: Valeriy Glushkov
Subject: Re: [SPDK] A problem with SPDK 19.01 NVMeoF/RDMA target
Date: Tue, 26 Feb 2019 22:21:16 +0000
To: spdk@lists.01.org

Hi Seth,

It seems that some problem is still present in the nvmf target built from
the recent SPDK trunk.
It crashes with a core dump under our performance test.

===========
# Starting SPDK v19.04-pre / DPDK 18.11.0 initialization...
[ DPDK EAL parameters: nvmf --no-shconf -c 0x1 --log-level=lib.eal:6
--base-virtaddr=0x200000000000 --file-prefix=spdk_pid131804 ]
EAL: No free hugepages reported in hugepages-2048kB
EAL: 2 hugepages of size 1073741824 reserved, but no mounted hugetlbfs
found for that size
app.c: 624:spdk_app_start: *NOTICE*: Total cores available: 1
reactor.c: 233:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002458d000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002464b000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245ae000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2758:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x20002457e000 length=4096
rdma.c:2786:spdk_nvmf_rdma_poller_poll: *ERROR*: data=0x2000245a8000 length=4096
nvmf_tgt: rdma.c:2789: spdk_nvmf_rdma_poller_poll: Assertion `rdma_req->num_outstanding_data_wr > 0' failed.
^C
[1]+  Aborted                 (core dumped) ./nvmf_tgt -c ./nvmf.conf
===========

Do you need the dump file or some additional info?

--
Best regards,
Valeriy Glushkov
www.starwind.com
valeriy.glushkov(a)starwind.com

Howell, Seth wrote in his message of Thu, 07 Feb 2019 20:18:28 +0200:

> Hi Sasha, Valeriy,
>
> With the help of Valeriy's logs I was able to get to the bottom of this.
> The root cause is that for NVMe-oF requests that don't transfer any
> data, such as keep_alive, we were not properly resetting the value of
> rdma_req->num_outstanding_data_wr between uses of that structure. All
> data-carrying operations properly reset this value in
> spdk_nvmf_rdma_req_parse_sgl.
>
> My local repro steps look like this, for anyone interested:
>
> Start the SPDK target.
> Submit a full queue depth worth of SMART log requests (sequentially is
> fine). A smaller number also works, but takes much longer.
> Wait for a while (this assumes you have keep alive enabled). Keep alive
> requests will reuse the rdma_req objects, slowly incrementing
> curr_send_depth on the admin qpair.
> Eventually the admin qpair will be unable to submit I/O.
>
> I was able to fix the issue locally with the following patch:
> https://review.gerrithub.io/#/c/spdk/spdk/+/443811/
> Valeriy, please let me know whether applying this patch also fixes it
> for you (I am pretty sure that it will).
>
> Thank you for the bug report and for all of your help,
>
> Seth
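For readers following the thread, here is a minimal C sketch of the kind of
reset Seth describes. It only illustrates the idea, not the content of the
GerritHub patch itself: the struct layout and the nvmf_rdma_request_reset()
helper are hypothetical, while rdma_req->num_outstanding_data_wr and
spdk_nvmf_rdma_req_parse_sgl come from Seth's explanation above.

#include <stdint.h>

/*
 * Illustrative sketch only -- not the literal GerritHub change.
 * rdma_req objects are pooled and reused. Requests that carry no data
 * (e.g. keep_alive) never go through spdk_nvmf_rdma_req_parse_sgl(), the
 * path that normally re-initializes num_outstanding_data_wr, so a stale
 * value can survive into the next use of the same rdma_req.
 */
struct spdk_nvmf_rdma_request {
	/* ...other members elided... */
	uint32_t num_outstanding_data_wr; /* data WRs still in flight for this request */
};

/* Hypothetical helper: called whenever a completed rdma_req is recycled. */
static void
nvmf_rdma_request_reset(struct spdk_nvmf_rdma_request *rdma_req)
{
	/*
	 * Clearing the counter on every recycle, not only in the SGL-parsing
	 * path, prevents a non-data request from inheriting the previous
	 * user's outstanding-WR count -- the leak that slowly inflated
	 * curr_send_depth on the admin qpair in the repro above.
	 */
	rdma_req->num_outstanding_data_wr = 0;
}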
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
> Kotchubievsky
> Sent: Thursday, February 7, 2019 11:06 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] A problem with SPDK 19.01 NVMeoF/RDMA target
>
> Hi,
>
> The RNR value shouldn't affect NVMF. I just want to check whether NVMF
> pre-posts enough receive requests. 19.01 introduced a new flow control
> scheme that counts the number of send and receive work requests.
> Possibly NVMF doesn't pre-post enough requests.
>
> Which network do you use: IB or RoCE? What are your HW and SW stacks on
> the host and target sides (OS, OFED/MOFED version, NIC type)?
>
> I'd suggest configuring NVMF with a big max queue depth and actually
> using half of that value in your test.
>
> On 2/7/2019 5:37 PM, Valeriy Glushkov wrote:
>> Hi Sasha,
>>
>> There is no IBV on the host side, it's Windows.
>> So we have no control over the RNR field.
>>
>> From an RDMA session's dump I can see that the initiator sets
>> infiniband.cm.req.rnrretrcount to 0x6.
>>
>> Could the RNR value be related to the problem we have with the SPDK 19.01
>> NVMeoF target?
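To connect Sasha's flow-control remark with the stuck admin qpair in Seth's
repro, here is a rough C sketch of per-qpair send work request budgeting of
the kind described. Only curr_send_depth is named in the thread; the struct,
the max_send_depth field, and can_post_send() are assumptions for
illustration, not the actual SPDK source.

#include <stdbool.h>
#include <stdint.h>

/*
 * Rough sketch of per-qpair send WR budgeting -- names other than
 * curr_send_depth are assumptions, not taken from the SPDK code.
 */
struct rdma_qpair_budget {
	uint32_t curr_send_depth; /* send WRs currently outstanding on the QP */
	uint32_t max_send_depth;  /* send queue size negotiated at connect time */
};

/* Can another batch of send WRs be posted right now? */
static bool
can_post_send(const struct rdma_qpair_budget *qp, uint32_t num_wrs)
{
	return qp->curr_send_depth + num_wrs <= qp->max_send_depth;
}

/*
 * If completions do not decrement curr_send_depth by the same amount that
 * submissions added (as with the stale num_outstanding_data_wr described
 * earlier in the thread), the budget never drains and can_post_send()
 * eventually returns false for every new request -- matching the admin
 * qpair that can no longer submit I/O in Seth's repro steps.
 */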