* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-20 14:19 Luse, Paul E
  0 siblings, 0 replies; 17+ messages in thread
From: Luse, Paul E @ 2018-11-20 14:19 UTC (permalink / raw)
  To: spdk


OK, I'm on vacation until Monday. I'm sure I was, though; if it's not resolved when I get back I will double check and let you know

Thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
Sent: Tuesday, November 20, 2018 12:58 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

I just want to be sure that you are running source at or above commit 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce

We see a similar issue on an ARM platform with SPDK 18.10 plus a couple of our ARM-related patches

The target crashes after receiving a completion in an error state

mlx5: retchet01-snic.mtr.labs.mlnx: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

0000000c 00000000 00000000 00000000

00000000 9d005304 08000243 ca75dbd2

rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7ff220, Request 0x13586616 (4): local protection error

rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#12 changed
to: IBV_QPS_ERR

rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7ff220, Request 0x13586616 (5): Work Request Flushed Error

rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#12 changed
to: IBV_QPS_ERR

rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7ff220, Request 0x13637696 (5): Work Request Flushed Error

Program received signal SIGSEGV, Segmentation fault.

0x000000000044bb20 in spdk_nvmf_rdma_set_ibv_state ()

Missing separate debuginfos, use: debuginfo-install spdk-18.10-2.el7.aarch64

(gdb)

(gdb) bt

#0 0x000000000044bb20 in spdk_nvmf_rdma_set_ibv_state ()

#1 0x000000000044cf44 in spdk_nvmf_rdma_poller_poll.isra.9 ()

#2 0x000000000044d110 in spdk_nvmf_rdma_poll_group_poll ()

#3 0x000000000044a12c in spdk_nvmf_transport_poll_group_poll ()

#4 0x00000000004483bc in spdk_nvmf_poll_group_poll ()

#5 0x0000000000452ae8 in _spdk_reactor_run ()

#6 0x0000000000453080 in spdk_reactors_start ()

#7 0x0000000000451e44 in spdk_app_start ()

#8 0x000000000040803c in main ()


target crashes after "local protection error" followed by flush errors. 
The same pattern I see in logs reported in the email thread.

90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so I'd like to check whether it solves the issue.

Best regards

Sasha

On 11/18/2018 9:19 PM, Luse, Paul E wrote:
> FYI the test I ran was on master as of Fri... I can check versions if 
> you tell me the steps to get exactly what you're looking for
>
> Thx
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha 
> Kotchubievsky
> Sent: Sunday, November 18, 2018 9:52 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] nvmf_tgt seg fault
>
> Hi,
>
> Can you check the issue in latest master?
>
> Is
> https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3
> d90cce merged recently changes the behavior ?
>
> Do you use upstream OFED or Mellanox MOFED? Which version ?
>
> Best regards
>
> Sasha
>
> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
>> Sure, done.  Issue #500, do I win a prize? :)
>>
>> https://github.com/spdk/spdk/issues/500
>>
>> -Joe
>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
>>> James R
>>> Sent: Wednesday, November 14, 2018 11:16 AM
>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>>
>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>>>
>>> https://github.com/spdk/spdk/issues
>>>
>>> Thanks,
>>>
>>> -Jim
>>>
>>>
>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk- 
>>> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>>>
>>>       Hi everyone-
>>>
>>>       I'm running a dual socket Skylake server with P4510 NVMe and 
>>> 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
>>> NVMeoF target and exercising it from an initiator system (similar 
>>> config to the target but with 50Gb NIC) using FIO with the bdev 
>>> plugin.  I find 128K sequential workloads reliably and immediately 
>>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
>>> experiencing the seg fault, so the problem seems tied to the block 
>>> size and/or IO pattern.  I can run the same IO pattern against a 
>>> local PCIe device using SPDK without a problem, I only see the failure when running the NVMeoF target with FIO running the IO patter from an SPDK initiator system.
>>>
>>>       Steps to reproduce and seg fault output follow below.
>>>
>>>       Start the target:
>>>       sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
>>> /var/tmp/spdk1.sock
>>>
>>>       Configure the target:
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1 
>>> -t pcie - a 0000:1a:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2 
>>> -t pcie - a 0000:1b:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3 
>>> -t pcie - a 0000:1c:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4 
>>> -t pcie - a 0000:1d:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5 
>>> -t pcie - a 0000:3d:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6 
>>> -t pcie - a 0000:3e:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7 
>>> -t pcie - a 0000:3f:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8 
>>> -t pcie - a 0000:40:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l1
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l2
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l3
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l4
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l5
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l6
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l7
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l8
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l9
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l10
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l11
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l12
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn1 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn2 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn3 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn4 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn5 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn6 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn7 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn8 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn9 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn10 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn11 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn12 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn1 store1/l1
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn2 store1/l2
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn3 store1/l3
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn4 store1/l4
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn5 store1/l5
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn6 store1/l6
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn7 store1/l7
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn8 store1/l8
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn9 store1/l9
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn10 store1/l10
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn11 store1/l11
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn12 store1/l12
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>> nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>>>
>>>       FIO file on initiator:
>>>       [global]
>>>       rw=rw
>>>       rwmixread=100
>>>       numjobs=1
>>>       iodepth=32
>>>       bs=128k
>>>       direct=1
>>>       thread=1
>>>       time_based=1
>>>       ramp_time=10
>>>       runtime=10
>>>       ioengine=spdk_bdev
>>>       spdk_conf=/home/don/fio/nvmeof.conf
>>>       group_reporting=1
>>>       unified_rw_reporting=1
>>>       exitall=1
>>>       randrepeat=0
>>>       norandommap=1
>>>       cpus_allowed_policy=split
>>>       cpus_allowed=1-2
>>>       [job1]
>>>       filename=b0n1
>>>
>>>       Config file on initiator:
>>>       [Nvme]
>>>       TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
>>> subnqn:nqn.2018-
>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>>>
>>>       Run FIO on initiator and nvmf_tgt seg faults immediate:
>>>       sudo
>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
>>> n fio sr.ini
>>>
>>>       Seg fault looks like this:
>>>       mlx5: donsl202: got completion with error:
>>>       00000000 00000000 00000000 00000000
>>>       00000000 00000000 00000000 00000000
>>>       00000001 00000000 00000000 00000000
>>>       00000000 9d005304 0800011b 0008d0d2
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>>> CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>>>       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 
>>> changed to: IBV_QPS_ERR
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>>> CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
>>> Flushed Error
>>>       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 
>>> changed to: IBV_QPS_ERR
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
>>> Flushed Error
>>>       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 
>>> changed to: IBV_QPS_ERR
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
>>> Flushed Error
>>>       Segmentation fault
>>>
>>>       Adds this to dmesg:
>>>       [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid 
>>> service ID).
>>>       [71561.866466] nvme nvme1: rdma connection establishment failed (-104)
>>>       [71567.805288] reactor_7[9166]: segfault at 88 ip
>>> 00005630621e6580 sp
>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>>>       [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f 
>>> 1f 44 00 00 41 81
>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
>>> <49> 8b 96 88
>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>>>
>>>
>>>
>>>       _______________________________________________
>>>       SPDK mailing list
>>>       SPDK(a)lists.01.org
>>>       https://lists.01.org/mailman/listinfo/spdk
>>>
>>>
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk


* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-27 21:45 Howell, Seth
  0 siblings, 0 replies; 17+ messages in thread
From: Howell, Seth @ 2018-11-27 21:45 UTC (permalink / raw)
  To: spdk


Hi All,

I have updated the issue at https://github.com/spdk/spdk/issues/500.

For a short synopsis: after further testing, I found that the initiator issues I reported with Jim's patch were related to a callback that was not yet registered for the RDMA memory map. After applying that change on top of Jim's and Darek's changes, my test case was fixed.

You can follow the patch series here: https://review.gerrithub.io/c/spdk/spdk/+/433076

Thanks,

Seth


-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 27, 2018 11:06 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

FYI let's try and use (or at least occasionally update) the GitHub issue for this, as opposed to the dist list, so we don't lose track of what's been done, what's next, etc.

https://github.com/spdk/spdk/issues/500  

thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, Seth
Sent: Tuesday, November 27, 2018 10:34 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

Hi Jim,

The theory behind your patch makes sense to me. However, I am running into an issue with a test I am using to verify the functionality of your patch.
I applied your patch and Darek's initial patch (the one that exposes this issue), located at https://review.gerrithub.io/c/spdk/spdk/+/433076, on top of the current master (9e2f1bff814).
I then ran nvmf_lvol.sh locally as follows:
sudo HUGEMEM=8192 ./test/nvmf/lvol/nvmf_lvol.sh iso
I got back errors similar to the ones I observed when Darek's patch was first introduced.
For example:
nvme_rdma.c:1499:nvme_rdma_qpair_submit_request: *ERROR*: nvme_rdma_req_init() failed.
After adding a few further prints, I found the root of the error was still coming from nvme_rdma.c:1048 where we compare the length of the Memory Region from the start of the buffer to the buffer length. It looks like we are still getting data buffers split over MRs in the host.
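
For illustration, here is a minimal sketch of the kind of coverage check described above (not the actual nvme_rdma.c code; the helper and its names are made up for the example): given the ibv_mr that maps the start of a buffer, verify that the same MR also covers the buffer's end.

    #include <stdbool.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch: return true only if the whole buffer [buf, buf + len) falls
     * inside the registered region described by 'mr'.  A buffer carved out
     * of a mempool element that straddles two 2MB hugepages (and therefore
     * two MRs) fails this check and cannot be posted safely. */
    static bool
    buffer_within_mr(const struct ibv_mr *mr, const void *buf, size_t len)
    {
        uintptr_t mr_start  = (uintptr_t)mr->addr;
        uintptr_t mr_end    = mr_start + mr->length;
        uintptr_t buf_start = (uintptr_t)buf;

        return buf_start >= mr_start && buf_start + len <= mr_end;
    }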

Your patch does, however, seem to fix the issues we are facing on the target side. Prior to applying your patch, I was consistently getting errors such as the following (which are similar to the previous errors encountered):
    	mlx5: sethhowe-desk.ch.intel.com: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d003304 1000010d 010091d2
But those disappeared after applying your patch. This makes sense, since all of those buffers exist in the same spdk_mempool, which should be created with a single contiguous DPDK memory allocation.
I am not sure why your patch doesn't fix the initiator side. From our conversation yesterday,  I assumed that we were using buffers allocated in the same way as on the target side. I will do some more research today to figure out why I am still getting MR spanning errors. I assume that bdevperf is using buffers in a way we wouldn't expect. I just thought I would share my preliminary results to help push the conversation forward.

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Monday, November 26, 2018 5:20 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

Here's a patch I'd like to throw out for discussion:

https://review.gerrithub.io/#/c/spdk/spdk/+/434895/

It sort of reverts GerritHub #428716 - at least the part about registering a separate MR for each hugepage.  It tells DPDK to *not* free memory that has been dynamically allocated - that way we don't have to worry about DPDK freeing the memory in different units than it was allocated in.  This fixes all of the MR-spanning issues and significantly reduces the pressure on the maximum number of MRs supported by a NIC.

The downside is that if a user does a big allocation and then later frees it, the application retains the memory.  But the vast majority of SPDK use cases allocate all of the memory up front (via mempools) and don't free it until the application shuts down.

I'm driving for simplicity here.  This would ensure that all buffers allocated via SPDK malloc routines would never span an MR boundary and we could avoid a bunch of complexity in both the initiator and target.
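
As a rough sketch of that do-not-free idea (this is not the content of the linked patch, and the hotplug-callback registration is omitted), marking newly allocated segments could look roughly like the following, assuming the DPDK 18.x rte_memseg flags API:

    #include <stddef.h>
    #include <rte_memory.h>

    /* Sketch only: flag every memseg backing [addr, addr + len) so that
     * DPDK's dynamic memory subsystem never hands it back to the OS.
     * Intended to run from an RTE_MEM_EVENT_ALLOC notification. */
    static void
    mark_region_do_not_free(const void *addr, size_t len)
    {
        const char *va = addr;
        const char *end = va + len;

        while (va < end) {
            struct rte_memseg *ms = rte_mem_virt2memseg(va, NULL);

            if (ms == NULL) {
                break;
            }
            ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
            va += ms->hugepage_sz;
        }
    }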

-Jim




On 11/22/18, 10:55 AM, "SPDK on behalf of Evgenii Kochetov" <spdk-bounces(a)lists.01.org on behalf of evgeniik(a)mellanox.com> wrote:

    Hi,
    
    To follow up Sasha's email with more details and add some food for thought.
    
    First of all here is an easy way to reproduce the problem:
        1. Disable all hugepages except 2MB:
    	echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        2. Create simple NVMf target config:
    	[Transport]
    	  Type RDMA
    	[Null]
    	  Dev Null0 4096 4096
    	[Subsystem0]
    	  NQN nqn.2016-06.io.spdk:cnode0
    	  SN SPDK000DEADBEAF00
    	  Namespace Null0
    	  Listen RDMA 1.1.1.1:4420
    	  AllowAnyHost yes
        3. Start NVMf target app:
    	./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -L rdma
        4. Start initiator (perf tool):
    	./examples/nvme/perf/perf -q 16 -o 131072 -w read -t 10 -r 'trtype:RDMA adrfam:IPv4 traddr:1.1.1.1 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode0'
        5. Check for errors on target:
    	mlx5: host: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d005304 0800032d 0002bfd2
    	rdma.c:2584:spdk_nvmf_rdma_poller_poll: *DEBUG*: CQ error on CQ 0xc27560, Request 0x13392248 (4): local protection error
    
    The root cause, as already noted, is the dynamic memory allocation feature - more precisely, the part that splits allocated memory regions into hugepage-sized segments in the function memory_hotplug_cb in lib/env_dpdk/memory.c. This code was added in https://review.gerrithub.io/c/spdk/spdk/+/428716. As a result, a separate MR is registered for each 2MB hugepage. When we create the memory pool for data buffers (data_buf_pool) in lib/nvmf/rdma.c, some buffers cross a hugepage (and MR) boundary. When such a buffer is used for an RDMA operation, a local protection error is generated.
    The change itself looks reasonable, and it's not clear at the moment how this problem can be fixed.
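    
    (As a minimal illustration of the boundary condition described above, assuming 2MB hugepages with one MR registered per hugepage; the helper below is made up for the example.)
    
        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
    
        #define HUGEPAGE_SIZE_2MB (2u * 1024 * 1024)
    
        /* A data buffer is unsafe for a single-MR RDMA transfer if it starts
         * in one 2MB hugepage and ends in another, because each hugepage is
         * now registered as its own MR. */
        static bool
        buffer_crosses_hugepage(const void *buf, size_t len)
        {
            uintptr_t start = (uintptr_t)buf;
            uintptr_t end = start + len - 1;
    
            return (start / HUGEPAGE_SIZE_2MB) != (end / HUGEPAGE_SIZE_2MB);
        }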
    
    As Ben said, the -s parameter can be used as a workaround. In our setup, no errors occur when we add '-s 1024' to the target parameters.
    
    BR, Evgeniy.
    
    -----Original Message-----
    From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Sasha Kotchubievsky
    Sent: Thursday, November 22, 2018 3:21 PM
    To: Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>
    Subject: Re: [SPDK] nvmf_tgt seg fault
    
    Hi,
    
    It looks like some allocations cross a huge-page boundary, and therefore an MR boundary as well.
    
    After switching to bigger huge-pages (1GB instead of 2M) we don't see the problem.
    
    The crash after hitting "local protection error" is solved in "master". In 18.10, the crash is the result of incorrect processing of the op code in the RDMA completion.
    
    Sasha
    
    On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
    > On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
    >> We see the similar issue on ARM platform with SPDK 18.10+couple our 
    >> ARM related patches
    >>
    >> target crashes after receiving completion with in error state
    >>
    >> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
    >> 0x7ff220, Request 0x13586616 (4): local protection error
    >>
    >> target crashes after "local protection error" followed by flush errors.
    >> The same pattern I see in logs reported in the email thread.
    > Joe, Sasha - can you both try to reproduce the issue after having the 
    > NVMe-oF target pre-allocate its memory? This is the '-s' option to the 
    > target. Set it to at least 4GB to be safe. I'm concerned that this is 
    > a problem introduced by the patches that enable dynamic memory 
    > allocation (which creates multiple ibv_mrs and requires smarter splitting code that doesn't exist yet).
    >
    >
    >> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so 
    >> I'd to check if it solves the issue.
    >>
    >> Best regards
    >>
    >> Sasha
    >>
    >> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
    >>> FYI the test I ran was on master as of Fri... I can check versions 
    >>> if you tell me the steps to get exactly what you're looking for
    >>>
    >>> Thx
    >>> Paul
    >>>
    >>> -----Original Message-----
    >>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha 
    >>> Kotchubievsky
    >>> Sent: Sunday, November 18, 2018 9:52 AM
    >>> To: spdk(a)lists.01.org
    >>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>
    >>> Hi,
    >>>
    >>> Can you check the issue in latest master?
    >>>
    >>> Is
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fg
    >>> ithub.com%2Fspdk%2Fspdk%2Fcommit%2F90b4bd6cf9bb5805c0c6d8df982ac5f2e
    >>> 3d90cce&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b5
    >>> 81faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C63678
    >>> 4860985451622&amp;sdata=KGlTXoO7JkRyLYE3CcPAoXZVO8Q8VPbSIHPQngsrjxQ%
    >>> 3D&amp;reserved=0 merged recently changes the behavior ?
    >>>
    >>> Do you use upstream OFED or Mellanox MOFED? Which version ?
    >>>
    >>> Best regards
    >>>
    >>> Sasha
    >>>
    >>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
    >>>> Sure, done.  Issue #500, do I win a prize? :)
    >>>>
    >>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
    >>>> github.com%2Fspdk%2Fspdk%2Fissues%2F500&amp;data=02%7C01%7Cevgeniik
    >>>> %40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d
    >>>> 9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=0sR9wMRj1
    >>>> CrM3MQsLXEB8cPEABFDyky8Xo%2Fs3SV97Sc%3D&amp;reserved=0
    >>>>
    >>>> -Joe
    >>>>
    >>>>> -----Original Message-----
    >>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
    >>>>> James R
    >>>>> Sent: Wednesday, November 14, 2018 11:16 AM
    >>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
    >>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>>>
    >>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
    >>>>>
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Fgithub.com%2Fspdk%2Fspdk%2Fissues&amp;data=02%7C01%7Cevgeniik%40m
    >>>>> ellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba
    >>>>> 6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=BAJdbf4bpTp
    >>>>> FVnbHOeiuzdlC3xya3KoXFY8Dvs8Sjqk%3D&amp;reserved=0
    >>>>>
    >>>>> Thanks,
    >>>>>
    >>>>> -Jim
    >>>>>
    >>>>>
    >>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" 
    >>>>> <spdk- bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
    >>>>>
    >>>>>        Hi everyone-
    >>>>>
    >>>>>        I'm running a dual socket Skylake server with P4510 NVMe 
    >>>>> and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
    >>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
    >>>>> NVMeoF target and exercising it from an initiator system (similar 
    >>>>> config to the target but with 50Gb NIC) using FIO with the bdev 
    >>>>> plugin.  I find 128K sequential workloads reliably and immediately 
    >>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
    >>>>> experiencing the seg fault, so the problem seems tied to the block 
    >>>>> size and/or IO pattern.  I can run the same IO pattern against a 
    >>>>> local PCIe device using SPDK without a problem, I only see the 
    >>>>> failure when running the NVMeoF target with FIO running the IO 
    >>>>> patter from an SPDK initiator system.
    >>>>>
    >>>>>        Steps to reproduce and seg fault output follow below.
    >>>>>
    >>>>>        Start the target:
    >>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
    >>>>> /var/tmp/spdk1.sock
    >>>>>
    >>>>>        Configure the target:
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d1 -t pcie - a 0000:1a:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d2 -t pcie - a 0000:1b:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d3 -t pcie - a 0000:1c:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d4 -t pcie - a 0000:1d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d5 -t pcie - a 0000:3d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d6 -t pcie - a 0000:3e:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d7 -t pcie - a 0000:3f:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d8 -t pcie - a 0000:40:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
    >>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store 
    >>>>> raid1
    >>>>> store1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l1
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l2
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l3
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l4
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l5
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l6
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l7
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l8
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l9
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l10
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l11
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l12
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 store1/l1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 store1/l2
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 store1/l3
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 store1/l4
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 store1/l5
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 store1/l6
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 store1/l7
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 store1/l8
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 store1/l9
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 store1/l10
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 store1/l11
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 store1/l12
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
    >>>>>
    >>>>>        FIO file on initiator:
    >>>>>        [global]
    >>>>>        rw=rw
    >>>>>        rwmixread=100
    >>>>>        numjobs=1
    >>>>>        iodepth=32
    >>>>>        bs=128k
    >>>>>        direct=1
    >>>>>        thread=1
    >>>>>        time_based=1
    >>>>>        ramp_time=10
    >>>>>        runtime=10
    >>>>>        ioengine=spdk_bdev
    >>>>>        spdk_conf=/home/don/fio/nvmeof.conf
    >>>>>        group_reporting=1
    >>>>>        unified_rw_reporting=1
    >>>>>        exitall=1
    >>>>>        randrepeat=0
    >>>>>        norandommap=1
    >>>>>        cpus_allowed_policy=split
    >>>>>        cpus_allowed=1-2
    >>>>>        [job1]
    >>>>>        filename=b0n1
    >>>>>
    >>>>>        Config file on initiator:
    >>>>>        [Nvme]
    >>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
    >>>>> subnqn:nqn.2018-
    >>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
    >>>>>
    >>>>>        Run FIO on initiator and nvmf_tgt seg faults immediate:
    >>>>>        sudo
    >>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plu
    >>>>> gi
    >>>>> n fio sr.ini
    >>>>>
    >>>>>        Seg fault looks like this:
    >>>>>        mlx5: donsl202: got completion with error:
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000001 00000000 00000000 00000000
    >>>>>        00000000 9d005304 0800011b 0008d0d2
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        Segmentation fault
    >>>>>
    >>>>>        Adds this to dmesg:
    >>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8 
    >>>>> (invalid service ID).
    >>>>>        [71561.866466] nvme nvme1: rdma connection establishment 
    >>>>> failed (-
    >>>>> 104)
    >>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
    >>>>> 00005630621e6580 sp
    >>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
    >>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 
    >>>>> 0f 1f 44 00 00 41 81
    >>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
    >>>>> <49> 8b 96 88
    >>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
    >>>>>
    >>>>>
    >>>>>
    >>>>>        _______________________________________________
    >>>>>        SPDK mailing list
    >>>>>        SPDK(a)lists.01.org
    >>>>>        
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
    >>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
    >>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
    >>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>>>>
    >>>>>
    >>>>> _______________________________________________
    >>>>> SPDK mailing list
    >>>>> SPDK(a)lists.01.org
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
    >>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
    >>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
    >>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>>> _______________________________________________
    >>>> SPDK mailing list
    >>>> SPDK(a)lists.01.org
    >>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
    >>>> lists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgenii
    >>>> k%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4
    >>>> d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u
    >>>> 6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>> _______________________________________________
    >>> SPDK mailing list
    >>> SPDK(a)lists.01.org
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
    >>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
    >>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
    >>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
    >>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>> _______________________________________________
    >>> SPDK mailing list
    >>> SPDK(a)lists.01.org
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
    >>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
    >>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
    >>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
    >>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >> _______________________________________________
    >> SPDK mailing list
    >> SPDK(a)lists.01.org
    >> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fli
    >> sts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40
    >> mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a
    >> 4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5N
    >> ul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    > _______________________________________________
    > SPDK mailing list
    > SPDK(a)lists.01.org
    > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis
    > ts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40me
    > llanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d1
    > 49256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5z
    > x%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk


* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-27 18:05 Luse, Paul E
  0 siblings, 0 replies; 17+ messages in thread
From: Luse, Paul E @ 2018-11-27 18:05 UTC (permalink / raw)
  To: spdk


FYI let's try and use (or least update occasionally) the github issue for this as opposed to the dist list so we don't lose track of what's been done, what's next, etc.....

https://github.com/spdk/spdk/issues/500  

thx
PAul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, Seth
Sent: Tuesday, November 27, 2018 10:34 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

Hi Jim,

The theory behind your patch makes sense to me. However I am running into an issue with a test I am using to verify the functionality of your patch. 
I applied your patch and Darek's initial patch (The one that exposes this issue) located at https://review.gerrithub.io/c/spdk/spdk/+/433076 on top of the current master (9e2f1bff814). 
I then ran the nvmf_lvol.sh locally as follows:
sudo HUGEMEM=8192 ./test/nvmf/lvol/nvmf_lvol.sh iso I got back errors similar to the ones I observed when Darek's patch was first introduced.
For example:
nvme_rdma.c:1499:nvme_rdma_qpair_submit_request: *ERROR*: nvme_rdma_req_init() failed.
After adding a few further prints, I found the root of the error was still coming from nvme_rdma.c:1048 where we compare the length of the Memory Region from the start of the buffer to the buffer length. It looks like we are still getting data buffers split over MRs in the host.

Your patch does, however, seem to fix the issues we are facing on the target side. Prior to applying your patch, I was consistently getting errors such as the following (which is similar to the previous errors encountered.):
    	mlx5: sethhowe-desk.ch.intel.com: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d003304 1000010d 010091d2 But those disappeared after applying your patch. This makes sense since all of those buffers exist in the same spdk_mempool which should be created with a single contiguous DPDK memory allocation. 
I am not sure why your patch doesn't fix the initiator side. From our conversation yesterday,  I assumed that we were using buffers allocated in the same way as on the target side. I will do some more research today to figure out why I am still getting MR spanning errors. I assume that bdevperf is using buffers in a way we wouldn't expect. I just thought I would share my preliminary results to help push the conversation forward.

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Monday, November 26, 2018 5:20 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

Here's a patch I'd like to throw out for discussion:

https://review.gerrithub.io/#/c/spdk/spdk/+/434895/

It sort of reverts GerritHub #428716 - at least the part about registering a separate MR for each hugepage.  It tells DPDK to *not* free memory that has been dynamically allocated - that way we don't have to worry about DPDK freeing the memory in different units than it was allocated.  Fixes all of the MR-spanning issues, and significantly relaxes bumping up against the maximum number of MRs supported by a NIC.

Downside is that if a user does a big allocation and then later frees it, that application retains the memory.  But the vast majority of SPDK use cases allocate all of the memory up front (via mempools) and don't free it until the application shuts down.

I'm driving for simplicity here.  This would ensure that all buffers allocated via SPDK malloc routines would never span an MR boundary and we could avoid a bunch of complexity in both the initiator and target.

-Jim




On 11/22/18, 10:55 AM, "SPDK on behalf of Evgenii Kochetov" <spdk-bounces(a)lists.01.org on behalf of evgeniik(a)mellanox.com> wrote:

    Hi,
    
    To follow up Sasha's email with more details and add some food for thought.
    
    First of all here is an easy way to reproduce the problem:
        1. Disable all hugeapages except 2MB
    	echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        2. Create simple NVMf target config:
    	[Transport]
    	  Type RDMA
    	[Null]
    	  Dev Null0 4096 4096
    	[Subsystem0]
    	  NQN nqn.2016-06.io.spdk:cnode0
    	  SN SPDK000DEADBEAF00
    	  Namespace Null0
    	  Listen RDMA 1.1.1.1:4420
    	  AllowAnyHost yes
        3. Start NVMf target app:
    	./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -L rdma
        4. Start initiator (perf tool):
    	./examples/nvme/perf/perf -q 16 -o 131072 -w read -t 10 -r 'trtype:RDMA adrfam:IPv4 traddr:1.1.1.1 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode0'
        5. Check for errors on target:
    	mlx5: host: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d005304 0800032d 0002bfd2
    	rdma.c:2584:spdk_nvmf_rdma_poller_poll: *DEBUG*: CQ error on CQ 0xc27560, Request 0x13392248 (4): local protection error
    
    The root cause, as it was noted already, is dynamic memory allocation feature. To be more precise its part that splits allocated memory regions into hugepage sized segments in function memory_hotplug_cb in lib/env_dpdk/memory.c. This code was added in this change https://review.gerrithub.io/c/spdk/spdk/+/428716. As a result separate MRs are registered for each 2MB hugepage. When we create memory pool for data buffers (data_buf_pool) in lib/nvmf/rdma.c some buffers cross the hugepage (and MR) boundary. When this buffer is used for RDMA operation local protection error is generated.
    The change itself looks reasonable and it's not clear at the moment how this problem can be fixed.
    
    As Ben said, -s parameter can be used as a workaround. In our setup when we add '-s 1024' to target parameters no errors occur.
    
    BR, Evgeniy.
    
    -----Original Message-----
    From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Sasha Kotchubievsky
    Sent: Thursday, November 22, 2018 3:21 PM
    To: Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>
    Subject: Re: [SPDK] nvmf_tgt seg fault
    
    Hi,
    
    It looks like, some allocations cross huge-page boundary and MR boundary as well.
    
    After switching to bigger huge-pages (1GB instead of 2M) we don't see the problem.
    
    Crash, after hit "local protection error", is solved in "master". In 18.10, the crash is result of wrong  processing of op code in RDMA completion.
    
    Sasha
    
    On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
    > On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
    >> We see the similar issue on ARM platform with SPDK 18.10+couple our 
    >> ARM related patches
    >>
    >> target crashes after receiving completion with in error state
    >>
    >> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
    >> 0x7ff220, Request 0x13586616 (4): local protection error
    >>
    >> target crashes after "local protection error" followed by flush errors.
    >> The same pattern I see in logs reported in the email thread.
    > Joe, Sasha - can you both try to reproduce the issue after having the 
    > NVMe-oF target pre-allocate its memory? This is the '-s' option to the 
    > target. Set it to at least 4GB to be safe. I'm concerned that this is 
    > a problem introduced by the patches that enable dynamic memory 
    > allocation (which creates multiple ibv_mrs and requires smarter splitting code that doesn't exist yet).
    >
    >
    >> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so 
    >> I'd to check if it solves the issue.
    >>
    >> Best regards
    >>
    >> Sasha
    >>
    >> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
    >>> FYI the test I ran was on master as of Fri... I can check versions 
    >>> if you tell me the steps to get exactly what you're looking for
    >>>
    >>> Thx
    >>> Paul
    >>>
    >>> -----Original Message-----
    >>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha 
    >>> Kotchubievsky
    >>> Sent: Sunday, November 18, 2018 9:52 AM
    >>> To: spdk(a)lists.01.org
    >>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>
    >>> Hi,
    >>>
    >>> Can you check the issue in latest master?
    >>>
    >>> Is
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fg
    >>> ithub.com%2Fspdk%2Fspdk%2Fcommit%2F90b4bd6cf9bb5805c0c6d8df982ac5f2e
    >>> 3d90cce&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b5
    >>> 81faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C63678
    >>> 4860985451622&amp;sdata=KGlTXoO7JkRyLYE3CcPAoXZVO8Q8VPbSIHPQngsrjxQ%
    >>> 3D&amp;reserved=0 merged recently changes the behavior ?
    >>>
    >>> Do you use upstream OFED or Mellanox MOFED? Which version ?
    >>>
    >>> Best regards
    >>>
    >>> Sasha
    >>>
    >>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
    >>>> Sure, done.  Issue #500, do I win a prize? :)
    >>>>
    >>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
    >>>> github.com%2Fspdk%2Fspdk%2Fissues%2F500&amp;data=02%7C01%7Cevgeniik
    >>>> %40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d
    >>>> 9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=0sR9wMRj1
    >>>> CrM3MQsLXEB8cPEABFDyky8Xo%2Fs3SV97Sc%3D&amp;reserved=0
    >>>>
    >>>> -Joe
    >>>>
    >>>>> -----Original Message-----
    >>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
    >>>>> James R
    >>>>> Sent: Wednesday, November 14, 2018 11:16 AM
    >>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
    >>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>>>
    >>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
    >>>>>
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Fgithub.com%2Fspdk%2Fspdk%2Fissues&amp;data=02%7C01%7Cevgeniik%40m
    >>>>> ellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba
    >>>>> 6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=BAJdbf4bpTp
    >>>>> FVnbHOeiuzdlC3xya3KoXFY8Dvs8Sjqk%3D&amp;reserved=0
    >>>>>
    >>>>> Thanks,
    >>>>>
    >>>>> -Jim
    >>>>>
    >>>>>
    >>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" 
    >>>>> <spdk- bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
    >>>>>
    >>>>>        Hi everyone-
    >>>>>
    >>>>>        I'm running a dual socket Skylake server with P4510 NVMe 
    >>>>> and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
    >>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
    >>>>> NVMeoF target and exercising it from an initiator system (similar 
    >>>>> config to the target but with 50Gb NIC) using FIO with the bdev 
    >>>>> plugin.  I find 128K sequential workloads reliably and immediately 
    >>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
    >>>>> experiencing the seg fault, so the problem seems tied to the block 
    >>>>> size and/or IO pattern.  I can run the same IO pattern against a 
    >>>>> local PCIe device using SPDK without a problem, I only see the 
    >>>>> failure when running the NVMeoF target with FIO running the IO 
    >>>>> patter from an SPDK initiator system.
    >>>>>
    >>>>>        Steps to reproduce and seg fault output follow below.
    >>>>>
    >>>>>        Start the target:
    >>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
    >>>>> /var/tmp/spdk1.sock
    >>>>>
    >>>>>        Configure the target:
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d1 -t pcie - a 0000:1a:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d2 -t pcie - a 0000:1b:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d3 -t pcie - a 0000:1c:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d4 -t pcie - a 0000:1d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d5 -t pcie - a 0000:3d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d6 -t pcie - a 0000:3e:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d7 -t pcie - a 0000:3f:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d8 -t pcie - a 0000:40:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
    >>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store 
    >>>>> raid1
    >>>>> store1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l1
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l2
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l3
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l4
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l5
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l6
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l7
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l8
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l9
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l10
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l11
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l12
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 store1/l1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 store1/l2
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 store1/l3
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 store1/l4
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 store1/l5
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 store1/l6
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 store1/l7
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 store1/l8
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 store1/l9
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 store1/l10
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 store1/l11
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 store1/l12
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
    >>>>>
    >>>>>        FIO file on initiator:
    >>>>>        [global]
    >>>>>        rw=rw
    >>>>>        rwmixread=100
    >>>>>        numjobs=1
    >>>>>        iodepth=32
    >>>>>        bs=128k
    >>>>>        direct=1
    >>>>>        thread=1
    >>>>>        time_based=1
    >>>>>        ramp_time=10
    >>>>>        runtime=10
    >>>>>        ioengine=spdk_bdev
    >>>>>        spdk_conf=/home/don/fio/nvmeof.conf
    >>>>>        group_reporting=1
    >>>>>        unified_rw_reporting=1
    >>>>>        exitall=1
    >>>>>        randrepeat=0
    >>>>>        norandommap=1
    >>>>>        cpus_allowed_policy=split
    >>>>>        cpus_allowed=1-2
    >>>>>        [job1]
    >>>>>        filename=b0n1
    >>>>>
    >>>>>        Config file on initiator:
    >>>>>        [Nvme]
    >>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
    >>>>> subnqn:nqn.2018-
    >>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
    >>>>>
    >>>>>        Run FIO on initiator and nvmf_tgt seg faults immediately:
    >>>>>        sudo
    >>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plu
    >>>>> gi
    >>>>> n fio sr.ini
    >>>>>
    >>>>>        Seg fault looks like this:
    >>>>>        mlx5: donsl202: got completion with error:
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000001 00000000 00000000 00000000
    >>>>>        00000000 9d005304 0800011b 0008d0d2
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        Segmentation fault
    >>>>>
    >>>>>        Adds this to dmesg:
    >>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8 
    >>>>> (invalid service ID).
    >>>>>        [71561.866466] nvme nvme1: rdma connection establishment 
    >>>>> failed (-
    >>>>> 104)
    >>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
    >>>>> 00005630621e6580 sp
    >>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
    >>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 
    >>>>> 0f 1f 44 00 00 41 81
    >>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
    >>>>> <49> 8b 96 88
    >>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
    >>>>>
    >>>>>
    >>>>>

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-27 17:34 Howell, Seth
  0 siblings, 0 replies; 17+ messages in thread
From: Howell, Seth @ 2018-11-27 17:34 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 26813 bytes --]

Hi Jim,

The theory behind your patch makes sense to me. However, I am running into an issue with a test I am using to verify the functionality of your patch.
I applied your patch and Darek's initial patch (the one that exposes this issue), located at https://review.gerrithub.io/c/spdk/spdk/+/433076, on top of the current master (9e2f1bff814).
I then ran the nvmf_lvol.sh locally as follows:
sudo HUGEMEM=8192 ./test/nvmf/lvol/nvmf_lvol.sh iso
I got back errors similar to the ones I observed when Darek's patch was first introduced.
For example:
nvme_rdma.c:1499:nvme_rdma_qpair_submit_request: *ERROR*: nvme_rdma_req_init() failed.
After adding a few further prints, I found the root of the error was still coming from nvme_rdma.c:1048, where we compare the length of the memory region remaining from the start of the buffer against the buffer length. It looks like we are still getting data buffers split over MRs in the host.
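
A minimal sketch of the kind of check in question (hypothetical names, not the actual nvme_rdma.c code): a request has to be rejected when the memory region found for the start of a buffer ends before the buffer does.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch only: returns true when the buffer fits entirely inside the MR
 * that covers its first byte. If it does not, submitting that buffer for
 * an RDMA transfer is what produces the "local protection error" reported
 * in this thread. */
static bool
buffer_fits_in_mr(const void *buf, size_t buf_len,
                  uint64_t mr_addr, uint64_t mr_len)
{
        uint64_t start = (uint64_t)(uintptr_t)buf;

        /* Bytes of the MR remaining from the start of the buffer. */
        uint64_t mr_remaining = mr_addr + mr_len - start;

        return mr_remaining >= buf_len;
}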

Your patch does, however, seem to fix the issues we are facing on the target side. Prior to applying your patch, I was consistently getting errors such as the following (which are similar to the errors encountered previously):
    	mlx5: sethhowe-desk.ch.intel.com: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d003304 1000010d 010091d2
But those disappeared after applying your patch. This makes sense, since all of those buffers live in the same spdk_mempool, which should be created from a single contiguous DPDK memory allocation.
I am not sure why your patch doesn't fix the initiator side. From our conversation yesterday, I assumed that we were using buffers allocated in the same way as on the target side. I will do some more research today to figure out why I am still getting MR-spanning errors; I assume that bdevperf is using buffers in a way we wouldn't expect. I just thought I would share my preliminary results to help push the conversation forward.

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Monday, November 26, 2018 5:20 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt seg fault

Here's a patch I'd like to throw out for discussion:

https://review.gerrithub.io/#/c/spdk/spdk/+/434895/

It sort of reverts GerritHub #428716 - at least the part about registering a separate MR for each hugepage.  It tells DPDK to *not* free memory that has been dynamically allocated - that way we don't have to worry about DPDK freeing the memory in different units than it was allocated.  Fixes all of the MR-spanning issues, and significantly relaxes bumping up against the maximum number of MRs supported by a NIC.

Downside is that if a user does a big allocation and then later frees it, that application retains the memory.  But the vast majority of SPDK use cases allocate all of the memory up front (via mempools) and don't free it until the application shuts down.

I'm driving for simplicity here.  This would ensure that all buffers allocated via SPDK malloc routines would never span an MR boundary and we could avoid a bunch of complexity in both the initiator and target.

-Jim




On 11/22/18, 10:55 AM, "SPDK on behalf of Evgenii Kochetov" <spdk-bounces(a)lists.01.org on behalf of evgeniik(a)mellanox.com> wrote:

    Hi,
    
    To follow up Sasha's email with more details and add some food for thought.
    
    First of all here is an easy way to reproduce the problem:
        1. Disable all hugepages except 2MB
    	echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        2. Create simple NVMf target config:
    	[Transport]
    	  Type RDMA
    	[Null]
    	  Dev Null0 4096 4096
    	[Subsystem0]
    	  NQN nqn.2016-06.io.spdk:cnode0
    	  SN SPDK000DEADBEAF00
    	  Namespace Null0
    	  Listen RDMA 1.1.1.1:4420
    	  AllowAnyHost yes
        3. Start NVMf target app:
    	./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -L rdma
        4. Start initiator (perf tool):
    	./examples/nvme/perf/perf -q 16 -o 131072 -w read -t 10 -r 'trtype:RDMA adrfam:IPv4 traddr:1.1.1.1 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode0'
        5. Check for errors on target:
    	mlx5: host: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d005304 0800032d 0002bfd2
    	rdma.c:2584:spdk_nvmf_rdma_poller_poll: *DEBUG*: CQ error on CQ 0xc27560, Request 0x13392248 (4): local protection error
    
    The root cause, as was noted already, is the dynamic memory allocation feature; to be more precise, the part of it that splits allocated memory regions into hugepage-sized segments in the function memory_hotplug_cb in lib/env_dpdk/memory.c. This code was added in the change https://review.gerrithub.io/c/spdk/spdk/+/428716. As a result, a separate MR is registered for each 2MB hugepage. When we create the memory pool for data buffers (data_buf_pool) in lib/nvmf/rdma.c, some buffers cross a hugepage (and MR) boundary. When such a buffer is used for an RDMA operation, a local protection error is generated.
    The change itself looks reasonable and it's not clear at the moment how this problem can be fixed.
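    
    A toy illustration of that straddling, with made-up sizes rather than the real data_buf_pool layout: when elements are packed back to back and each 2MB hugepage has its own MR, some elements inevitably start near the end of one page and spill over into the next.
    
    #include <inttypes.h>
    #include <stdio.h>
    
    int main(void)
    {
            const uint64_t hugepage = 2ULL * 1024 * 1024; /* one MR per 2MB hugepage */
            const uint64_t elem = 4096 + 512;             /* example element size only */
            uint64_t addr = 0, crossing = 0, i;
    
            for (i = 0; i < 10000; i++, addr += elem) {
                    if (addr / hugepage != (addr + elem - 1) / hugepage) {
                            crossing++; /* element spans two hugepages, i.e. two MRs */
                    }
            }
            printf("%" PRIu64 " of 10000 elements cross a 2MB boundary\n", crossing);
            return 0;
    }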
    
    As Ben said, the -s parameter can be used as a workaround. In our setup, when we add '-s 1024' to the target parameters, no errors occur.
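    For example, the target command from step 3 above becomes:
        ./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -s 1024 -L rdma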
    
    BR, Evgeniy.
    
    -----Original Message-----
    From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Sasha Kotchubievsky
    Sent: Thursday, November 22, 2018 3:21 PM
    To: Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>
    Subject: Re: [SPDK] nvmf_tgt seg fault
    
    Hi,
    
    It looks like some allocations cross the huge-page boundary, and the MR boundary as well.
    
    After switching to bigger huge-pages (1GB instead of 2M) we don't see the problem.
    
    The crash after hitting "local protection error" is solved in "master". In 18.10, the crash is a result of wrong processing of the op code in the RDMA completion.
    
    Sasha
    
    On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
    > On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
    >> We see a similar issue on an ARM platform with SPDK 18.10 plus a couple
    >> of our ARM-related patches
    >>
    >> target crashes after receiving a completion in an error state
    >>
    >> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
    >> 0x7ff220, Request 0x13586616 (4): local protection error
    >>
    >> target crashes after "local protection error" followed by flush errors.
    >> The same pattern I see in logs reported in the email thread.
    > Joe, Sasha - can you both try to reproduce the issue after having the 
    > NVMe-oF target pre-allocate its memory? This is the '-s' option to the 
    > target. Set it to at least 4GB to be safe. I'm concerned that this is 
    > a problem introduced by the patches that enable dynamic memory 
    > allocation (which creates multiple ibv_mrs and requires smarter splitting code that doesn't exist yet).
    >
    >
    >> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so 
    >> I'd like to check if it solves the issue.
    >>
    >> Best regards
    >>
    >> Sasha
    >>
    >> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
    >>> FYI the test I ran was on master as of Fri... I can check versions 
    >>> if you tell me the steps to get exactly what you're looking for
    >>>
    >>> Thx
    >>> Paul
    >>>
    >>> -----Original Message-----
    >>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha 
    >>> Kotchubievsky
    >>> Sent: Sunday, November 18, 2018 9:52 AM
    >>> To: spdk(a)lists.01.org
    >>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>
    >>> Hi,
    >>>
    >>> Can you check the issue in latest master?
    >>>
    >>> Is
    >>> https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
    >>> merged recently changes the behavior ?
    >>>
    >>> Do you use upstream OFED or Mellanox MOFED? Which version ?
    >>>
    >>> Best regards
    >>>
    >>> Sasha
    >>>
    >>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
    >>>> Sure, done.  Issue #500, do I win a prize? :)
    >>>>
    >>>> https://github.com/spdk/spdk/issues/500
    >>>>
    >>>> -Joe
    >>>>
    >>>>> -----Original Message-----
    >>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
    >>>>> James R
    >>>>> Sent: Wednesday, November 14, 2018 11:16 AM
    >>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
    >>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>>>
    >>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
    >>>>>
    >>>>> https://github.com/spdk/spdk/issues
    >>>>>
    >>>>> Thanks,
    >>>>>
    >>>>> -Jim
    >>>>>
    >>>>>
    >>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" 
    >>>>> <spdk- bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
    >>>>>
    >>>>>        Hi everyone-
    >>>>>
    >>>>>        I'm running a dual socket Skylake server with P4510 NVMe 
    >>>>> and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
    >>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
    >>>>> NVMeoF target and exercising it from an initiator system (similar 
    >>>>> config to the target but with 50Gb NIC) using FIO with the bdev 
    >>>>> plugin.  I find 128K sequential workloads reliably and immediately 
    >>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
    >>>>> experiencing the seg fault, so the problem seems tied to the block 
    >>>>> size and/or IO pattern.  I can run the same IO pattern against a 
    >>>>> local PCIe device using SPDK without a problem, I only see the 
    >>>>> failure when running the NVMeoF target with FIO running the IO 
    >>>>> pattern from an SPDK initiator system.
    >>>>>
    >>>>>        Steps to reproduce and seg fault output follow below.
    >>>>>
    >>>>>        Start the target:
    >>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
    >>>>> /var/tmp/spdk1.sock
    >>>>>
    >>>>>        Configure the target:
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d1 -t pcie - a 0000:1a:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d2 -t pcie - a 0000:1b:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d3 -t pcie - a 0000:1c:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d4 -t pcie - a 0000:1d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d5 -t pcie - a 0000:3d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d6 -t pcie - a 0000:3e:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d7 -t pcie - a 0000:3f:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d8 -t pcie - a 0000:40:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
    >>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store 
    >>>>> raid1
    >>>>> store1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l1
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l2
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l3
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l4
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l5
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l6
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l7
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l8
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l9
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l10
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l11
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l12
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 store1/l1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 store1/l2
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 store1/l3
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 store1/l4
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 store1/l5
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 store1/l6
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 store1/l7
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 store1/l8
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 store1/l9
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 store1/l10
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 store1/l11
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 store1/l12
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
    >>>>>
    >>>>>        FIO file on initiator:
    >>>>>        [global]
    >>>>>        rw=rw
    >>>>>        rwmixread=100
    >>>>>        numjobs=1
    >>>>>        iodepth=32
    >>>>>        bs=128k
    >>>>>        direct=1
    >>>>>        thread=1
    >>>>>        time_based=1
    >>>>>        ramp_time=10
    >>>>>        runtime=10
    >>>>>        ioengine=spdk_bdev
    >>>>>        spdk_conf=/home/don/fio/nvmeof.conf
    >>>>>        group_reporting=1
    >>>>>        unified_rw_reporting=1
    >>>>>        exitall=1
    >>>>>        randrepeat=0
    >>>>>        norandommap=1
    >>>>>        cpus_allowed_policy=split
    >>>>>        cpus_allowed=1-2
    >>>>>        [job1]
    >>>>>        filename=b0n1
    >>>>>
    >>>>>        Config file on initiator:
    >>>>>        [Nvme]
    >>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
    >>>>> subnqn:nqn.2018-
    >>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
    >>>>>
    >>>>>        Run FIO on initiator and nvmf_tgt seg faults immediately:
    >>>>>        sudo
    >>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plu
    >>>>> gi
    >>>>> n fio sr.ini
    >>>>>
    >>>>>        Seg fault looks like this:
    >>>>>        mlx5: donsl202: got completion with error:
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000001 00000000 00000000 00000000
    >>>>>        00000000 9d005304 0800011b 0008d0d2
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        Segmentation fault
    >>>>>
    >>>>>        Adds this to dmesg:
    >>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8 
    >>>>> (invalid service ID).
    >>>>>        [71561.866466] nvme nvme1: rdma connection establishment 
    >>>>> failed (-
    >>>>> 104)
    >>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
    >>>>> 00005630621e6580 sp
    >>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
    >>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 
    >>>>> 0f 1f 44 00 00 41 81
    >>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
    >>>>> <49> 8b 96 88
    >>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
    >>>>>
    >>>>>
    >>>>>

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-27  9:07 Sasha Kotchubievsky
  0 siblings, 0 replies; 17+ messages in thread
From: Sasha Kotchubievsky @ 2018-11-27  9:07 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 26035 bytes --]

Great.

We'll retest our use case with this patch.

Sasha

On 11/27/2018 2:19 AM, Harris, James R wrote:
> Here's a patch I'd like to throw out for discussion:
>
> https://review.gerrithub.io/#/c/spdk/spdk/+/434895/
>
> It sort of reverts GerritHub #428716 - at least the part about registering a separate MR for each hugepage.  It tells DPDK to *not* free memory that has been dynamically allocated - that way we don't have to worry about DPDK freeing the memory in different units than it was allocated.  Fixes all of the MR-spanning issues, and significantly relaxes bumping up against the maximum number of MRs supported by a NIC.
>
> Downside is that if a user does a big allocation and then later frees it, that application retains the memory.  But the vast majority of SPDK use cases allocate all of the memory up front (via mempools) and don't free it until the application shuts down.
>
> I'm driving for simplicity here.  This would ensure that all buffers allocated via SPDK malloc routines would never span an MR boundary and we could avoid a bunch of complexity in both the initiator and target.
>
> -Jim
>
>
>
>
> On 11/22/18, 10:55 AM, "SPDK on behalf of Evgenii Kochetov" <spdk-bounces(a)lists.01.org on behalf of evgeniik(a)mellanox.com> wrote:
>
>      Hi,
>      
>      To follow up Sasha's email with more details and add some food for thought.
>      
>      First of all here is an easy way to reproduce the problem:
>          1. Disable all hugepages except 2MB
>      	echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>          2. Create simple NVMf target config:
>      	[Transport]
>      	  Type RDMA
>      	[Null]
>      	  Dev Null0 4096 4096
>      	[Subsystem0]
>      	  NQN nqn.2016-06.io.spdk:cnode0
>      	  SN SPDK000DEADBEAF00
>      	  Namespace Null0
>      	  Listen RDMA 1.1.1.1:4420
>      	  AllowAnyHost yes
>          3. Start NVMf target app:
>      	./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -L rdma
>          4. Start initiator (perf tool):
>      	./examples/nvme/perf/perf -q 16 -o 131072 -w read -t 10 -r 'trtype:RDMA adrfam:IPv4 traddr:1.1.1.1 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode0'
>          5. Check for errors on target:
>      	mlx5: host: got completion with error:
>      	00000000 00000000 00000000 00000000
>      	00000000 00000000 00000000 00000000
>      	00000001 00000000 00000000 00000000
>      	00000000 9d005304 0800032d 0002bfd2
>      	rdma.c:2584:spdk_nvmf_rdma_poller_poll: *DEBUG*: CQ error on CQ 0xc27560, Request 0x13392248 (4): local protection error
>      
>      The root cause, as it was noted already, is dynamic memory allocation feature. To be more precise its part that splits allocated memory regions into hugepage sized segments in function memory_hotplug_cb in lib/env_dpdk/memory.c. This code was added in this change https://review.gerrithub.io/c/spdk/spdk/+/428716. As a result separate MRs are registered for each 2MB hugepage. When we create memory pool for data buffers (data_buf_pool) in lib/nvmf/rdma.c some buffers cross the hugepage (and MR) boundary. When this buffer is used for RDMA operation local protection error is generated.
>      The change itself looks reasonable and it's not clear at the moment how this problem can be fixed.
>      
>      As Ben said, -s parameter can be used as a workaround. In our setup when we add '-s 1024' to target parameters no errors occur.
>      
>      BR, Evgeniy.
>      
>      -----Original Message-----
>      From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Sasha Kotchubievsky
>      Sent: Thursday, November 22, 2018 3:21 PM
>      To: Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>
>      Subject: Re: [SPDK] nvmf_tgt seg fault
>      
>      Hi,
>      
>      It looks like, some allocations cross huge-page boundary and MR boundary as well.
>      
>      After switching to bigger huge-pages (1GB instead of 2M) we don't see the problem.
>      
>      Crash, after hit "local protection error", is solved in "master". In 18.10, the crash is result of wrong  processing of op code in RDMA completion.
>      
>      Sasha
>      
>      On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
>      > On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
>      >> We see a similar issue on an ARM platform with SPDK 18.10 plus a couple
>      >> of our ARM-related patches
>      >>
>      >> target crashes after receiving a completion in an error state
>      >>
>      >> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
>      >> 0x7ff220, Request 0x13586616 (4): local protection error
>      >>
>      >> target crashes after "local protection error" followed by flush errors.
>      >> The same pattern I see in logs reported in the email thread.
>      > Joe, Sasha - can you both try to reproduce the issue after having the
>      > NVMe-oF target pre-allocate its memory? This is the '-s' option to the
>      > target. Set it to at least 4GB to be safe. I'm concerned that this is
>      > a problem introduced by the patches that enable dynamic memory
>      > allocation (which creates multiple ibv_mrs and requires smarter splitting code that doesn't exist yet).
>      >
>      >
>      >> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so
>      >> I'd like to check if it solves the issue.
>      >>
>      >> Best regards
>      >>
>      >> Sasha
>      >>
>      >> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
>      >>> FYI the test I ran was on master as of Fri... I can check versions
>      >>> if you tell me the steps to get exactly what you're looking for
>      >>>
>      >>> Thx
>      >>> Paul
>      >>>
>      >>> -----Original Message-----
>      >>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
>      >>> Kotchubievsky
>      >>> Sent: Sunday, November 18, 2018 9:52 AM
>      >>> To: spdk(a)lists.01.org
>      >>> Subject: Re: [SPDK] nvmf_tgt seg fault
>      >>>
>      >>> Hi,
>      >>>
>      >>> Can you check the issue in latest master?
>      >>>
>      >>> Is
>      >>> https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
>      >>> merged recently changes the behavior ?
>      >>>
>      >>> Do you use upstream OFED or Mellanox MOFED? Which version ?
>      >>>
>      >>> Best regards
>      >>>
>      >>> Sasha
>      >>>
>      >>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
>      >>>> Sure, done.  Issue #500, do I win a prize? :)
>      >>>>
>      >>>> https://github.com/spdk/spdk/issues/500
>      >>>>
>      >>>> -Joe
>      >>>>
>      >>>>> -----Original Message-----
>      >>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris,
>      >>>>> James R
>      >>>>> Sent: Wednesday, November 14, 2018 11:16 AM
>      >>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>      >>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>      >>>>>
>      >>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>      >>>>>
>      >>>>> https://github.com/spdk/spdk/issues
>      >>>>>
>      >>>>> Thanks,
>      >>>>>
>      >>>>> -Jim
>      >>>>>
>      >>>>>
>      >>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R"
>      >>>>> <spdk- bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>      >>>>>
>      >>>>>        Hi everyone-
>      >>>>>
>      >>>>>        I'm running a dual socket Skylake server with P4510 NVMe
>      >>>>> and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
>      >>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK
>      >>>>> NVMeoF target and exercising it from an initiator system (similar
>      >>>>> config to the target but with 50Gb NIC) using FIO with the bdev
>      >>>>> plugin.  I find 128K sequential workloads reliably and immediately
>      >>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without
>      >>>>> experiencing the seg fault, so the problem seems tied to the block
>      >>>>> size and/or IO pattern.  I can run the same IO pattern against a
>      >>>>> local PCIe device using SPDK without a problem, I only see the
>      >>>>> failure when running the NVMeoF target with FIO running the IO
>      >>>>> pattern from an SPDK initiator system.
>      >>>>>
>      >>>>>        Steps to reproduce and seg fault output follow below.
>      >>>>>
>      >>>>>        Start the target:
>      >>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
>      >>>>> /var/tmp/spdk1.sock
>      >>>>>
>      >>>>>        Configure the target:
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d1 -t pcie - a 0000:1a:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d2 -t pcie - a 0000:1b:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d3 -t pcie - a 0000:1c:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d4 -t pcie - a 0000:1d:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d5 -t pcie - a 0000:3d:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d6 -t pcie - a 0000:3e:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d7 -t pcie - a 0000:3f:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b
>      >>>>> d8 -t pcie - a 0000:40:00.0
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
>      >>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store
>      >>>>> raid1
>      >>>>> store1
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l1
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l2
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l3
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l4
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l5
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l6
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l7
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l8
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l9
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l10
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l11
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>      >>>>> store1 l12
>      >>>>> 1200000
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn1 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn2 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn3 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn4 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn5 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn6 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn7 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn8 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn9 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn10 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn11 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn12 -a
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn1 store1/l1
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn2 store1/l2
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn3 store1/l3
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn4 store1/l4
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn5 store1/l5
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn6 store1/l6
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn7 store1/l7
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn8 store1/l8
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn9 store1/l9
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn10 store1/l10
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn11 store1/l11
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>      >>>>> nqn.2018-
>      >>>>> 11.io.spdk:nqn12 store1/l12
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock
>      >>>>> nvmf_subsystem_add_listener
>      >>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>      >>>>>
>      >>>>>        FIO file on initiator:
>      >>>>>        [global]
>      >>>>>        rw=rw
>      >>>>>        rwmixread=100
>      >>>>>        numjobs=1
>      >>>>>        iodepth=32
>      >>>>>        bs=128k
>      >>>>>        direct=1
>      >>>>>        thread=1
>      >>>>>        time_based=1
>      >>>>>        ramp_time=10
>      >>>>>        runtime=10
>      >>>>>        ioengine=spdk_bdev
>      >>>>>        spdk_conf=/home/don/fio/nvmeof.conf
>      >>>>>        group_reporting=1
>      >>>>>        unified_rw_reporting=1
>      >>>>>        exitall=1
>      >>>>>        randrepeat=0
>      >>>>>        norandommap=1
>      >>>>>        cpus_allowed_policy=split
>      >>>>>        cpus_allowed=1-2
>      >>>>>        [job1]
>      >>>>>        filename=b0n1
>      >>>>>
>      >>>>>        Config file on initiator:
>      >>>>>        [Nvme]
>      >>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
>      >>>>> subnqn:nqn.2018-
>      >>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>      >>>>>
>      >>>>>        Run FIO on initiator and nvmf_tgt seg faults immediate:
>      >>>>>        sudo
>      >>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plu
>      >>>>> gi
>      >>>>> n fio sr.ini
>      >>>>>
>      >>>>>        Seg fault looks like this:
>      >>>>>        mlx5: donsl202: got completion with error:
>      >>>>>        00000000 00000000 00000000 00000000
>      >>>>>        00000000 00000000 00000000 00000000
>      >>>>>        00000001 00000000 00000000 00000000
>      >>>>>        00000000 9d005304 0800011b 0008d0d2
>      >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error
>      >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>      >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV
>      >>>>> QP#1 changed to: IBV_QPS_ERR
>      >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error
>      >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request
>      >>>>> Flushed Error
>      >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV
>      >>>>> QP#1 changed to: IBV_QPS_ERR
>      >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error
>      >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
>      >>>>> Flushed Error
>      >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV
>      >>>>> QP#1 changed to: IBV_QPS_ERR
>      >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error
>      >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
>      >>>>> Flushed Error
>      >>>>>        Segmentation fault
>      >>>>>
>      >>>>>        Adds this to dmesg:
>      >>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8
>      >>>>> (invalid service ID).
>      >>>>>        [71561.866466] nvme nvme1: rdma connection establishment
>      >>>>> failed (-
>      >>>>> 104)
>      >>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
>      >>>>> 00005630621e6580 sp
>      >>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>      >>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff
>      >>>>> 0f 1f 44 00 00 41 81
>      >>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00
>      >>>>> <49> 8b 96 88
>      >>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>      >>>>>
>      >>>>>
>      >>>>>
>      >>>>>        _______________________________________________
>      >>>>>        SPDK mailing list
>      >>>>>        SPDK(a)lists.01.org
>      >>>>>
>      >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
>      >>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
>      >>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
>      >>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
>      >>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      >>>>>
>      >>>>>
>      >>>>> _______________________________________________
>      >>>>> SPDK mailing list
>      >>>>> SPDK(a)lists.01.org
>      >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
>      >>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
>      >>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
>      >>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
>      >>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      >>>> _______________________________________________
>      >>>> SPDK mailing list
>      >>>> SPDK(a)lists.01.org
>      >>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
>      >>>> lists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgenii
>      >>>> k%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4
>      >>>> d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u
>      >>>> 6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      >>> _______________________________________________
>      >>> SPDK mailing list
>      >>> SPDK(a)lists.01.org
>      >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>      >>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
>      >>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
>      >>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
>      >>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      >>> _______________________________________________
>      >>> SPDK mailing list
>      >>> SPDK(a)lists.01.org
>      >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>      >>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
>      >>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
>      >>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
>      >>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      >> _______________________________________________
>      >> SPDK mailing list
>      >> SPDK(a)lists.01.org
>      >> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fli
>      >> sts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40
>      >> mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a
>      >> 4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5N
>      >> ul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      > _______________________________________________
>      > SPDK mailing list
>      > SPDK(a)lists.01.org
>      > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis
>      > ts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40me
>      > llanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d1
>      > 49256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5z
>      > x%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      _______________________________________________
>      SPDK mailing list
>      SPDK(a)lists.01.org
>      https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>      _______________________________________________
>      SPDK mailing list
>      SPDK(a)lists.01.org
>      https://lists.01.org/mailman/listinfo/spdk
>      
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-27  0:19 Harris, James R
  0 siblings, 0 replies; 17+ messages in thread
From: Harris, James R @ 2018-11-27  0:19 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 24386 bytes --]

Here's a patch I'd like to throw out for discussion:

https://review.gerrithub.io/#/c/spdk/spdk/+/434895/

It sort of reverts GerritHub #428716 - at least the part about registering a separate MR for each hugepage.  It tells DPDK to *not* free memory that has been dynamically allocated - that way we don't have to worry about DPDK freeing the memory in different units than it was allocated in.  That fixes all of the MR-spanning issues, and makes it much less likely that we bump up against the maximum number of MRs supported by a NIC.

The downside is that if a user does a big allocation and then later frees it, the application still retains the memory.  But the vast majority of SPDK use cases allocate all of their memory up front (via mempools) and don't free it until the application shuts down.

I'm driving for simplicity here.  This would ensure that all buffers allocated via SPDK malloc routines would never span an MR boundary and we could avoid a bunch of complexity in both the initiator and target.
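
For illustration only, here is that invariant sketched as a tiny check (this is not code from the patch; the 2MB MR granularity and the helper name are assumptions):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MR_GRANULARITY (2ULL * 1024 * 1024)  /* one MR per 2MB hugepage */

    /* True if [buf, buf + len), len > 0, touches two different 2MB regions,
     * i.e. would need two MRs when each hugepage is registered separately. */
    static bool
    buffer_spans_mr_boundary(const void *buf, size_t len)
    {
        uintptr_t start = (uintptr_t)buf;
        uintptr_t end = start + len - 1;

        return (start / MR_GRANULARITY) != (end / MR_GRANULARITY);
    }

With the patch above, buffers handed out by the SPDK malloc routines should always make this return false.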

-Jim




On 11/22/18, 10:55 AM, "SPDK on behalf of Evgenii Kochetov" <spdk-bounces(a)lists.01.org on behalf of evgeniik(a)mellanox.com> wrote:

    Hi,
    
    To follow up Sasha's email with more details and add some food for thought.
    
    First of all, here is an easy way to reproduce the problem:
        1. Disable all hugepages except 2MB
    	echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        2. Create simple NVMf target config:
    	[Transport]
    	  Type RDMA
    	[Null]
    	  Dev Null0 4096 4096
    	[Subsystem0]
    	  NQN nqn.2016-06.io.spdk:cnode0
    	  SN SPDK000DEADBEAF00
    	  Namespace Null0
    	  Listen RDMA 1.1.1.1:4420
    	  AllowAnyHost yes
        3. Start NVMf target app:
    	./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -L rdma
        4. Start initiator (perf tool):
    	./examples/nvme/perf/perf -q 16 -o 131072 -w read -t 10 -r 'trtype:RDMA adrfam:IPv4 traddr:1.1.1.1 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode0'
        5. Check for errors on target:
    	mlx5: host: got completion with error:
    	00000000 00000000 00000000 00000000
    	00000000 00000000 00000000 00000000
    	00000001 00000000 00000000 00000000
    	00000000 9d005304 0800032d 0002bfd2
    	rdma.c:2584:spdk_nvmf_rdma_poller_poll: *DEBUG*: CQ error on CQ 0xc27560, Request 0x13392248 (4): local protection error
    
    The root cause, as was noted already, is the dynamic memory allocation feature - to be more precise, the part that splits allocated memory regions into hugepage-sized segments in the function memory_hotplug_cb in lib/env_dpdk/memory.c. This code was added in change https://review.gerrithub.io/c/spdk/spdk/+/428716. As a result, a separate MR is registered for each 2MB hugepage. When we create the memory pool for data buffers (data_buf_pool) in lib/nvmf/rdma.c, some buffers cross a hugepage (and therefore MR) boundary. When such a buffer is used for an RDMA operation, a local protection error is generated.
    The change itself looks reasonable and it's not clear at the moment how this problem can be fixed.
    
    As Ben said, the -s parameter can be used as a workaround. In our setup, when we add '-s 1024' to the target parameters, no errors occur.
    
    BR, Evgeniy.
    
    -----Original Message-----
    From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Sasha Kotchubievsky
    Sent: Thursday, November 22, 2018 3:21 PM
    To: Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>
    Subject: Re: [SPDK] nvmf_tgt seg fault
    
    Hi,
    
    It looks like some allocations cross the huge-page boundary, and the MR boundary as well.
    
    After switching to bigger huge-pages (1GB instead of 2MB), we don't see the problem.
    
    The crash after hitting "local protection error" is solved in "master". In 18.10, the crash is the result of wrong processing of the op code in the RDMA completion.
    
    Sasha
    
    On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
    > On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
    >> We see the similar issue on ARM platform with SPDK 18.10+couple our 
    >> ARM related patches
    >>
    >> target crashes after receiving completion with in error state
    >>
    >> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
    >> 0x7ff220, Request 0x13586616 (4): local protection error
    >>
    >> target crashes after "local protection error" followed by flush errors.
    >> The same pattern I see in logs reported in the email thread.
    > Joe, Sasha - can you both try to reproduce the issue after having the 
    > NVMe-oF target pre-allocate its memory? This is the '-s' option to the 
    > target. Set it to at least 4GB to be safe. I'm concerned that this is 
    > a problem introduced by the patches that enable dynamic memory 
    > allocation (which creates multiple ibv_mrs and requires smarter splitting code that doesn't exist yet).
    >
    >
    >> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so 
    >> I'd to check if it solves the issue.
    >>
    >> Best regards
    >>
    >> Sasha
    >>
    >> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
    >>> FYI the test I ran was on master as of Fri... I can check versions 
    >>> if you tell me the steps to get exactly what you're looking for
    >>>
    >>> Thx
    >>> Paul
    >>>
    >>> -----Original Message-----
    >>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha 
    >>> Kotchubievsky
    >>> Sent: Sunday, November 18, 2018 9:52 AM
    >>> To: spdk(a)lists.01.org
    >>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>
    >>> Hi,
    >>>
    >>> Can you check the issue in latest master?
    >>>
    >>> Is
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fg
    >>> ithub.com%2Fspdk%2Fspdk%2Fcommit%2F90b4bd6cf9bb5805c0c6d8df982ac5f2e
    >>> 3d90cce&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b5
    >>> 81faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C63678
    >>> 4860985451622&amp;sdata=KGlTXoO7JkRyLYE3CcPAoXZVO8Q8VPbSIHPQngsrjxQ%
    >>> 3D&amp;reserved=0 merged recently changes the behavior ?
    >>>
    >>> Do you use upstream OFED or Mellanox MOFED? Which version ?
    >>>
    >>> Best regards
    >>>
    >>> Sasha
    >>>
    >>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
    >>>> Sure, done.  Issue #500, do I win a prize? :)
    >>>>
    >>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
    >>>> github.com%2Fspdk%2Fspdk%2Fissues%2F500&amp;data=02%7C01%7Cevgeniik
    >>>> %40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d
    >>>> 9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=0sR9wMRj1
    >>>> CrM3MQsLXEB8cPEABFDyky8Xo%2Fs3SV97Sc%3D&amp;reserved=0
    >>>>
    >>>> -Joe
    >>>>
    >>>>> -----Original Message-----
    >>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
    >>>>> James R
    >>>>> Sent: Wednesday, November 14, 2018 11:16 AM
    >>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
    >>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
    >>>>>
    >>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
    >>>>>
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Fgithub.com%2Fspdk%2Fspdk%2Fissues&amp;data=02%7C01%7Cevgeniik%40m
    >>>>> ellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba
    >>>>> 6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=BAJdbf4bpTp
    >>>>> FVnbHOeiuzdlC3xya3KoXFY8Dvs8Sjqk%3D&amp;reserved=0
    >>>>>
    >>>>> Thanks,
    >>>>>
    >>>>> -Jim
    >>>>>
    >>>>>
    >>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" 
    >>>>> <spdk- bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
    >>>>>
    >>>>>        Hi everyone-
    >>>>>
    >>>>>        I'm running a dual socket Skylake server with P4510 NVMe 
    >>>>> and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
    >>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
    >>>>> NVMeoF target and exercising it from an initiator system (similar 
    >>>>> config to the target but with 50Gb NIC) using FIO with the bdev 
    >>>>> plugin.  I find 128K sequential workloads reliably and immediately 
    >>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
    >>>>> experiencing the seg fault, so the problem seems tied to the block 
    >>>>> size and/or IO pattern.  I can run the same IO pattern against a 
    >>>>> local PCIe device using SPDK without a problem, I only see the 
    >>>>> failure when running the NVMeoF target with FIO running the IO 
    >>>>> patter from an SPDK initiator system.
    >>>>>
    >>>>>        Steps to reproduce and seg fault output follow below.
    >>>>>
    >>>>>        Start the target:
    >>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
    >>>>> /var/tmp/spdk1.sock
    >>>>>
    >>>>>        Configure the target:
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d1 -t pcie - a 0000:1a:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d2 -t pcie - a 0000:1b:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d3 -t pcie - a 0000:1c:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d4 -t pcie - a 0000:1d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d5 -t pcie - a 0000:3d:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d6 -t pcie - a 0000:3e:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d7 -t pcie - a 0000:3f:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
    >>>>> d8 -t pcie - a 0000:40:00.0
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
    >>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store 
    >>>>> raid1
    >>>>> store1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l1
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l2
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l3
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l4
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l5
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l6
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l7
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l8
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l9
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l10
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l11
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
    >>>>> store1 l12
    >>>>> 1200000
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 -a
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn1 store1/l1
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn2 store1/l2
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn3 store1/l3
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn4 store1/l4
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn5 store1/l5
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn6 store1/l6
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn7 store1/l7
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn8 store1/l8
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn9 store1/l9
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn10 store1/l10
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn11 store1/l11
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
    >>>>> nqn.2018-
    >>>>> 11.io.spdk:nqn12 store1/l12
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
    >>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
    >>>>> nvmf_subsystem_add_listener
    >>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
    >>>>>
    >>>>>        FIO file on initiator:
    >>>>>        [global]
    >>>>>        rw=rw
    >>>>>        rwmixread=100
    >>>>>        numjobs=1
    >>>>>        iodepth=32
    >>>>>        bs=128k
    >>>>>        direct=1
    >>>>>        thread=1
    >>>>>        time_based=1
    >>>>>        ramp_time=10
    >>>>>        runtime=10
    >>>>>        ioengine=spdk_bdev
    >>>>>        spdk_conf=/home/don/fio/nvmeof.conf
    >>>>>        group_reporting=1
    >>>>>        unified_rw_reporting=1
    >>>>>        exitall=1
    >>>>>        randrepeat=0
    >>>>>        norandommap=1
    >>>>>        cpus_allowed_policy=split
    >>>>>        cpus_allowed=1-2
    >>>>>        [job1]
    >>>>>        filename=b0n1
    >>>>>
    >>>>>        Config file on initiator:
    >>>>>        [Nvme]
    >>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
    >>>>> subnqn:nqn.2018-
    >>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
    >>>>>
    >>>>>        Run FIO on initiator and nvmf_tgt seg faults immediate:
    >>>>>        sudo
    >>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plu
    >>>>> gi
    >>>>> n fio sr.ini
    >>>>>
    >>>>>        Seg fault looks like this:
    >>>>>        mlx5: donsl202: got completion with error:
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000000 00000000 00000000 00000000
    >>>>>        00000001 00000000 00000000 00000000
    >>>>>        00000000 9d005304 0800011b 0008d0d2
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
    >>>>> QP#1 changed to: IBV_QPS_ERR
    >>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
    >>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
    >>>>> Flushed Error
    >>>>>        Segmentation fault
    >>>>>
    >>>>>        Adds this to dmesg:
    >>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8 
    >>>>> (invalid service ID).
    >>>>>        [71561.866466] nvme nvme1: rdma connection establishment 
    >>>>> failed (-
    >>>>> 104)
    >>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
    >>>>> 00005630621e6580 sp
    >>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
    >>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 
    >>>>> 0f 1f 44 00 00 41 81
    >>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
    >>>>> <49> 8b 96 88
    >>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
    >>>>>
    >>>>>
    >>>>>
    >>>>>        _______________________________________________
    >>>>>        SPDK mailing list
    >>>>>        SPDK(a)lists.01.org
    >>>>>        
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
    >>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
    >>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
    >>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>>>>
    >>>>>
    >>>>> _______________________________________________
    >>>>> SPDK mailing list
    >>>>> SPDK(a)lists.01.org
    >>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
    >>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
    >>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
    >>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
    >>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>>> _______________________________________________
    >>>> SPDK mailing list
    >>>> SPDK(a)lists.01.org
    >>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
    >>>> lists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgenii
    >>>> k%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4
    >>>> d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u
    >>>> 6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>> _______________________________________________
    >>> SPDK mailing list
    >>> SPDK(a)lists.01.org
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
    >>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
    >>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
    >>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
    >>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >>> _______________________________________________
    >>> SPDK mailing list
    >>> SPDK(a)lists.01.org
    >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
    >>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
    >>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
    >>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
    >>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    >> _______________________________________________
    >> SPDK mailing list
    >> SPDK(a)lists.01.org
    >> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fli
    >> sts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40
    >> mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a
    >> 4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5N
    >> ul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    > _______________________________________________
    > SPDK mailing list
    > SPDK(a)lists.01.org
    > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis
    > ts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40me
    > llanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d1
    > 49256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5z
    > x%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-22 17:55 Evgenii Kochetov
  0 siblings, 0 replies; 17+ messages in thread
From: Evgenii Kochetov @ 2018-11-22 17:55 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 21182 bytes --]

Hi,

To follow up Sasha's email with more details and add some food for thought.

First of all, here is an easy way to reproduce the problem:
    1. Disable all hugepages except 2MB
	echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
    2. Create simple NVMf target config:
	[Transport]
	  Type RDMA
	[Null]
	  Dev Null0 4096 4096
	[Subsystem0]
	  NQN nqn.2016-06.io.spdk:cnode0
	  SN SPDK000DEADBEAF00
	  Namespace Null0
	  Listen RDMA 1.1.1.1:4420
	  AllowAnyHost yes
    3. Start NVMf target app:
	./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -L rdma
    4. Start initiator (perf tool):
	./examples/nvme/perf/perf -q 16 -o 131072 -w read -t 10 -r 'trtype:RDMA adrfam:IPv4 traddr:1.1.1.1 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode0'
    5. Check for errors on target:
	mlx5: host: got completion with error:
	00000000 00000000 00000000 00000000
	00000000 00000000 00000000 00000000
	00000001 00000000 00000000 00000000
	00000000 9d005304 0800032d 0002bfd2
	rdma.c:2584:spdk_nvmf_rdma_poller_poll: *DEBUG*: CQ error on CQ 0xc27560, Request 0x13392248 (4): local protection error

The root cause, as was noted already, is the dynamic memory allocation feature - to be more precise, the part that splits allocated memory regions into hugepage-sized segments in the function memory_hotplug_cb in lib/env_dpdk/memory.c. This code was added in change https://review.gerrithub.io/c/spdk/spdk/+/428716. As a result, a separate MR is registered for each 2MB hugepage. When we create the memory pool for data buffers (data_buf_pool) in lib/nvmf/rdma.c, some buffers cross a hugepage (and therefore MR) boundary. When such a buffer is used for an RDMA operation, a local protection error is generated.
The change itself looks reasonable and it's not clear at the moment how this problem can be fixed.
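
To make the boundary crossing concrete, here is a standalone toy calculation (the 64-byte per-object header is an assumption for illustration, not the real rte_mempool layout):

    #include <stdio.h>
    #include <stdint.h>

    #define HUGEPAGE (2u * 1024 * 1024)  /* 2MB page, one MR each */
    #define BUF_SIZE (128u * 1024)       /* 128KB data buffer */
    #define HDR_SIZE 64u                 /* assumed per-object overhead */

    int main(void)
    {
        uint64_t off = 0;
        unsigned int i, crossings = 0;

        for (i = 0; i < 64; i++) {
            uint64_t start = off + HDR_SIZE;      /* first data byte */
            uint64_t end = start + BUF_SIZE - 1;  /* last data byte */

            if (start / HUGEPAGE != end / HUGEPAGE) {
                crossings++;                      /* straddles two 2MB pages */
            }
            off = end + 1;                        /* next object follows */
        }
        printf("%u of 64 buffers cross a 2MB boundary\n", crossings);
        return 0;
    }

Any buffer counted there ends up backed by two separate MRs once each 2MB hugepage is registered on its own, which matches the local protection errors above.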

As Ben said, the -s parameter can be used as a workaround. In our setup, when we add '-s 1024' to the target parameters, no errors occur.
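
For reference, that is just the reproduction command from step 3 with the pre-allocation size added:
    ./app/nvmf_tgt/nvmf_tgt -c nvmf.conf -m 0x01 -s 1024 -L rdma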

BR, Evgeniy.

-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Sasha Kotchubievsky
Sent: Thursday, November 22, 2018 3:21 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>
Subject: Re: [SPDK] nvmf_tgt seg fault

Hi,

It looks like some allocations cross the huge-page boundary, and the MR boundary as well.

After switching to bigger huge-pages (1GB instead of 2MB), we don't see the problem.

The crash after hitting "local protection error" is solved in "master". In 18.10, the crash is the result of wrong processing of the op code in the RDMA completion.

Sasha

On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
> On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
>> We see the similar issue on ARM platform with SPDK 18.10+couple our 
>> ARM related patches
>>
>> target crashes after receiving completion with in error state
>>
>> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
>> 0x7ff220, Request 0x13586616 (4): local protection error
>>
>> target crashes after "local protection error" followed by flush errors.
>> The same pattern I see in logs reported in the email thread.
> Joe, Sasha - can you both try to reproduce the issue after having the 
> NVMe-oF target pre-allocate its memory? This is the '-s' option to the 
> target. Set it to at least 4GB to be safe. I'm concerned that this is 
> a problem introduced by the patches that enable dynamic memory 
> allocation (which creates multiple ibv_mrs and requires smarter splitting code that doesn't exist yet).
>
>
>> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so 
>> I'd to check if it solves the issue.
>>
>> Best regards
>>
>> Sasha
>>
>> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
>>> FYI the test I ran was on master as of Fri... I can check versions 
>>> if you tell me the steps to get exactly what you're looking for
>>>
>>> Thx
>>> Paul
>>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha 
>>> Kotchubievsky
>>> Sent: Sunday, November 18, 2018 9:52 AM
>>> To: spdk(a)lists.01.org
>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>>
>>> Hi,
>>>
>>> Can you check the issue in latest master?
>>>
>>> Is
>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fg
>>> ithub.com%2Fspdk%2Fspdk%2Fcommit%2F90b4bd6cf9bb5805c0c6d8df982ac5f2e
>>> 3d90cce&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b5
>>> 81faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C63678
>>> 4860985451622&amp;sdata=KGlTXoO7JkRyLYE3CcPAoXZVO8Q8VPbSIHPQngsrjxQ%
>>> 3D&amp;reserved=0 merged recently changes the behavior ?
>>>
>>> Do you use upstream OFED or Mellanox MOFED? Which version ?
>>>
>>> Best regards
>>>
>>> Sasha
>>>
>>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
>>>> Sure, done.  Issue #500, do I win a prize? :)
>>>>
>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
>>>> github.com%2Fspdk%2Fspdk%2Fissues%2F500&amp;data=02%7C01%7Cevgeniik
>>>> %40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d
>>>> 9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=0sR9wMRj1
>>>> CrM3MQsLXEB8cPEABFDyky8Xo%2Fs3SV97Sc%3D&amp;reserved=0
>>>>
>>>> -Joe
>>>>
>>>>> -----Original Message-----
>>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
>>>>> James R
>>>>> Sent: Wednesday, November 14, 2018 11:16 AM
>>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>>>>
>>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>>>>>
>>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
>>>>> Fgithub.com%2Fspdk%2Fspdk%2Fissues&amp;data=02%7C01%7Cevgeniik%40m
>>>>> ellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba
>>>>> 6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=BAJdbf4bpTp
>>>>> FVnbHOeiuzdlC3xya3KoXFY8Dvs8Sjqk%3D&amp;reserved=0
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Jim
>>>>>
>>>>>
>>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" 
>>>>> <spdk- bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>>>>>
>>>>>        Hi everyone-
>>>>>
>>>>>        I'm running a dual socket Skylake server with P4510 NVMe 
>>>>> and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
>>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
>>>>> NVMeoF target and exercising it from an initiator system (similar 
>>>>> config to the target but with 50Gb NIC) using FIO with the bdev 
>>>>> plugin.  I find 128K sequential workloads reliably and immediately 
>>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
>>>>> experiencing the seg fault, so the problem seems tied to the block 
>>>>> size and/or IO pattern.  I can run the same IO pattern against a 
>>>>> local PCIe device using SPDK without a problem, I only see the 
>>>>> failure when running the NVMeoF target with FIO running the IO 
>>>>> patter from an SPDK initiator system.
>>>>>
>>>>>        Steps to reproduce and seg fault output follow below.
>>>>>
>>>>>        Start the target:
>>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
>>>>> /var/tmp/spdk1.sock
>>>>>
>>>>>        Configure the target:
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d1 -t pcie - a 0000:1a:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d2 -t pcie - a 0000:1b:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d3 -t pcie - a 0000:1c:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d4 -t pcie - a 0000:1d:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d5 -t pcie - a 0000:3d:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d6 -t pcie - a 0000:3e:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d7 -t pcie - a 0000:3f:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b 
>>>>> d8 -t pcie - a 0000:40:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
>>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store 
>>>>> raid1
>>>>> store1
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l1
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l2
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l3
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l4
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l5
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l6
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l7
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l8
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l9
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l10
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l11
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l12
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn1 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn2 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn3 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn4 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn5 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn6 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn7 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn8 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn9 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn10 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn11 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn12 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn1 store1/l1
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn2 store1/l2
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn3 store1/l3
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn4 store1/l4
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn5 store1/l5
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn6 store1/l6
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn7 store1/l7
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn8 store1/l8
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn9 store1/l9
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn10 store1/l10
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn11 store1/l11
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn12 store1/l12
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock 
>>>>> nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>>>>>
>>>>>        FIO file on initiator:
>>>>>        [global]
>>>>>        rw=rw
>>>>>        rwmixread=100
>>>>>        numjobs=1
>>>>>        iodepth=32
>>>>>        bs=128k
>>>>>        direct=1
>>>>>        thread=1
>>>>>        time_based=1
>>>>>        ramp_time=10
>>>>>        runtime=10
>>>>>        ioengine=spdk_bdev
>>>>>        spdk_conf=/home/don/fio/nvmeof.conf
>>>>>        group_reporting=1
>>>>>        unified_rw_reporting=1
>>>>>        exitall=1
>>>>>        randrepeat=0
>>>>>        norandommap=1
>>>>>        cpus_allowed_policy=split
>>>>>        cpus_allowed=1-2
>>>>>        [job1]
>>>>>        filename=b0n1
>>>>>
>>>>>        Config file on initiator:
>>>>>        [Nvme]
>>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
>>>>> subnqn:nqn.2018-
>>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>>>>>
>>>>>        Run FIO on initiator and nvmf_tgt seg faults immediate:
>>>>>        sudo
>>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plu
>>>>> gi
>>>>> n fio sr.ini
>>>>>
>>>>>        Seg fault looks like this:
>>>>>        mlx5: donsl202: got completion with error:
>>>>>        00000000 00000000 00000000 00000000
>>>>>        00000000 00000000 00000000 00000000
>>>>>        00000001 00000000 00000000 00000000
>>>>>        00000000 9d005304 0800011b 0008d0d2
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
>>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
>>>>> QP#1 changed to: IBV_QPS_ERR
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
>>>>> on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
>>>>> Flushed Error
>>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
>>>>> QP#1 changed to: IBV_QPS_ERR
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
>>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
>>>>> Flushed Error
>>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV 
>>>>> QP#1 changed to: IBV_QPS_ERR
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error 
>>>>> on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
>>>>> Flushed Error
>>>>>        Segmentation fault
>>>>>
>>>>>        Adds this to dmesg:
>>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8 
>>>>> (invalid service ID).
>>>>>        [71561.866466] nvme nvme1: rdma connection establishment 
>>>>> failed (-
>>>>> 104)
>>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
>>>>> 00005630621e6580 sp
>>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 
>>>>> 0f 1f 44 00 00 41 81
>>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
>>>>> <49> 8b 96 88
>>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>>>>>
>>>>>
>>>>>
>>>>>        _______________________________________________
>>>>>        SPDK mailing list
>>>>>        SPDK(a)lists.01.org
>>>>>        
>>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
>>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
>>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
>>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
>>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> SPDK mailing list
>>>>> SPDK(a)lists.01.org
>>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2
>>>>> Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgen
>>>>> iik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d
>>>>> 2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBG
>>>>> bU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2F
>>>> lists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgenii
>>>> k%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4
>>>> d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u
>>>> 6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
>>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
>>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
>>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>>> ists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%
>>> 40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9b
>>> a6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj6
>>> 4t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fli
>> sts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40
>> mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a
>> 4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5N
>> ul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis
> ts.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40me
> llanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d1
> 49256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5z
> x%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cevgeniik%40mellanox.com%7C0618f1d9969f4b581faf08d650750bd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636784860985451622&amp;sdata=tJBGbU3u6Bj64t5Nul5zx%2FBNb36fLhAJtQPqR4PrRlc%3D&amp;reserved=0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-22 12:21 Sasha Kotchubievsky
  0 siblings, 0 replies; 17+ messages in thread
From: Sasha Kotchubievsky @ 2018-11-22 12:21 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 15266 bytes --]

Hi,

It looks like some allocations cross the huge-page boundary, and the MR
boundary as well.

After switching to bigger huge-pages (1GB instead of 2MB), we don't see
the problem.
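
For example (the page count here is arbitrary, and 1GB pages are best
reserved early, before memory gets fragmented):
    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages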

The crash after hitting "local protection error" is solved in "master".
In 18.10, the crash is the result of wrong processing of the op code in
the RDMA completion.

Sasha

On 11/21/2018 7:03 PM, Walker, Benjamin wrote:
> On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
>> We see the similar issue on ARM platform with SPDK 18.10+couple our ARM
>> related patches
>>
>> target crashes after receiving completion with in error state
>>
>> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
>> 0x7ff220, Request 0x13586616 (4): local protection error
>>
>> target crashes after "local protection error" followed by flush errors.
>> The same pattern I see in logs reported in the email thread.
> Joe, Sasha - can you both try to reproduce the issue after having the NVMe-oF
> target pre-allocate its memory? This is the '-s' option to the target. Set it to
> at least 4GB to be safe. I'm concerned that this is a problem introduced by the
> patches that enable dynamic memory allocation (which creates multiple ibv_mrs
> and requires smarter splitting code that doesn't exist yet).
>
>
>> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so I'd
>> to check if it solves the issue.
>>
>> Best regards
>>
>> Sasha
>>
>> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
>>> FYI the test I ran was on master as of Fri... I can check versions if you
>>> tell me the steps to get exactly what you're looking for
>>>
>>> Thx
>>> Paul
>>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
>>> Kotchubievsky
>>> Sent: Sunday, November 18, 2018 9:52 AM
>>> To: spdk(a)lists.01.org
>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>>
>>> Hi,
>>>
>>> Can you check the issue in latest master?
>>>
>>> Is
>>> https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
>>> merged recently changes the behavior ?
>>>
>>> Do you use upstream OFED or Mellanox MOFED? Which version ?
>>>
>>> Best regards
>>>
>>> Sasha
>>>
>>> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
>>>> Sure, done.  Issue #500, do I win a prize? :)
>>>>
>>>> https://github.com/spdk/spdk/issues/500
>>>>
>>>> -Joe
>>>>
>>>>> -----Original Message-----
>>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris,
>>>>> James R
>>>>> Sent: Wednesday, November 14, 2018 11:16 AM
>>>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>>>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>>>>
>>>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>>>>>
>>>>> https://github.com/spdk/spdk/issues
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Jim
>>>>>
>>>>>
>>>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-
>>>>> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>>>>>
>>>>>        Hi everyone-
>>>>>
>>>>>        I'm running a dual socket Skylake server with P4510 NVMe and
>>>>> 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
>>>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK
>>>>> NVMeoF target and exercising it from an initiator system (similar
>>>>> config to the target but with 50Gb NIC) using FIO with the bdev
>>>>> plugin.  I find 128K sequential workloads reliably and immediately
>>>>> seg fault nvmf_tgt.  I can run 4KB random workloads without
>>>>> experiencing the seg fault, so the problem seems tied to the block
>>>>> size and/or IO pattern.  I can run the same IO pattern against a
>>>>> local PCIe device using SPDK without a problem, I only see the failure
>>>>> when running the NVMeoF target with FIO running the IO patter from an
>>>>> SPDK initiator system.
>>>>>
>>>>>        Steps to reproduce and seg fault output follow below.
>>>>>
>>>>>        Start the target:
>>>>>        sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
>>>>> /var/tmp/spdk1.sock
>>>>>
>>>>>        Configure the target:
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1
>>>>> -t pcie - a 0000:1a:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2
>>>>> -t pcie - a 0000:1b:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3
>>>>> -t pcie - a 0000:1c:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4
>>>>> -t pcie - a 0000:1d:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5
>>>>> -t pcie - a 0000:3d:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6
>>>>> -t pcie - a 0000:3e:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7
>>>>> -t pcie - a 0000:3f:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8
>>>>> -t pcie - a 0000:40:00.0
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
>>>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1
>>>>> store1
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l1
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l2
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l3
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l4
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l5
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l6
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l7
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l8
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l9
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l10
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l11
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>>>> store1 l12
>>>>> 1200000
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn1 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn2 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn3 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn4 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn5 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn6 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn7 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn8 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn9 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn10 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn11 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn12 -a
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn1 store1/l1
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn2 store1/l2
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn3 store1/l3
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn4 store1/l4
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn5 store1/l5
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn6 store1/l6
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn7 store1/l7
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn8 store1/l8
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn9 store1/l9
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn10 store1/l10
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn11 store1/l11
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>>>> nqn.2018-
>>>>> 11.io.spdk:nqn12 store1/l12
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>>>>>        sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>>>>>
>>>>>        FIO file on initiator:
>>>>>        [global]
>>>>>        rw=rw
>>>>>        rwmixread=100
>>>>>        numjobs=1
>>>>>        iodepth=32
>>>>>        bs=128k
>>>>>        direct=1
>>>>>        thread=1
>>>>>        time_based=1
>>>>>        ramp_time=10
>>>>>        runtime=10
>>>>>        ioengine=spdk_bdev
>>>>>        spdk_conf=/home/don/fio/nvmeof.conf
>>>>>        group_reporting=1
>>>>>        unified_rw_reporting=1
>>>>>        exitall=1
>>>>>        randrepeat=0
>>>>>        norandommap=1
>>>>>        cpus_allowed_policy=split
>>>>>        cpus_allowed=1-2
>>>>>        [job1]
>>>>>        filename=b0n1
>>>>>
>>>>>        Config file on initiator:
>>>>>        [Nvme]
>>>>>        TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
>>>>> subnqn:nqn.2018-
>>>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>>>>>
>>>>>        Run FIO on initiator and nvmf_tgt seg faults immediate:
>>>>>        sudo
>>>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
>>>>> n fio sr.ini
>>>>>
>>>>>        Seg fault looks like this:
>>>>>        mlx5: donsl202: got completion with error:
>>>>>        00000000 00000000 00000000 00000000
>>>>>        00000000 00000000 00000000 00000000
>>>>>        00000001 00000000 00000000 00000000
>>>>>        00000000 9d005304 0800011b 0008d0d2
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>>>> CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>>>>> changed to: IBV_QPS_ERR
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>>>> CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request
>>>>> Flushed Error
>>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>>>>> changed to: IBV_QPS_ERR
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>>>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
>>>>> Flushed Error
>>>>>        rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>>>>> changed to: IBV_QPS_ERR
>>>>>        rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>>>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
>>>>> Flushed Error
>>>>>        Segmentation fault
>>>>>
>>>>>        Adds this to dmesg:
>>>>>        [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid
>>>>> service ID).
>>>>>        [71561.866466] nvme nvme1: rdma connection establishment failed (-
>>>>> 104)
>>>>>        [71567.805288] reactor_7[9166]: segfault at 88 ip
>>>>> 00005630621e6580 sp
>>>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>>>>>        [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f
>>>>> 1f 44 00 00 41 81
>>>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00
>>>>> <49> 8b 96 88
>>>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>>>>>
>>>>>
>>>>>
>>>>>        _______________________________________________
>>>>>        SPDK mailing list
>>>>>        SPDK(a)lists.01.org
>>>>>        https://lists.01.org/mailman/listinfo/spdk
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> SPDK mailing list
>>>>> SPDK(a)lists.01.org
>>>>> https://lists.01.org/mailman/listinfo/spdk
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-21 19:28 Gruher, Joseph R
  0 siblings, 0 replies; 17+ messages in thread
From: Gruher, Joseph R @ 2018-11-21 19:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 1782 bytes --]

> Joe, Sasha - can you both try to reproduce the issue after having the NVMe-oF
> target pre-allocate its memory? This is the '-s' option to the target. Set it to at
> least 4GB to be safe. I'm concerned that this is a problem introduced by the
> patches that enable dynamic memory allocation (which creates multiple ibv_mrs
> and requires smarter splitting code that doesn't exist yet).

I can't seem to start the target with -s 4096; the most it will allow is 2048:

don(a)donsl202:~/install/spdk/scripts$ sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -s 4096 -m 0x0F0000 -r /var/tmp/spdk2.sock
Starting SPDK v19.01-pre / DPDK 18.08.0 initialization...
[ DPDK EAL parameters: nvmf --no-shconf -c 0x0F0000 -m 4096 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid24899 ]
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: Not enough memory available! Requested: 4096MB, available: 2048MB
EAL: FATAL: Cannot init memory
EAL: Cannot init memory
Failed to initialize DPDK
Unable to initialize SPDK env 
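
(The 2048 MB limit just reflects how much huge-page memory is currently
reserved. A minimal sketch of reserving 4 GB before starting the target;
HUGEMEM is assumed to be honored by scripts/setup.sh, and the raw sysfs write
is the equivalent for 2 MB pages:)

  # Option A: assumed setup.sh interface
  sudo HUGEMEM=4096 ~/install/spdk/scripts/setup.sh
  # Option B: reserve 2048 x 2 MB pages directly
  sudo sh -c 'echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages'
  # On this dual-socket box the pages may also need to land on the NUMA node the reactors use.
  # Then retry with pre-allocation:
  sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -s 4096 -m 0x0F0000 -r /var/tmp/spdk2.sock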

If I use a value of 2048, with master I still get something like this when I launch FIO, and FIO does not report any IO:

rdma.c:2487:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f68d0009250, Request 0x140088144468024 (5): Work Request Flushed Error 

If I try it on 18.10 I still get my seg fault:

mlx5: donsl202: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008813 080000d2 00362ad2
rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7fcde409d050, Request 0x140522271589792 (10): remote access error
Segmentation fault
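
(If it helps with the 18.10 crash, a sketch of capturing a backtrace by running
the target under gdb; the paths and arguments are the same ones used earlier in
this thread and assume a build with debug symbols:)

  sudo gdb --args ~/install/spdk/app/nvmf_tgt/nvmf_tgt -s 2048 -m 0x0F0000 -r /var/tmp/spdk2.sock
  (gdb) run
  # ...start the FIO job on the initiator and wait for SIGSEGV...
  (gdb) bt                    # backtrace of the faulting reactor
  (gdb) thread apply all bt   # in case another reactor is involved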

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-21 17:03 Walker, Benjamin
  0 siblings, 0 replies; 17+ messages in thread
From: Walker, Benjamin @ 2018-11-21 17:03 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 15222 bytes --]

On Tue, 2018-11-20 at 09:57 +0200, Sasha Kotchubievsky wrote:
> 
> We see the similar issue on ARM platform with SPDK 18.10+couple our ARM 
> related patches
> 
> target crashes after receiving completion with in error state
> 
> rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
> 0x7ff220, Request 0x13586616 (4): local protection error
> 
> target crashes after "local protection error" followed by flush errors. 
> The same pattern I see in logs reported in the email thread.

Joe, Sasha - can you both try to reproduce the issue after having the NVMe-oF
target pre-allocate its memory? This is the '-s' option to the target. Set it to
at least 4GB to be safe. I'm concerned that this is a problem introduced by the
patches that enable dynamic memory allocation (which creates multiple ibv_mrs
and requires smarter splitting code that doesn't exist yet).


> 
> 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so I'd 
> to check if it solves the issue.
> 
> Best regards
> 
> Sasha
> 
> On 11/18/2018 9:19 PM, Luse, Paul E wrote:
> > FYI the test I ran was on master as of Fri... I can check versions if you
> > tell me the steps to get exactly what you're looking for
> > 
> > Thx
> > Paul
> > 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
> > Kotchubievsky
> > Sent: Sunday, November 18, 2018 9:52 AM
> > To: spdk(a)lists.01.org
> > Subject: Re: [SPDK] nvmf_tgt seg fault
> > 
> > Hi,
> > 
> > Can you check the issue in latest master?
> > 
> > Is
> > https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
> > merged recently changes the behavior ?
> > 
> > Do you use upstream OFED or Mellanox MOFED? Which version ?
> > 
> > Best regards
> > 
> > Sasha
> > 
> > On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
> > > Sure, done.  Issue #500, do I win a prize? :)
> > > 
> > > https://github.com/spdk/spdk/issues/500
> > > 
> > > -Joe
> > > 
> > > > -----Original Message-----
> > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris,
> > > > James R
> > > > Sent: Wednesday, November 14, 2018 11:16 AM
> > > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > > Subject: Re: [SPDK] nvmf_tgt seg fault
> > > > 
> > > > Thanks for the report Joe.  Could you file an issue in GitHub for this?
> > > > 
> > > > https://github.com/spdk/spdk/issues
> > > > 
> > > > Thanks,
> > > > 
> > > > -Jim
> > > > 
> > > > 
> > > > On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-
> > > > bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
> > > > 
> > > >       Hi everyone-
> > > > 
> > > >       I'm running a dual socket Skylake server with P4510 NVMe and
> > > > 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
> > > > SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK
> > > > NVMeoF target and exercising it from an initiator system (similar
> > > > config to the target but with 50Gb NIC) using FIO with the bdev
> > > > plugin.  I find 128K sequential workloads reliably and immediately
> > > > seg fault nvmf_tgt.  I can run 4KB random workloads without
> > > > experiencing the seg fault, so the problem seems tied to the block
> > > > size and/or IO pattern.  I can run the same IO pattern against a
> > > > local PCIe device using SPDK without a problem, I only see the failure
> > > > when running the NVMeoF target with FIO running the IO patter from an
> > > > SPDK initiator system.
> > > > 
> > > >       Steps to reproduce and seg fault output follow below.
> > > > 
> > > >       Start the target:
> > > >       sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
> > > > /var/tmp/spdk1.sock
> > > > 
> > > >       Configure the target:
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1
> > > > -t pcie - a 0000:1a:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2
> > > > -t pcie - a 0000:1b:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3
> > > > -t pcie - a 0000:1c:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4
> > > > -t pcie - a 0000:1d:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5
> > > > -t pcie - a 0000:3d:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6
> > > > -t pcie - a 0000:3e:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7
> > > > -t pcie - a 0000:3f:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8
> > > > -t pcie - a 0000:40:00.0
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
> > > > raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1
> > > > store1
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l1
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l2
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l3
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l4
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l5
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l6
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l7
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l8
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l9
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l10
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l11
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> > > > store1 l12
> > > > 1200000
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn1 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn2 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn3 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn4 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn5 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn6 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn7 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn8 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn9 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn10 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn11 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> > > > nqn.2018-
> > > > 11.io.spdk:nqn12 -a
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn1 store1/l1
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn2 store1/l2
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn3 store1/l3
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn4 store1/l4
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn5 store1/l5
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn6 store1/l6
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn7 store1/l7
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn8 store1/l8
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn9 store1/l9
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn10 store1/l10
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn11 store1/l11
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> > > > nqn.2018-
> > > > 11.io.spdk:nqn12 store1/l12
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
> > > >       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> > > > nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
> > > > 
> > > >       FIO file on initiator:
> > > >       [global]
> > > >       rw=rw
> > > >       rwmixread=100
> > > >       numjobs=1
> > > >       iodepth=32
> > > >       bs=128k
> > > >       direct=1
> > > >       thread=1
> > > >       time_based=1
> > > >       ramp_time=10
> > > >       runtime=10
> > > >       ioengine=spdk_bdev
> > > >       spdk_conf=/home/don/fio/nvmeof.conf
> > > >       group_reporting=1
> > > >       unified_rw_reporting=1
> > > >       exitall=1
> > > >       randrepeat=0
> > > >       norandommap=1
> > > >       cpus_allowed_policy=split
> > > >       cpus_allowed=1-2
> > > >       [job1]
> > > >       filename=b0n1
> > > > 
> > > >       Config file on initiator:
> > > >       [Nvme]
> > > >       TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
> > > > subnqn:nqn.2018-
> > > > 11.io.spdk:nqn1 adrfam:IPv4" b0
> > > > 
> > > >       Run FIO on initiator and nvmf_tgt seg faults immediate:
> > > >       sudo
> > > > LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
> > > > n fio sr.ini
> > > > 
> > > >       Seg fault looks like this:
> > > >       mlx5: donsl202: got completion with error:
> > > >       00000000 00000000 00000000 00000000
> > > >       00000000 00000000 00000000 00000000
> > > >       00000001 00000000 00000000 00000000
> > > >       00000000 9d005304 0800011b 0008d0d2
> > > >       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> > > > CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
> > > >       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> > > > changed to: IBV_QPS_ERR
> > > >       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> > > > CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request
> > > > Flushed Error
> > > >       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> > > > changed to: IBV_QPS_ERR
> > > >       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> > > > CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
> > > > Flushed Error
> > > >       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> > > > changed to: IBV_QPS_ERR
> > > >       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> > > > CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
> > > > Flushed Error
> > > >       Segmentation fault
> > > > 
> > > >       Adds this to dmesg:
> > > >       [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid
> > > > service ID).
> > > >       [71561.866466] nvme nvme1: rdma connection establishment failed (-
> > > > 104)
> > > >       [71567.805288] reactor_7[9166]: segfault at 88 ip
> > > > 00005630621e6580 sp
> > > > 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
> > > >       [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f
> > > > 1f 44 00 00 41 81
> > > > f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00
> > > > <49> 8b 96 88
> > > > 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
> > > > 
> > > > 
> > > > 
> > > >       _______________________________________________
> > > >       SPDK mailing list
> > > >       SPDK(a)lists.01.org
> > > >       https://lists.01.org/mailman/listinfo/spdk
> > > > 
> > > > 
> > > > _______________________________________________
> > > > SPDK mailing list
> > > > SPDK(a)lists.01.org
> > > > https://lists.01.org/mailman/listinfo/spdk
> > > 
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://lists.01.org/mailman/listinfo/spdk
> > 
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-20  7:57 Sasha Kotchubievsky
  0 siblings, 0 replies; 17+ messages in thread
From: Sasha Kotchubievsky @ 2018-11-20  7:57 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14757 bytes --]

I just want to be sure that you are running a source tree that includes
90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce.
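
(A quick way to verify that, assuming a git checkout of the SPDK tree:)

  cd spdk
  # exits 0 only if HEAD already contains that commit
  git merge-base --is-ancestor 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce HEAD && echo included || echo missing
  git log --oneline -1 90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce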

We see a similar issue on an ARM platform with SPDK 18.10 plus a couple of
our ARM-related patches.

The target crashes after receiving a completion in error state:

mlx5: retchet01-snic.mtr.labs.mlnx: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

0000000c 00000000 00000000 00000000

00000000 9d005304 08000243 ca75dbd2

rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
0x7ff220, Request 0x13586616 (4): local protection error

rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#12 changed 
to: IBV_QPS_ERR

rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
0x7ff220, Request 0x13586616 (5): Work Request Flushed Error

rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#12 changed 
to: IBV_QPS_ERR

rdma.c:2699:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 
0x7ff220, Request 0x13637696 (5): Work Request Flushed Error

Program received signal SIGSEGV, Segmentation fault.

0x000000000044bb20 in spdk_nvmf_rdma_set_ibv_state ()

Missing separate debuginfos, use: debuginfo-install spdk-18.10-2.el7.aarch64

(gdb)

(gdb) bt

#0 0x000000000044bb20 in spdk_nvmf_rdma_set_ibv_state ()

#1 0x000000000044cf44 in spdk_nvmf_rdma_poller_poll.isra.9 ()

#2 0x000000000044d110 in spdk_nvmf_rdma_poll_group_poll ()

#3 0x000000000044a12c in spdk_nvmf_transport_poll_group_poll ()

#4 0x00000000004483bc in spdk_nvmf_poll_group_poll ()

#5 0x0000000000452ae8 in _spdk_reactor_run ()

#6 0x0000000000453080 in spdk_reactors_start ()

#7 0x0000000000451e44 in spdk_app_start ()

#8 0x000000000040803c in main ()


target crashes after "local protection error" followed by flush errors. 
The same pattern I see in logs reported in the email thread.

90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce improves error handling, so I'd like
to check whether it solves the issue.

Best regards

Sasha

On 11/18/2018 9:19 PM, Luse, Paul E wrote:
> FYI the test I ran was on master as of Fri... I can check versions if you tell me the steps to get exactly what you're looking for
>
> Thx
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
> Sent: Sunday, November 18, 2018 9:52 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] nvmf_tgt seg fault
>
> Hi,
>
> Can you check the issue in latest master?
>
> Is
> https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
> merged recently changes the behavior ?
>
> Do you use upstream OFED or Mellanox MOFED? Which version ?
>
> Best regards
>
> Sasha
>
> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
>> Sure, done.  Issue #500, do I win a prize? :)
>>
>> https://github.com/spdk/spdk/issues/500
>>
>> -Joe
>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris,
>>> James R
>>> Sent: Wednesday, November 14, 2018 11:16 AM
>>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>>
>>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>>>
>>> https://github.com/spdk/spdk/issues
>>>
>>> Thanks,
>>>
>>> -Jim
>>>
>>>
>>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-
>>> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>>>
>>>       Hi everyone-
>>>
>>>       I'm running a dual socket Skylake server with P4510 NVMe and
>>> 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
>>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK
>>> NVMeoF target and exercising it from an initiator system (similar
>>> config to the target but with 50Gb NIC) using FIO with the bdev
>>> plugin.  I find 128K sequential workloads reliably and immediately
>>> seg fault nvmf_tgt.  I can run 4KB random workloads without
>>> experiencing the seg fault, so the problem seems tied to the block
>>> size and/or IO pattern.  I can run the same IO pattern against a
>>> local PCIe device using SPDK without a problem, I only see the failure when running the NVMeoF target with FIO running the IO patter from an SPDK initiator system.
>>>
>>>       Steps to reproduce and seg fault output follow below.
>>>
>>>       Start the target:
>>>       sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
>>> /var/tmp/spdk1.sock
>>>
>>>       Configure the target:
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1
>>> -t pcie - a 0000:1a:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2
>>> -t pcie - a 0000:1b:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3
>>> -t pcie - a 0000:1c:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4
>>> -t pcie - a 0000:1d:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5
>>> -t pcie - a 0000:3d:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6
>>> -t pcie - a 0000:3e:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7
>>> -t pcie - a 0000:3f:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8
>>> -t pcie - a 0000:40:00.0
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
>>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l1
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l2
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l3
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l4
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l5
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l6
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l7
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l8
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l9
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l10
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l11
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
>>> store1 l12
>>> 1200000
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn1 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn2 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn3 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn4 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn5 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn6 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn7 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn8 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn9 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn10 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn11 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
>>> nqn.2018-
>>> 11.io.spdk:nqn12 -a
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn1 store1/l1
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn2 store1/l2
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn3 store1/l3
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn4 store1/l4
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn5 store1/l5
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn6 store1/l6
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn7 store1/l7
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn8 store1/l8
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn9 store1/l9
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn10 store1/l10
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn11 store1/l11
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
>>> nqn.2018-
>>> 11.io.spdk:nqn12 store1/l12
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>>>       sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>>>
>>>       FIO file on initiator:
>>>       [global]
>>>       rw=rw
>>>       rwmixread=100
>>>       numjobs=1
>>>       iodepth=32
>>>       bs=128k
>>>       direct=1
>>>       thread=1
>>>       time_based=1
>>>       ramp_time=10
>>>       runtime=10
>>>       ioengine=spdk_bdev
>>>       spdk_conf=/home/don/fio/nvmeof.conf
>>>       group_reporting=1
>>>       unified_rw_reporting=1
>>>       exitall=1
>>>       randrepeat=0
>>>       norandommap=1
>>>       cpus_allowed_policy=split
>>>       cpus_allowed=1-2
>>>       [job1]
>>>       filename=b0n1
>>>
>>>       Config file on initiator:
>>>       [Nvme]
>>>       TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
>>> subnqn:nqn.2018-
>>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>>>
>>>       Run FIO on initiator and nvmf_tgt seg faults immediate:
>>>       sudo
>>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
>>> n fio sr.ini
>>>
>>>       Seg fault looks like this:
>>>       mlx5: donsl202: got completion with error:
>>>       00000000 00000000 00000000 00000000
>>>       00000000 00000000 00000000 00000000
>>>       00000001 00000000 00000000 00000000
>>>       00000000 9d005304 0800011b 0008d0d2
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>> CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>>>       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>>> changed to: IBV_QPS_ERR
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>> CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request
>>> Flushed Error
>>>       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>>> changed to: IBV_QPS_ERR
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
>>> Flushed Error
>>>       rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>>> changed to: IBV_QPS_ERR
>>>       rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
>>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
>>> Flushed Error
>>>       Segmentation fault
>>>
>>>       Adds this to dmesg:
>>>       [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid
>>> service ID).
>>>       [71561.866466] nvme nvme1: rdma connection establishment failed (-104)
>>>       [71567.805288] reactor_7[9166]: segfault at 88 ip
>>> 00005630621e6580 sp
>>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>>>       [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f
>>> 1f 44 00 00 41 81
>>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00
>>> <49> 8b 96 88
>>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>>>
>>>
>>>
>>>       _______________________________________________
>>>       SPDK mailing list
>>>       SPDK(a)lists.01.org
>>>       https://lists.01.org/mailman/listinfo/spdk
>>>
>>>
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-18 21:57 Gruher, Joseph R
  0 siblings, 0 replies; 17+ messages in thread
From: Gruher, Joseph R @ 2018-11-18 21:57 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14317 bytes --]

For my testing, I do not install Mellanox OFED; I just use the Ubuntu packages.
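
(For reference, a sketch of commands that show which RDMA stack and version is
in use; ofed_info only exists when Mellanox OFED is installed, the others come
from the stock Ubuntu rdma-core/libibverbs packages:)

  ofed_info -s                              # prints the MOFED version, or fails if MOFED is not installed
  dpkg -l | grep -E 'rdma-core|ibverbs'     # inbox package versions
  ibv_devinfo | head                        # verbs-level view of the CX4 adapter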

My original report used SPDK 18.10.  I tried to reproduce on master today.  When I start my FIO job I get a bunch of prints like this on the target, but it does not seg fault:

rdma.c:2487:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f430c009250, Request 0x139925942219968 (4): local protection error
rdma.c:2487:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f430c009250, Request 0x139925942219816 (5): Work Request Flushed Error
rdma.c:2487:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f430c009250, Request 0x139925942210248 (5): Work Request Flushed Error
rdma.c:2487:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f430c009250, Request 0x139925942210096 (5): Work Request Flushed Error
rdma.c:2487:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f430c009250, Request 0x139925942211328 (5): Work Request Flushed Error

My FIO job also hangs up and doesn't exit.  I think it is probably not getting any IO from the target.

Paul, if you have time, it would be interesting to see what you get with 18.10.
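
For reference, assuming the SPDK repository tags releases as v18.10, the
comparison build could be set up with something like:

  cd spdk
  git fetch --tags && git checkout v18.10
  git submodule update --init
  ./configure && make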

Thanks,
Joe

> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> Sent: Sunday, November 18, 2018 11:19 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt seg fault
> 
> FYI the test I ran was on master as of Fri... I can check versions if you tell me
> the steps to get exactly what you're looking for
> 
> Thx
> Paul
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha
> Kotchubievsky
> Sent: Sunday, November 18, 2018 9:52 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] nvmf_tgt seg fault
> 
> Hi,
> 
> Can you check the issue in latest master?
> 
> Is
> https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2
> e3d90cce
> merged recently changes the behavior ?
> 
> Do you use upstream OFED or Mellanox MOFED? Which version ?
> 
> Best regards
> 
> Sasha
> 
> On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
> > Sure, done.  Issue #500, do I win a prize? :)
> >
> > https://github.com/spdk/spdk/issues/500
> >
> > -Joe
> >
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris,
> >> James R
> >> Sent: Wednesday, November 14, 2018 11:16 AM
> >> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> >> Subject: Re: [SPDK] nvmf_tgt seg fault
> >>
> >> Thanks for the report Joe.  Could you file an issue in GitHub for this?
> >>
> >> https://github.com/spdk/spdk/issues
> >>
> >> Thanks,
> >>
> >> -Jim
> >>
> >>
> >> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-
> >> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
> >>
> >>      Hi everyone-
> >>
> >>      I'm running a dual socket Skylake server with P4510 NVMe and
> >> 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.
> >> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK
> >> NVMeoF target and exercising it from an initiator system (similar
> >> config to the target but with 50Gb NIC) using FIO with the bdev
> >> plugin.  I find 128K sequential workloads reliably and immediately
> >> seg fault nvmf_tgt.  I can run 4KB random workloads without
> >> experiencing the seg fault, so the problem seems tied to the block
> >> size and/or IO pattern.  I can run the same IO pattern against a
> >> local PCIe device using SPDK without a problem, I only see the failure
> when running the NVMeoF target with FIO running the IO patter from an
> SPDK initiator system.
> >>
> >>      Steps to reproduce and seg fault output follow below.
> >>
> >>      Start the target:
> >>      sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
> >> /var/tmp/spdk1.sock
> >>
> >>      Configure the target:
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1
> >> -t pcie - a 0000:1a:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2
> >> -t pcie - a 0000:1b:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3
> >> -t pcie - a 0000:1c:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4
> >> -t pcie - a 0000:1d:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5
> >> -t pcie - a 0000:3d:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6
> >> -t pcie - a 0000:3e:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7
> >> -t pcie - a 0000:3f:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8
> >> -t pcie - a 0000:40:00.0
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n
> >> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l1
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l2
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l3
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l4
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l5
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l6
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l7
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l8
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l9
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l10
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l11
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l
> >> store1 l12
> >> 1200000
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn1 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn2 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn3 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn4 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn5 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn6 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn7 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn8 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn9 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn10 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn11 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create
> >> nqn.2018-
> >> 11.io.spdk:nqn12 -a
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn1 store1/l1
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn2 store1/l2
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn3 store1/l3
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn4 store1/l4
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn5 store1/l5
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn6 store1/l6
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn7 store1/l7
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn8 store1/l8
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn9 store1/l9
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn10 store1/l10
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn11 store1/l11
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns
> >> nqn.2018-
> >> 11.io.spdk:nqn12 store1/l12
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
> >>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> >> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
> >>
> >>      FIO file on initiator:
> >>      [global]
> >>      rw=rw
> >>      rwmixread=100
> >>      numjobs=1
> >>      iodepth=32
> >>      bs=128k
> >>      direct=1
> >>      thread=1
> >>      time_based=1
> >>      ramp_time=10
> >>      runtime=10
> >>      ioengine=spdk_bdev
> >>      spdk_conf=/home/don/fio/nvmeof.conf
> >>      group_reporting=1
> >>      unified_rw_reporting=1
> >>      exitall=1
> >>      randrepeat=0
> >>      norandommap=1
> >>      cpus_allowed_policy=split
> >>      cpus_allowed=1-2
> >>      [job1]
> >>      filename=b0n1
> >>
> >>      Config file on initiator:
> >>      [Nvme]
> >>      TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420
> >> subnqn:nqn.2018-
> >> 11.io.spdk:nqn1 adrfam:IPv4" b0
> >>
> >>      Run FIO on initiator and nvmf_tgt seg faults immediate:
> >>      sudo
> >>
> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
> >> n fio sr.ini
> >>
> >>      Seg fault looks like this:
> >>      mlx5: donsl202: got completion with error:
> >>      00000000 00000000 00000000 00000000
> >>      00000000 00000000 00000000 00000000
> >>      00000001 00000000 00000000 00000000
> >>      00000000 9d005304 0800011b 0008d0d2
> >>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> >> CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection
> error
> >>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> >> changed to: IBV_QPS_ERR
> >>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> >> CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request
> >> Flushed Error
> >>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> >> changed to: IBV_QPS_ERR
> >>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> >> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
> >> Flushed Error
> >>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> >> changed to: IBV_QPS_ERR
> >>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on
> >> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request
> >> Flushed Error
> >>      Segmentation fault
> >>
> >>      Adds this to dmesg:
> >>      [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid
> >> service ID).
> >>      [71561.866466] nvme nvme1: rdma connection establishment failed (-
> >> 104)
> >>      [71567.805288] reactor_7[9166]: segfault at 88 ip
> >> 00005630621e6580 sp
> >> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
> >>      [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f
> >> 1f 44 00 00 41 81
> >> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00
> >> <49> 8b 96 88
> >> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
> >>
> >>
> >>
> >>      _______________________________________________
> >>      SPDK mailing list
> >>      SPDK(a)lists.01.org
> >>      https://lists.01.org/mailman/listinfo/spdk
> >>
> >>
> >> _______________________________________________
> >> SPDK mailing list
> >> SPDK(a)lists.01.org
> >> https://lists.01.org/mailman/listinfo/spdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-18 19:19 Luse, Paul E
  0 siblings, 0 replies; 17+ messages in thread
From: Luse, Paul E @ 2018-11-18 19:19 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12161 bytes --]

FYI the test I ran was on master as of Fri... I can check versions if you tell me the steps to get exactly what you're looking for

Thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
Sent: Sunday, November 18, 2018 9:52 AM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] nvmf_tgt seg fault

Hi,

Can you check whether the issue still reproduces on the latest master?

Does the recently merged
https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
change the behavior?

Do you use upstream OFED or Mellanox MOFED? Which version?

Best regards

Sasha

On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
> Sure, done.  Issue #500, do I win a prize? :)
>
> https://github.com/spdk/spdk/issues/500
>
> -Joe
>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, 
>> James R
>> Sent: Wednesday, November 14, 2018 11:16 AM
>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>
>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>>
>> https://github.com/spdk/spdk/issues
>>
>> Thanks,
>>
>> -Jim
>>
>>
>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk- 
>> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>>
>>      Hi everyone-
>>
>>      I'm running a dual socket Skylake server with P4510 NVMe and 
>> 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.  
>> SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK 
>> NVMeoF target and exercising it from an initiator system (similar 
>> config to the target but with 50Gb NIC) using FIO with the bdev 
>> plugin.  I find 128K sequential workloads reliably and immediately 
>> seg fault nvmf_tgt.  I can run 4KB random workloads without 
>> experiencing the seg fault, so the problem seems tied to the block 
>> size and/or IO pattern.  I can run the same IO pattern against a 
> >> local PCIe device using SPDK without a problem, I only see the failure when running the NVMeoF target with FIO running the IO pattern from an SPDK initiator system.
>>
>>      Steps to reproduce and seg fault output follow below.
>>
>>      Start the target:
>>      sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r 
>> /var/tmp/spdk1.sock
>>
>>      Configure the target:
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1 
>> -t pcie - a 0000:1a:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2 
>> -t pcie - a 0000:1b:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3 
>> -t pcie - a 0000:1c:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4 
>> -t pcie - a 0000:1d:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5 
>> -t pcie - a 0000:3d:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6 
>> -t pcie - a 0000:3e:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7 
>> -t pcie - a 0000:3f:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8 
>> -t pcie - a 0000:40:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n 
>> raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l1
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l2
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l3
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l4
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l5
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l6
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l7
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l8
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l9
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l10
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l11
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l 
>> store1 l12
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn1 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn2 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn3 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn4 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn5 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn6 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn7 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn8 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn9 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn10 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn11 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create 
>> nqn.2018-
>> 11.io.spdk:nqn12 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn1 store1/l1
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn2 store1/l2
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn3 store1/l3
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn4 store1/l4
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn5 store1/l5
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn6 store1/l6
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn7 store1/l7
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn8 store1/l8
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn9 store1/l9
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn10 store1/l10
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn11 store1/l11
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns 
>> nqn.2018-
>> 11.io.spdk:nqn12 store1/l12
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>>
>>      FIO file on initiator:
>>      [global]
>>      rw=rw
>>      rwmixread=100
>>      numjobs=1
>>      iodepth=32
>>      bs=128k
>>      direct=1
>>      thread=1
>>      time_based=1
>>      ramp_time=10
>>      runtime=10
>>      ioengine=spdk_bdev
>>      spdk_conf=/home/don/fio/nvmeof.conf
>>      group_reporting=1
>>      unified_rw_reporting=1
>>      exitall=1
>>      randrepeat=0
>>      norandommap=1
>>      cpus_allowed_policy=split
>>      cpus_allowed=1-2
>>      [job1]
>>      filename=b0n1
>>
>>      Config file on initiator:
>>      [Nvme]
>>      TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420 
>> subnqn:nqn.2018-
>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>>
>>      Run FIO on initiator and nvmf_tgt seg faults immediately:
>>      sudo
>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
>> n fio sr.ini
>>
>>      Seg fault looks like this:
>>      mlx5: donsl202: got completion with error:
>>      00000000 00000000 00000000 00000000
>>      00000000 00000000 00000000 00000000
>>      00000001 00000000 00000000 00000000
>>      00000000 9d005304 0800011b 0008d0d2
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>> CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 
>> changed to: IBV_QPS_ERR
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>> CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request 
>> Flushed Error
>>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 
>> changed to: IBV_QPS_ERR
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
>> Flushed Error
>>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 
>> changed to: IBV_QPS_ERR
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on 
>> CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request 
>> Flushed Error
>>      Segmentation fault
>>
>>      Adds this to dmesg:
>>      [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid 
>> service ID).
>>      [71561.866466] nvme nvme1: rdma connection establishment failed (-104)
>>      [71567.805288] reactor_7[9166]: segfault at 88 ip 
>> 00005630621e6580 sp
>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>>      [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f 
>> 1f 44 00 00 41 81
>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 
>> <49> 8b 96 88
>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>>
>>
>>
>>      _______________________________________________
>>      SPDK mailing list
>>      SPDK(a)lists.01.org
>>      https://lists.01.org/mailman/listinfo/spdk
>>
>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-18 16:52 Sasha Kotchubievsky
  0 siblings, 0 replies; 17+ messages in thread
From: Sasha Kotchubievsky @ 2018-11-18 16:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 11429 bytes --]

Hi,

Can you check whether the issue still reproduces on the latest master?

Does the recently merged
https://github.com/spdk/spdk/commit/90b4bd6cf9bb5805c0c6d8df982ac5f2e3d90cce
change the behavior?

Do you use upstream OFED or Mellanox MOFED? Which version?

Best regards

Sasha

On 11/14/2018 9:26 PM, Gruher, Joseph R wrote:
> Sure, done.  Issue #500, do I win a prize? :)
>
> https://github.com/spdk/spdk/issues/500
>
> -Joe
>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
>> Sent: Wednesday, November 14, 2018 11:16 AM
>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt seg fault
>>
>> Thanks for the report Joe.  Could you file an issue in GitHub for this?
>>
>> https://github.com/spdk/spdk/issues
>>
>> Thanks,
>>
>> -Jim
>>
>>
>> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-
>> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
>>
>>      Hi everyone-
>>
>>      I'm running a dual socket Skylake server with P4510 NVMe and 100Gb
>> Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.  SPDK version is
>> 18.10, FIO version is 3.12.  I'm running the SPDK NVMeoF target and
>> exercising it from an initiator system (similar config to the target but with
>> 50Gb NIC) using FIO with the bdev plugin.  I find 128K sequential workloads
>> reliably and immediately seg fault nvmf_tgt.  I can run 4KB random workloads
>> without experiencing the seg fault, so the problem seems tied to the block
>> size and/or IO pattern.  I can run the same IO pattern against a local PCIe
>> device using SPDK without a problem, I only see the failure when running the
>> NVMeoF target with FIO running the IO pattern from an SPDK initiator system.
>>
>>      Steps to reproduce and seg fault output follow below.
>>
>>      Start the target:
>>      sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
>> /var/tmp/spdk1.sock
>>
>>      Configure the target:
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1 -t pcie -
>> a 0000:1a:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2 -t pcie -
>> a 0000:1b:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3 -t pcie -
>> a 0000:1c:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4 -t pcie -
>> a 0000:1d:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5 -t pcie -
>> a 0000:3d:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6 -t pcie -
>> a 0000:3e:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7 -t pcie -
>> a 0000:3f:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8 -t pcie -
>> a 0000:40:00.0
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n raid1 -s 4 -r 0
>> -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l1
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l2
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l3
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l4
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l5
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l6
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l7
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l8
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l9
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l10
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l11
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l12
>> 1200000
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn1 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn2 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn3 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn4 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn5 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn6 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn7 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn8 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn9 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn10 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn11 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
>> 11.io.spdk:nqn12 -a
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn1 store1/l1
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn2 store1/l2
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn3 store1/l3
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn4 store1/l4
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn5 store1/l5
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn6 store1/l6
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn7 store1/l7
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn8 store1/l8
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn9 store1/l9
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn10 store1/l10
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn11 store1/l11
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
>> 11.io.spdk:nqn12 store1/l12
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>>      sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
>> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
>>
>>      FIO file on initiator:
>>      [global]
>>      rw=rw
>>      rwmixread=100
>>      numjobs=1
>>      iodepth=32
>>      bs=128k
>>      direct=1
>>      thread=1
>>      time_based=1
>>      ramp_time=10
>>      runtime=10
>>      ioengine=spdk_bdev
>>      spdk_conf=/home/don/fio/nvmeof.conf
>>      group_reporting=1
>>      unified_rw_reporting=1
>>      exitall=1
>>      randrepeat=0
>>      norandommap=1
>>      cpus_allowed_policy=split
>>      cpus_allowed=1-2
>>      [job1]
>>      filename=b0n1
>>
>>      Config file on initiator:
>>      [Nvme]
>>      TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420 subnqn:nqn.2018-
>> 11.io.spdk:nqn1 adrfam:IPv4" b0
>>
>>      Run FIO on initiator and nvmf_tgt seg faults immediately:
>>      sudo
>> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
>> n fio sr.ini
>>
>>      Seg fault looks like this:
>>      mlx5: donsl202: got completion with error:
>>      00000000 00000000 00000000 00000000
>>      00000000 00000000 00000000 00000000
>>      00000001 00000000 00000000 00000000
>>      00000000 9d005304 0800011b 0008d0d2
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
>> 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>> changed to: IBV_QPS_ERR
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
>> 0x7f079c01d170, Request 0x139670660105216 (5): Work Request Flushed
>> Error
>>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>> changed to: IBV_QPS_ERR
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
>> 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed
>> Error
>>      rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
>> changed to: IBV_QPS_ERR
>>      rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
>> 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed
>> Error
>>      Segmentation fault
>>
>>      Adds this to dmesg:
>>      [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid service
>> ID).
>>      [71561.866466] nvme nvme1: rdma connection establishment failed (-104)
>>      [71567.805288] reactor_7[9166]: segfault at 88 ip 00005630621e6580 sp
>> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>>      [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f 1f 44 00 00 41 81
>> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 <49> 8b 96 88
>> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
>>
>>
>>
>>      _______________________________________________
>>      SPDK mailing list
>>      SPDK(a)lists.01.org
>>      https://lists.01.org/mailman/listinfo/spdk
>>
>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-14 19:26 Gruher, Joseph R
  0 siblings, 0 replies; 17+ messages in thread
From: Gruher, Joseph R @ 2018-11-14 19:26 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10636 bytes --]

Sure, done.  Issue #500, do I win a prize? :)

https://github.com/spdk/spdk/issues/500

-Joe

> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
> Sent: Wednesday, November 14, 2018 11:16 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt seg fault
> 
> Thanks for the report Joe.  Could you file an issue in GitHub for this?
> 
> https://github.com/spdk/spdk/issues
> 
> Thanks,
> 
> -Jim
> 
> 
> On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-
> bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:
> 
>     Hi everyone-
> 
>     I'm running a dual socket Skylake server with P4510 NVMe and 100Gb
> Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.  SPDK version is
> 18.10, FIO version is 3.12.  I'm running the SPDK NVMeoF target and
> exercising it from an initiator system (similar config to the target but with
> 50Gb NIC) using FIO with the bdev plugin.  I find 128K sequential workloads
> reliably and immediately seg fault nvmf_tgt.  I can run 4KB random workloads
> without experiencing the seg fault, so the problem seems tied to the block
> size and/or IO pattern.  I can run the same IO pattern against a local PCIe
> device using SPDK without a problem, I only see the failure when running the
> NVMeoF target with FIO running the IO pattern from an SPDK initiator system.
> 
>     Steps to reproduce and seg fault output follow below.
> 
>     Start the target:
>     sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r
> /var/tmp/spdk1.sock
> 
>     Configure the target:
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1 -t pcie -
> a 0000:1a:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2 -t pcie -
> a 0000:1b:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3 -t pcie -
> a 0000:1c:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4 -t pcie -
> a 0000:1d:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5 -t pcie -
> a 0000:3d:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6 -t pcie -
> a 0000:3e:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7 -t pcie -
> a 0000:3f:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8 -t pcie -
> a 0000:40:00.0
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n raid1 -s 4 -r 0
> -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l1
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l2
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l3
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l4
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l5
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l6
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l7
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l8
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l9
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l10
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l11
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l12
> 1200000
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn1 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn2 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn3 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn4 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn5 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn6 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn7 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn8 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn9 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn10 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn11 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-
> 11.io.spdk:nqn12 -a
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn1 store1/l1
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn2 store1/l2
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn3 store1/l3
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn4 store1/l4
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn5 store1/l5
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn6 store1/l6
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn7 store1/l7
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn8 store1/l8
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn9 store1/l9
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn10 store1/l10
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn11 store1/l11
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-
> 11.io.spdk:nqn12 store1/l12
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
>     sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener
> nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
> 
>     FIO file on initiator:
>     [global]
>     rw=rw
>     rwmixread=100
>     numjobs=1
>     iodepth=32
>     bs=128k
>     direct=1
>     thread=1
>     time_based=1
>     ramp_time=10
>     runtime=10
>     ioengine=spdk_bdev
>     spdk_conf=/home/don/fio/nvmeof.conf
>     group_reporting=1
>     unified_rw_reporting=1
>     exitall=1
>     randrepeat=0
>     norandommap=1
>     cpus_allowed_policy=split
>     cpus_allowed=1-2
>     [job1]
>     filename=b0n1
> 
>     Config file on initiator:
>     [Nvme]
>     TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420 subnqn:nqn.2018-
> 11.io.spdk:nqn1 adrfam:IPv4" b0
> 
>     Run FIO on initiator and nvmf_tgt seg faults immediately:
>     sudo
> LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugi
> n fio sr.ini
> 
>     Seg fault looks like this:
>     mlx5: donsl202: got completion with error:
>     00000000 00000000 00000000 00000000
>     00000000 00000000 00000000 00000000
>     00000001 00000000 00000000 00000000
>     00000000 9d005304 0800011b 0008d0d2
>     rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
> 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
>     rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> changed to: IBV_QPS_ERR
>     rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
> 0x7f079c01d170, Request 0x139670660105216 (5): Work Request Flushed
> Error
>     rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> changed to: IBV_QPS_ERR
>     rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
> 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed
> Error
>     rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1
> changed to: IBV_QPS_ERR
>     rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ
> 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed
> Error
>     Segmentation fault
> 
>     Adds this to dmesg:
>     [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid service
> ID).
>     [71561.866466] nvme nvme1: rdma connection establishment failed (-104)
>     [71567.805288] reactor_7[9166]: segfault at 88 ip 00005630621e6580 sp
> 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
>     [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f 1f 44 00 00 41 81
> f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 <49> 8b 96 88
> 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
> 
> 
> 
>     _______________________________________________
>     SPDK mailing list
>     SPDK(a)lists.01.org
>     https://lists.01.org/mailman/listinfo/spdk
> 
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] nvmf_tgt seg fault
@ 2018-11-14 19:16 Harris, James R
  0 siblings, 0 replies; 17+ messages in thread
From: Harris, James R @ 2018-11-14 19:16 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 9609 bytes --]

Thanks for the report Joe.  Could you file an issue in GitHub for this?

https://github.com/spdk/spdk/issues

Thanks,

-Jim


On 11/14/18, 12:14 PM, "SPDK on behalf of Gruher, Joseph R" <spdk-bounces(a)lists.01.org on behalf of joseph.r.gruher(a)intel.com> wrote:

    Hi everyone-
    
    I'm running a dual socket Skylake server with P4510 NVMe and 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.  SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK NVMeoF target and exercising it from an initiator system (similar config to the target but with 50Gb NIC) using FIO with the bdev plugin.  I find 128K sequential workloads reliably and immediately seg fault nvmf_tgt.  I can run 4KB random workloads without experiencing the seg fault, so the problem seems tied to the block size and/or IO pattern.  I can run the same IO pattern against a local PCIe device using SPDK without a problem, I only see the failure when running the NVMeoF target with FIO running the IO pattern from an SPDK initiator system.
    
    Steps to reproduce and seg fault output follow below.
    
    Start the target:
    sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r /var/tmp/spdk1.sock
    
    Configure the target:
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1 -t pcie -a 0000:1a:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2 -t pcie -a 0000:1b:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3 -t pcie -a 0000:1c:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4 -t pcie -a 0000:1d:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5 -t pcie -a 0000:3d:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6 -t pcie -a 0000:3e:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7 -t pcie -a 0000:3f:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8 -t pcie -a 0000:40:00.0
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l1 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l2 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l3 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l4 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l5 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l6 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l7 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l8 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l9 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l10 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l11 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l12 1200000
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn1 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn2 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn3 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn4 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn5 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn6 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn7 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn8 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn9 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn10 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn11 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn12 -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn1 store1/l1
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn2 store1/l2
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn3 store1/l3
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn4 store1/l4
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn5 store1/l5
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn6 store1/l6
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn7 store1/l7
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn8 store1/l8
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn9 store1/l9
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn10 store1/l10
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn11 store1/l11
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn12 store1/l12
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
    
    FIO file on initiator:
    [global]
    rw=rw
    rwmixread=100
    numjobs=1
    iodepth=32
    bs=128k
    direct=1
    thread=1
    time_based=1
    ramp_time=10
    runtime=10
    ioengine=spdk_bdev
    spdk_conf=/home/don/fio/nvmeof.conf
    group_reporting=1
    unified_rw_reporting=1
    exitall=1
    randrepeat=0
    norandommap=1
    cpus_allowed_policy=split
    cpus_allowed=1-2
    [job1]
    filename=b0n1
    
    Config file on initiator:
    [Nvme]
    TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420 subnqn:nqn.2018-11.io.spdk:nqn1 adrfam:IPv4" b0
    
    Run FIO on initiator and nvmf_tgt seg faults immediately:
    sudo LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugin fio sr.ini
    
    Seg fault looks like this:
    mlx5: donsl202: got completion with error:
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
    00000001 00000000 00000000 00000000
    00000000 9d005304 0800011b 0008d0d2
    rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
    rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 changed to: IBV_QPS_ERR
    rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request Flushed Error
    rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 changed to: IBV_QPS_ERR
    rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed Error
    rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 changed to: IBV_QPS_ERR
    rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed Error
    Segmentation fault
    
    Adds this to dmesg:
    [71561.859644] nvme nvme1: Connect rejected: status 8 (invalid service ID).
    [71561.866466] nvme nvme1: rdma connection establishment failed (-104)
    [71567.805288] reactor_7[9166]: segfault at 88 ip 00005630621e6580 sp 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
    [71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f 1f 44 00 00 41 81 f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 <49> 8b 96 88 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
    
    
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [SPDK] nvmf_tgt seg fault
@ 2018-11-14 19:14 Gruher, Joseph R
  0 siblings, 0 replies; 17+ messages in thread
From: Gruher, Joseph R @ 2018-11-14 19:14 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8681 bytes --]

Hi everyone-

I'm running a dual socket Skylake server with P4510 NVMe and a 100Gb Mellanox CX4 NIC.  OS is Ubuntu 18.04 with kernel 4.18.16.  SPDK version is 18.10, FIO version is 3.12.  I'm running the SPDK NVMeoF target and exercising it from an initiator system (similar config to the target but with a 50Gb NIC) using FIO with the bdev plugin.  I find that 128K sequential workloads reliably and immediately seg fault nvmf_tgt.  I can run 4KB random workloads without experiencing the seg fault, so the problem seems tied to the block size and/or IO pattern.  I can run the same IO pattern against a local PCIe device using SPDK without a problem; I only see the failure when running the NVMeoF target with FIO driving the IO pattern from an SPDK initiator system.

Steps to reproduce and seg fault output follow below.

Start the target:
sudo ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r /var/tmp/spdk1.sock

Configure the target:
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d1 -t pcie -a 0000:1a:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d2 -t pcie -a 0000:1b:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d3 -t pcie -a 0000:1c:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d4 -t pcie -a 0000:1d:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d5 -t pcie -a 0000:3d:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d6 -t pcie -a 0000:3e:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d7 -t pcie -a 0000:3f:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_nvme_bdev -b d8 -t pcie -a 0000:40:00.0
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_raid_bdev -n raid1 -s 4 -r 0 -b "d1n1 d2n1 d3n1 d4n1 d5n1 d6n1 d7n1 d8n1"
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_store raid1 store1
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l1 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l2 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l3 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l4 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l5 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l6 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l7 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l8 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l9 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l10 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l11 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock construct_lvol_bdev -l store1 l12 1200000
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn1 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn2 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn3 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn4 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn5 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn6 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn7 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn8 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn9 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn10 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn11 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn12 -a
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn1 store1/l1
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn2 store1/l2
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn3 store1/l3
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn4 store1/l4
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn5 store1/l5
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn6 store1/l6
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn7 store1/l7
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn8 store1/l8
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn9 store1/l9
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn10 store1/l10
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn11 store1/l11
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn12 store1/l12
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn1 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn2 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn3 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn4 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn5 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn6 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn7 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn8 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn9 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn10 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn11 -t rdma -a 10.5.0.202 -s 4420
sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn12 -t rdma -a 10.5.0.202 -s 4420
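
The twelve nvmf_subsystem_create / nvmf_subsystem_add_ns / nvmf_subsystem_add_listener calls above differ only in the index, so the same configuration can also be scripted with a short shell loop (a sketch, assuming the same rpc.py path, socket, lvol names and listener address as above):

for i in $(seq 1 12); do
    # create subsystem nqn$i, attach lvol l$i as its namespace, expose it over RDMA
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_create nqn.2018-11.io.spdk:nqn$i -a
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_ns nqn.2018-11.io.spdk:nqn$i store1/l$i
    sudo ./rpc.py -s /var/tmp/spdk1.sock nvmf_subsystem_add_listener nqn.2018-11.io.spdk:nqn$i -t rdma -a 10.5.0.202 -s 4420
done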

FIO file on initiator:
[global]
rw=rw
rwmixread=100
numjobs=1
iodepth=32
bs=128k
direct=1
thread=1
time_based=1
ramp_time=10
runtime=10
ioengine=spdk_bdev
spdk_conf=/home/don/fio/nvmeof.conf
group_reporting=1
unified_rw_reporting=1
exitall=1
randrepeat=0
norandommap=1
cpus_allowed_policy=split
cpus_allowed=1-2
[job1]
filename=b0n1

Config file on initiator:
[Nvme]
TransportID "trtype:RDMA traddr:10.5.0.202 trsvcid:4420 subnqn:nqn.2018-11.io.spdk:nqn1 adrfam:IPv4" b0

Run FIO on the initiator and nvmf_tgt seg faults immediately:
sudo LD_PRELOAD=/home/don/install/spdk/examples/bdev/fio_plugin/fio_plugin fio sr.ini

Seg fault looks like this:
mlx5: donsl202: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 9d005304 0800011b 0008d0d2
rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660105216 (4): local protection error
rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 changed to: IBV_QPS_ERR
rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660105216 (5): Work Request Flushed Error
rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 changed to: IBV_QPS_ERR
rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed Error
rdma.c: 501:spdk_nvmf_rdma_set_ibv_state: *NOTICE*: IBV QP#1 changed to: IBV_QPS_ERR
rdma.c:2698:spdk_nvmf_rdma_poller_poll: *WARNING*: CQ error on CQ 0x7f079c01d170, Request 0x139670660106280 (5): Work Request Flushed Error
Segmentation fault

Adds this to dmesg:
[71561.859644] nvme nvme1: Connect rejected: status 8 (invalid service ID).
[71561.866466] nvme nvme1: rdma connection establishment failed (-104)
[71567.805288] reactor_7[9166]: segfault at 88 ip 00005630621e6580 sp 00007f07af5fc400 error 4 in nvmf_tgt[563062194000+df000]
[71567.805293] Code: 48 8b 30 e8 82 f7 ff ff e9 7d fe ff ff 0f 1f 44 00 00 41 81 f9 80 00 00 00 75 37 49 8b 07 4c 8b 70 40 48 c7 40 50 00 00 00 00 <49> 8b 96 88 00 00 00 48 89 50 58 49 8b 96 88 00 00 00 48 89 02 48
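
A next step for turning a crash like this into a usable backtrace would be to rebuild SPDK with debug symbols (./configure --enable-debug) and run the target under gdb, or with core dumps enabled, before reproducing from the initiator; a minimal sketch of the commands, assuming the same binary path and options as above:

ulimit -c unlimited        # allow a core file if the target is run outside gdb
sudo gdb --args ~/install/spdk/app/nvmf_tgt/nvmf_tgt -m 0x0000F0 -r /var/tmp/spdk1.sock
(gdb) run
# ... start the FIO job on the initiator and wait for the SIGSEGV ...
(gdb) bt full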




^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-11-27 21:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-20 14:19 [SPDK] nvmf_tgt seg fault Luse, Paul E
  -- strict thread matches above, loose matches on Subject: below --
2018-11-27 21:45 Howell, Seth
2018-11-27 18:05 Luse, Paul E
2018-11-27 17:34 Howell, Seth
2018-11-27  9:07 Sasha Kotchubievsky
2018-11-27  0:19 Harris, James R
2018-11-22 17:55 Evgenii Kochetov
2018-11-22 12:21 Sasha Kotchubievsky
2018-11-21 19:28 Gruher, Joseph R
2018-11-21 17:03 Walker, Benjamin
2018-11-20  7:57 Sasha Kotchubievsky
2018-11-18 21:57 Gruher, Joseph R
2018-11-18 19:19 Luse, Paul E
2018-11-18 16:52 Sasha Kotchubievsky
2018-11-14 19:26 Gruher, Joseph R
2018-11-14 19:16 Harris, James R
2018-11-14 19:14 Gruher, Joseph R
