linux-nvme.lists.infradead.org archive mirror
From: Selvin Xavier <selvin.xavier@broadcom.com>
To: Yi Zhang <yi.zhang@redhat.com>
Cc: sagi@grimberg.me, linux-nvme@lists.infradead.org
Subject: Re: Bug report: NVMeoF RDMA bnxt_roce: I/O 1 QID 0 timeout
Date: Tue, 12 Nov 2019 10:21:22 +0530
Message-ID: <CA+sbYW0LqFwsqgQ92AU83c=5VWiEjx+OCuQan+mYN2LYBXnGkg@mail.gmail.com>
In-Reply-To: <661008349.11145565.1573528695548.JavaMail.zimbra@redhat.com>

Hi Yi Zhang,
Thanks for reporting the issue. It looks like the firmware (FW) on the
bnxt_re side stops responding once the issue occurs, and all subsequent
commands fail.
I will get in touch with you to retrieve the FW information and
investigate further. We might need a FW upgrade too.

Thanks,
Selvin


On Tue, Nov 12, 2019 at 8:48 AM Yi Zhang <yi.zhang@redhat.com> wrote:
>
> Hello
>
> I would like to report an NVMeoF RDMA I/O timeout issue. Here are the reproducer and kernel log; let me know if you need more info.
>
> Reproducer:
> 1. Set up the NVMeoF RDMA RoCE environment
>    target: qedr roce
>    client: bnxt roce
>
> 2. client: connect target
> 3. client: do fio stress
>
> HW info:
> target:
> # lspci | grep -i ql
> 19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
> 19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
> 19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
> 19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
> client:
> # lspci  | grep -i bro
> 01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
> 01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
> 18:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
> 19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
> 19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
>
>
> kernel log:
> client:
> [   65.974085] nvme nvme2: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.45.186:4420
> [   65.974592] nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> [   66.027510] nvme nvme2: creating 1 I/O queues.
> [   66.053178] nvme nvme2: mapped 1/0/0 default/read/poll queues.
> [   66.053802] nvme nvme2: new ctrl: NQN "testnqn", addr 172.31.45.186:4420
> [   66.054465] nvme2n1: detected capacity change from 0 to 1600321314816
> [   81.926738] nvme nvme2: I/O 1 QID 0 timeout
> [   81.930943] nvme nvme2: starting error recovery
> [   88.070471] nvme nvme2: I/O 1 QID 0 timeout
> [   93.702284] nvme nvme2: I/O 1 QID 0 timeout
> [   98.310013] nvme nvme2: I/O 49 QID 1 timeout
> [   98.314297] nvme nvme2: I/O 50 QID 1 timeout
> [   98.318575] nvme nvme2: I/O 51 QID 1 timeout
> [   98.322855] nvme nvme2: I/O 52 QID 1 timeout
> [   98.327135] nvme nvme2: I/O 54 QID 1 timeout
> [   98.331419] nvme nvme2: I/O 55 QID 1 timeout
> [   98.335699] nvme nvme2: I/O 56 QID 1 timeout
> [   98.339984] nvme nvme2: I/O 57 QID 1 timeout
> [   98.344259] nvme nvme2: I/O 65 QID 1 timeout
> [   98.348531] nvme nvme2: I/O 66 QID 1 timeout
> [   98.352805] nvme nvme2: I/O 67 QID 1 timeout
> [   98.357086] nvme nvme2: I/O 68 QID 1 timeout
> [   98.361369] nvme nvme2: I/O 69 QID 1 timeout
> [   98.365648] nvme nvme2: I/O 70 QID 1 timeout
> [   98.369928] nvme nvme2: I/O 71 QID 1 timeout
> [   98.374203] nvme nvme2: I/O 72 QID 1 timeout
> [   98.378484] nvme nvme2: I/O 73 QID 1 timeout
> [   98.382764] nvme nvme2: I/O 74 QID 1 timeout
> [   98.387035] nvme nvme2: I/O 75 QID 1 timeout
> [   98.391310] nvme nvme2: I/O 76 QID 1 timeout
> [   98.395590] nvme nvme2: I/O 77 QID 1 timeout
> [   98.399864] nvme nvme2: I/O 78 QID 1 timeout
> [   98.404141] nvme nvme2: I/O 79 QID 1 timeout
> [   98.408415] nvme nvme2: I/O 80 QID 1 timeout
> [   98.412686] nvme nvme2: I/O 81 QID 1 timeout
> [   98.416961] nvme nvme2: I/O 82 QID 1 timeout
> [   98.421232] nvme nvme2: I/O 86 QID 1 timeout
> [   98.425503] nvme nvme2: I/O 87 QID 1 timeout
> [   98.429776] nvme nvme2: I/O 94 QID 1 timeout
> [   98.434050] nvme nvme2: I/O 95 QID 1 timeout
> [   98.438323] nvme nvme2: I/O 96 QID 1 timeout
> [   98.442596] nvme nvme2: I/O 97 QID 1 timeout
> [   98.446867] nvme nvme2: I/O 98 QID 1 timeout
> [   98.451141] nvme nvme2: I/O 99 QID 1 timeout
> [   98.455412] nvme nvme2: I/O 100 QID 1 timeout
> [   98.459772] nvme nvme2: I/O 101 QID 1 timeout
> [   98.464131] nvme nvme2: I/O 102 QID 1 timeout
> [   98.468490] nvme nvme2: I/O 103 QID 1 timeout
> [   98.472849] nvme nvme2: I/O 104 QID 1 timeout
> [   98.477208] nvme nvme2: I/O 105 QID 1 timeout
> [   98.481568] nvme nvme2: I/O 106 QID 1 timeout
> [   98.485926] nvme nvme2: I/O 109 QID 1 timeout
> [   98.490286] nvme nvme2: I/O 111 QID 1 timeout
> [   98.494644] nvme nvme2: I/O 112 QID 1 timeout
> [   98.499004] nvme nvme2: I/O 113 QID 1 timeout
> [   98.503363] nvme nvme2: I/O 115 QID 1 timeout
> [   98.507721] nvme nvme2: I/O 116 QID 1 timeout
> [   98.512080] nvme nvme2: I/O 117 QID 1 timeout
> [   98.516441] nvme nvme2: I/O 118 QID 1 timeout
> [   98.520807] nvme nvme2: I/O 119 QID 1 timeout
> [   98.525167] nvme nvme2: I/O 120 QID 1 timeout
> [   98.529526] nvme nvme2: I/O 121 QID 1 timeout
> [   98.533885] nvme nvme2: I/O 122 QID 1 timeout
> [   98.538245] nvme nvme2: I/O 123 QID 1 timeout
> [   98.542603] nvme nvme2: I/O 124 QID 1 timeout
> [   98.546963] nvme nvme2: I/O 126 QID 1 timeout
> [   99.846073] nvme nvme2: I/O 1 QID 0 timeout
> [  102.405967] bnxt_en 0000:19:00.0: QPLIB: cmdq[0x41d]=0x3 timedout (20000)msec
> [  102.413114] infiniband bnxt_re0: Failed to modify HW QP
> [  102.418367] bnxt_en 0000:19:00.0: QPLIB: cmdq[0x0]=0x3 send failed
> [  102.424551] infiniband bnxt_re0: Failed to modify HW QP
> [  102.429779] ------------[ cut here ]------------
> [  102.434399] failed to drain send queue: -110
> [  102.438730] WARNING: CPU: 43 PID: 780 at drivers/infiniband/core/verbs.c:2656 __ib_drain_sq+0x143/0x190 [ib_core]
> [  102.448985] Modules linked in: nvme_rdma nvme_fabrics nvmet_rdma nvmet 8021q garp mrp stp llc ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp vfat fat mlx5_ib opa_vnic ib_umad ib_ipoib rpcrdma intel_rapl_msr sunrpc intel_rapl_common rdma_ucm ib_iser libiscsi isst_if_common scsi_transport_iscsi iTCO_wdt iTCO_vendor_support skx_edac dcdbas nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif iw_cxgb4 kvm irqbypass rdma_cm iw_cm ib_cm libcxgb crct10dif_pclmul crc32_pclmul hfi1 bnxt_re rdmavt ghash_clmulni_intel intel_cstate ib_uverbs ib_core intel_uncore dell_smbios intel_rapl_perf wmi_bmof dell_wmi_descriptor sg pcspkr mei_me ipmi_si i2c_i801 lpc_ich mei ipmi_devintf ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit drm_vram_helper ttm mlx5_core drm ahci csiostor cxgb4 libahci crc32c_intel nvme mlxfw bnxt_en nvme_core libata megaraid_sas
> [  102.449018]  scsi_transport_fc tg3 pci_hyperv_intf wmi dm_mirror dm_region_hash dm_log dm_mod
> [  102.543833] CPU: 43 PID: 780 Comm: kworker/u98:3 Kdump: loaded Not tainted 5.4.0-0.rc6.2.elrdy.x86_64 #1
> [  102.553303] Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 2.2.11 06/13/2019
> [  102.560872] Workqueue: nvme-wq nvme_rdma_error_recovery_work [nvme_rdma]
> [  102.567572] RIP: 0010:__ib_drain_sq+0x143/0x190 [ib_core]
> [  102.572968] Code: 00 00 00 48 89 df e8 8c 28 7f f4 48 85 c0 74 e1 e9 69 ff ff ff 89 c6 48 c7 c7 48 ce 51 c1 c6 05 cd 0c 04 00 01 e8 56 2f fe f3 <0f> 0b e9 4d ff ff ff 80 3d b9 0c 04 00 00 0f 85 40 ff ff ff 89 c6
> [  102.591711] RSP: 0018:ffffbb79c7883cf8 EFLAGS: 00010282
> [  102.596940] RAX: 0000000000000000 RBX: ffff9ed9bdac3818 RCX: 0000000000000000
> [  102.604072] RDX: ffff9ee60fb66900 RSI: ffff9ee60fb56b88 RDI: ffff9ee60fb56b88
> [  102.611204] RBP: ffff9ee60b1d8400 R08: 000000000000079a R09: 000000000000002b
> [  102.618334] R10: 0000000000000000 R11: ffffbb79c7883ba0 R12: ffffbb79c7883d28
> [  102.625458] R13: ffffbb79c7883d00 R14: ffff9eda44b0fbc0 R15: ffff9ee5ff4000b0
> [  102.632584] FS:  0000000000000000(0000) GS:ffff9ee60fb40000(0000) knlGS:0000000000000000
> [  102.640669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  102.646415] CR2: 00007faba43bcec8 CR3: 0000000f4fa0a004 CR4: 00000000007606e0
> [  102.653547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  102.660680] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  102.667812] PKRU: 55555554
> [  102.670525] Call Trace:
> [  102.672987]  ? wb_over_bg_thresh+0x20c/0x220
> [  102.677265]  ib_drain_qp+0xe/0x20 [ib_core]
> [  102.681448]  nvme_rdma_teardown_io_queues.part.35+0x4e/0xa0 [nvme_rdma]
> [  102.688057]  nvme_rdma_error_recovery_work+0x35/0x90 [nvme_rdma]
> [  102.694064]  process_one_work+0x1a1/0x360
> [  102.698074]  worker_thread+0x30/0x380
> [  102.701742]  ? pwq_unbound_release_workfn+0xd0/0xd0
> [  102.706622]  kthread+0x112/0x130
> [  102.709851]  ? __kthread_parkme+0x70/0x70
> [  102.713867]  ret_from_fork+0x35/0x40
> [  102.717444] ---[ end trace b9af855765aeac25 ]---
> [  102.722074] bnxt_en 0000:19:00.0: QPLIB: cmdq[0x0]=0x3 send failed
> [  102.728271] infiniband bnxt_re0: Failed to modify HW QP
>
> target:
> [   44.362119] nvmet: adding nsid 1 to subsystem testnqn
> [   44.362662] nvmet_rdma: enabling port 2 (172.31.45.186:4420)
> [   61.722470] SubnSet(OPA_PortInfo) smlid 0x4
> [   62.127400] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:07212bc2-57e0-4729-9377-4608892fed05.
> [   62.180752] nvmet: creating controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:07212bc2-57e0-4729-9377-4608892fed05.
> [   80.914135] [qedr_poll_cq_req:3816(qedr0)]Error: POLL CQ with ROCE_CQE_REQ_STS_TRANSPORT_RETRY_CNT_ERR. CQ icid=0x1, QP icid=0x2
> [   92.839414] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [   92.845445] nvmet: ctrl 1 fatal error occurred!
>
> Best Regards,
>   Yi Zhang
>
>
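
For anyone triaging a similar client log, the pattern quoted above (NVMe
I/O timeouts first, then a QPLIB command timeout, then "send failed"
messages) can be summarized with a small script. This is only a rough
sketch, not a Broadcom or nvme-cli tool: the function name summarize(),
the regular expressions, and the returned keys are my own assumptions,
written against the message formats seen in this particular log. It does
not diagnose the firmware state; it only counts and orders what the
kernel printed.

```python
import re
from collections import Counter

# Patterns are assumptions based on the message formats in the quoted log.
NVME_TIMEOUT = re.compile(
    r"\[\s*(\d+\.\d+)\] nvme (\w+): I/O (\d+) QID (\d+) timeout"
)
FW_TIMEOUT = re.compile(
    r"\[\s*(\d+\.\d+)\] bnxt_en \S+ QPLIB: "
    r"cmdq\[0x[0-9a-f]+\]=0x[0-9a-f]+ timedout \((\d+)\)msec"
)
FW_SEND_FAILED = re.compile(
    r"QPLIB: cmdq\[0x[0-9a-f]+\]=0x[0-9a-f]+ send failed"
)


def summarize(log_text):
    """Count NVMe I/O timeouts per queue and collect bnxt QPLIB failures."""
    per_qid = Counter()
    first_timeout = None
    fw_timeouts = []  # (timestamp in seconds, reported timeout in msec)
    send_failed = 0
    for line in log_text.splitlines():
        # search() rather than match() so a leading "> " quote prefix is fine
        m = NVME_TIMEOUT.search(line)
        if m:
            per_qid[int(m.group(4))] += 1
            if first_timeout is None:
                first_timeout = float(m.group(1))
            continue
        m = FW_TIMEOUT.search(line)
        if m:
            fw_timeouts.append((float(m.group(1)), int(m.group(2))))
            continue
        if FW_SEND_FAILED.search(line):
            send_failed += 1
    return {
        "timeouts_per_qid": dict(per_qid),
        "first_nvme_timeout": first_timeout,
        "fw_cmd_timeouts": fw_timeouts,
        "fw_send_failed": send_failed,
    }


# A few lines copied from the client log in the report.
SAMPLE = """\
[   81.926738] nvme nvme2: I/O 1 QID 0 timeout
[   81.930943] nvme nvme2: starting error recovery
[   88.070471] nvme nvme2: I/O 1 QID 0 timeout
[   98.310013] nvme nvme2: I/O 49 QID 1 timeout
[  102.405967] bnxt_en 0000:19:00.0: QPLIB: cmdq[0x41d]=0x3 timedout (20000)msec
[  102.413114] infiniband bnxt_re0: Failed to modify HW QP
[  102.418367] bnxt_en 0000:19:00.0: QPLIB: cmdq[0x0]=0x3 send failed
"""

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

Feeding it the full dmesg output instead of SAMPLE should show whether
the admin queue (QID 0) timed out before the I/O queues did, and how long
after the first timeout the QPLIB command timeout was logged.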

