* kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
       [not found] <1575599772.13802712.1492592011810.JavaMail.zimbra@redhat.com>
@ 2017-04-20  6:03     ` Yi Zhang
  0 siblings, 0 replies; 14+ messages in thread
From: Yi Zhang @ 2017-04-20  6:03 UTC (permalink / raw)
  To: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: sagi-NQWnxTmZq1alnMjI0IkVqw, maxg-VPRAkNaXOzVWk0Htik3J/w

Hi

I reproduced two different kernel NULL pointer dereferences during reset_controller operations with I/O on 4.11.0-rc7; here are the steps and the kernel logs, thanks.

Reproduction steps:
1. Configure NVMe over RDMA on the target
#nvmetcli restore rdma.json
2. Connect to the target on the client
#nvme connect-all -t rdma -a $IP -s 4420
3. Run reset_controller during I/O on the client:
#!/bin/bash
num=0
fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -bs_unaligned -runtime=300 -size=-group_reporting -name=mytest -numjobs=60 &
sleep 5
while [ $num -lt 50 ]
do
	echo 1 >/sys/block/nvme0n1/device/reset_controller
	[ $? -eq 1 ] && echo "reset_controller operation failed: $num" && exit 1
	((num++))
	sleep 0.5
done

RDMA devices:
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Kernel logs:
[1]
[ 5968.515237] DMAR: DRHD: handling fault status reg 2
[ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[ 5968.519450] 00000000 00000000 00000000 00000000
[ 5968.519451] 00000000 00000000 00000000 00000000
[ 5968.519451] 00000000 00000000 00000000 00000000
[ 5968.519452] 00000000 02005104 00000316 a71710e3
[ 5968.546797] DMAR: [DMA Read] Request device [05:00.0] fault addr ab978000 [fault reason 06] PTE Read access is not set
[ 5999.693035] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[ 5999.701799] IP: 0x1
[ 5999.704142] PGD 0 
[ 5999.704143] 
[ 5999.708052] Oops: 0010 [#1] SMP
[ 5999.711562] Modules linked in: sch_mqprio bridge 8021q garp mrp stp llc nvme_rdma nvme_fabrics nvme_core rpcrdma ib_isert iscsi_target_mod ib_isee
[ 5999.791200]  drm tg3 devlink ahci libahci ptp libata crc32c_intel i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 5999.803558] CPU: 16 PID: 3839 Comm: kworker/16:1H Not tainted 4.11.0-rc7+ #1
[ 5999.811440] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[ 5999.819812] Workqueue: kblockd blk_mq_timeout_work
[ 5999.825848] task: ffff93c9258f8000 task.stack: ffffbc26a13f8000
[ 5999.833113] RIP: 0010:0x1
[ 5999.836684] RSP: 0018:ffffbc26a13fbca8 EFLAGS: 00010202
[ 5999.843170] RAX: ffff93c91a776000 RBX: ffff93c917a33600 RCX: ffff93ca3fa00000
[ 5999.851789] RDX: ffffbc26a13fbcb0 RSI: ffffbc26a13fbcb8 RDI: ffff93c91a777c00
[ 5999.860395] RBP: ffffbc26a13fbd08 R08: 000000000000ffff R09: 0000000000000000
[ 5999.869001] R10: 00000574e9925428 R11: 0000000000000020 R12: ffff93c92e22cbd0
[ 5999.877616] R13: ffff93da3e3fc000 R14: ffff93c9facd5000 R15: ffff93c9179efc00
[ 5999.886235] FS:  0000000000000000(0000) GS:ffff93ca3fa00000(0000) knlGS:0000000000000000
[ 5999.895934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5999.903021] CR2: 0000000000000001 CR3: 0000000fd5e72000 CR4: 00000000001406e0
[ 5999.911676] Call Trace:
[ 5999.915091]  ? nvme_rdma_unmap_data+0x126/0x1b0 [nvme_rdma]
[ 5999.922014]  nvme_rdma_complete_rq+0x1c/0xa0 [nvme_rdma]
[ 5999.928654]  __blk_mq_complete_request+0xb9/0x130
[ 5999.934615]  blk_mq_rq_timed_out+0x66/0x70
[ 5999.939900]  blk_mq_check_expired+0x37/0x60
[ 5999.945277]  bt_iter+0x48/0x50
[ 5999.949387]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 5999.955440]  ? blk_mq_rq_timed_out+0x70/0x70
[ 5999.960912]  ? blk_mq_rq_timed_out+0x70/0x70
[ 5999.966378]  blk_mq_timeout_work+0x88/0x170
[ 5999.971734]  process_one_work+0x165/0x410
[ 5999.976884]  worker_thread+0x137/0x4c0
[ 5999.981740]  kthread+0x109/0x140
[ 5999.986002]  ? rescuer_thread+0x3b0/0x3b0
[ 5999.991293]  ? kthread_park+0x90/0x90
[ 5999.996202]  ret_from_fork+0x2c/0x40
[ 6000.001023] Code:  Bad RIP value.
[ 6000.005395] RIP: 0x1 RSP: ffffbc26a13fbca8
[ 6000.010641] CR2: 0000000000000001
[ 6000.017674] ---[ end trace aefe12bb2d39bb6c ]---
[ 6000.025847] Kernel panic - not syncing: Fatal exception
[ 6000.032339] Kernel Offset: 0x7800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 6000.047129] ---[ end Kernel panic - not syncing: Fatal exception

[2]
[  181.885449] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.45.92:4420
[  182.051854] nvme nvme0: creating 40 I/O queues.
[  183.196669] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.45.92:4420
[  335.152533] DMAR: DRHD: handling fault status reg 2
[  335.155522] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  335.155523] 00000000 00000000 00000000 00000000
[  335.155523] 00000000 00000000 00000000 00000000
[  335.155524] 00000000 00000000 00000000 00000000
[  335.155524] 00000000 02005104 00000313 2d56a1e3
[  335.184087] DMAR: [DMA Read] Request device [05:00.0] fault addr afe64000 [fault reason 06] PTE Read access is not set
[  335.565825] nvme nvme0: creating 40 I/O queues.
[  335.946585] DMAR: DRHD: handling fault status reg 102
[  335.948848] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  335.948849] 00000000 00000000 00000000 00000000
[  335.948849] 00000000 00000000 00000000 00000000
[  335.948849] 00000000 00000000 00000000 00000000
[  335.948850] 00000000 02005104 0000033e 123982e2
[  335.978349] DMAR: [DMA Read] Request device [05:00.0] fault addr af0c6000 [fault reason 06] PTE Read access is not set
[  336.286112] nvme nvme0: creating 40 I/O queues.
[  336.976392] nvme nvme0: creating 40 I/O queues.
[  337.329610] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  337.335456] 00000000 00000000 00000000 00000000
[  337.340521] 00000000 00000000 00000000 00000000
[  337.345586] 00000000 00000000 00000000 00000000
[  337.350651] 00000000 93005204 0000038c 052a29e3
[  337.623917] nvme nvme0: creating 40 I/O queues.
[  338.286747] nvme nvme0: creating 40 I/O queues.
[  338.647457] DMAR: DRHD: handling fault status reg 202
[  338.649077] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
[  338.649078] 00000000 00000000 00000000 00000000
[  338.649079] 00000000 00000000 00000000 00000000
[  338.649079] 00000000 00000000 00000000 00000000
[  338.649080] 00000000 02005104 000003dc 096258e2
[  338.681899] DMAR: [DMA Read] Request device [05:00.0] fault addr adaf8000 [fault reason 06] PTE Read access is not set
[  339.003086] nvme nvme0: creating 40 I/O queues.
[  341.419403] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[  341.428698] IP: 0x1
[  341.431518] PGD 0 
[  341.431519] 
[  341.436353] Oops: 0010 [#1] SMP
[  341.440319] Modules linked in: sch_mqprio bridge 8021q garp mrp stp llc nvme_rdma nvme_fabrics nvme_core rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srpe
[  341.523752]  drm tg3 ahci devlink libahci ptp libata crc32c_intel i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  341.536724] CPU: 29 PID: 859 Comm: kworker/u82:2 Not tainted 4.11.0-rc7+ #1
[  341.545128] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[  341.554122] Workqueue: writeback wb_workfn (flush-259:0)
[  341.560683] task: ffff94c637484380 task.stack: ffffb6d08ed1c000
[  341.567928] RIP: 0010:0x1
[  341.571481] RSP: 0018:ffffb6d08ed1f670 EFLAGS: 00010282
[  341.577959] RAX: ffff94c63e6cb800 RBX: ffff94b53080cd20 RCX: 0000000000000001
[  341.586589] RDX: ffffb6d08ed1f678 RSI: ffff94c62c3413a8 RDI: ffff94c63e6ce400
[  341.595227] RBP: ffffb6d08ed1f6c0 R08: ffff94c62c3413a8 R09: 0000000000000000
[  341.603848] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94b4e5620fc0
[  341.612459] R13: ffff94c63a39c000 R14: 0000000000000002 R15: ffff94c62c341200
[  341.621069] FS:  0000000000000000(0000) GS:ffff94c63f380000(0000) knlGS:0000000000000000
[  341.630754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  341.637822] CR2: 0000000000000001 CR3: 00000008d5c09000 CR4: 00000000001406e0
[  341.646457] Call Trace:
[  341.649862]  ? nvme_rdma_post_send+0x9b/0x100 [nvme_rdma]
[  341.656577]  nvme_rdma_queue_rq+0x2fb/0x680 [nvme_rdma]
[  341.663085]  blk_mq_try_issue_directly+0xbb/0x110
[  341.668995]  blk_mq_make_request+0x354/0x620
[  341.674424]  generic_make_request+0x110/0x2c0
[  341.679947]  submit_bio+0x75/0x150
[  341.684400]  submit_bh_wbc+0x141/0x180
[  341.689244]  __block_write_full_page+0x13d/0x3b0
[  341.695068]  ? I_BDEV+0x20/0x20
[  341.699244]  ? I_BDEV+0x20/0x20
[  341.703411]  block_write_full_page+0xdc/0x100
[  341.708946]  blkdev_writepage+0x18/0x20
[  341.713898]  __writepage+0x13/0x40
[  341.718357]  write_cache_pages+0x26f/0x510
[  341.723593]  ? compound_head+0x20/0x20
[  341.728447]  generic_writepages+0x51/0x80
[  341.733600]  ? __wake_up_common+0x55/0x90
[  341.738753]  blkdev_writepages+0x2f/0x40
[  341.743795]  do_writepages+0x1e/0x30
[  341.748429]  __writeback_single_inode+0x45/0x330
[  341.754221]  writeback_sb_inodes+0x280/0x570
[  341.759616]  __writeback_inodes_wb+0x8c/0xc0
[  341.765015]  wb_writeback+0x276/0x310
[  341.769742]  wb_workfn+0x19c/0x3b0
[  341.774178]  process_one_work+0x165/0x410
[  341.779274]  worker_thread+0x137/0x4c0
[  341.784061]  kthread+0x109/0x140
[  341.788243]  ? rescuer_thread+0x3b0/0x3b0
[  341.793279]  ? kthread_park+0x90/0x90
[  341.797905]  ret_from_fork+0x2c/0x40
[  341.802414] Code:  Bad RIP value.
[  341.806612] RIP: 0x1 RSP: ffffb6d08ed1f670
[  341.811663] CR2: 0000000000000001
[  341.815833] ---[ end trace 9a64941b3df0eb88 ]---
[  341.878226] Kernel panic - not syncing: Fatal exception
[  341.884533] Kernel Offset: 0x12800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  341.917376] ---[ end Kernel panic - not syncing: Fatal exception


Best Regards,
  Yi Zhang



* Re: kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
  2017-04-20  6:03     ` Yi Zhang
@ 2017-04-20 16:21         ` Sagi Grimberg
  -1 siblings, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2017-04-20 16:21 UTC (permalink / raw)
  To: Yi Zhang, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w


> [1]
> [ 5968.515237] DMAR: DRHD: handling fault status reg 2
> [ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
> [ 5968.519450] 00000000 00000000 00000000 00000000
> [ 5968.519451] 00000000 00000000 00000000 00000000
> [ 5968.519451] 00000000 00000000 00000000 00000000
> [ 5968.519452] 00000000 02005104 00000316 a71710e3

Max, can you decode this for us?

* Re: kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
  2017-04-20 16:21         ` Sagi Grimberg
@ 2017-04-25 18:06             ` Leon Romanovsky
  -1 siblings, 0 replies; 14+ messages in thread
From: Leon Romanovsky @ 2017-04-25 18:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Yi Zhang, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w

On Thu, Apr 20, 2017 at 07:21:29PM +0300, Sagi Grimberg wrote:
>
> > [1]
> > [ 5968.515237] DMAR: DRHD: handling fault status reg 2
> > [ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
> > [ 5968.519450] 00000000 00000000 00000000 00000000
> > [ 5968.519451] 00000000 00000000 00000000 00000000
> > [ 5968.519451] 00000000 00000000 00000000 00000000
> > [ 5968.519452] 00000000 02005104 00000316 a71710e3
>
> Max, Can you decode this for us?

I'm not Max, and maybe he will shed more light on it. I didn't find such an
error in our documentation.

Thanks


* Re: kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
  2017-04-25 18:06             ` Leon Romanovsky
@ 2017-08-24 12:11                 ` Max Gurtovoy
  -1 siblings, 0 replies; 14+ messages in thread
From: Max Gurtovoy @ 2017-08-24 12:11 UTC (permalink / raw)
  To: Leon Romanovsky, Sagi Grimberg
  Cc: Yi Zhang, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



On 4/25/2017 9:06 PM, Leon Romanovsky wrote:
> On Thu, Apr 20, 2017 at 07:21:29PM +0300, Sagi Grimberg wrote:
>>
>>> [1]
>>> [ 5968.515237] DMAR: DRHD: handling fault status reg 2
>>> [ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
>>> [ 5968.519450] 00000000 00000000 00000000 00000000
>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>> [ 5968.519452] 00000000 02005104 00000316 a71710e3
>>
>> Max, Can you decode this for us?
>
> I'm not Max and maybe he will shed more light on it. I didn't find such
> error in our documentation.


Sorry for the late response.

Yi Zhang,
Is it still reproducible?

-Max.

>
> Thanks
>

* Re: kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
  2017-08-24 12:11                 ` Max Gurtovoy
@ 2017-08-25 12:10                     ` Yi Zhang
  -1 siblings, 0 replies; 14+ messages in thread
From: Yi Zhang @ 2017-08-25 12:10 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky, Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



On 08/24/2017 08:11 PM, Max Gurtovoy wrote:
>
>
> On 4/25/2017 9:06 PM, Leon Romanovsky wrote:
>> On Thu, Apr 20, 2017 at 07:21:29PM +0300, Sagi Grimberg wrote:
>>>
>>>> [1]
>>>> [ 5968.515237] DMAR: DRHD: handling fault status reg 2
>>>> [ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
>>>> [ 5968.519450] 00000000 00000000 00000000 00000000
>>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>>> [ 5968.519452] 00000000 02005104 00000316 a71710e3
>>>
>>> Max, Can you decode this for us?
>>
>> I'm not Max and maybe he will shed more light on it. I didn't find such
>> error in our documentation.
>
>
> Sorry for the late response.
>
> Yi Zhang,
> Is it still repro ?
>
Hi Max
The good news is that the NULL pointer can no longer be reproduced with
4.13.0-rc6.

But I found the below errors on the target and client side during the test.
Client side:
rdma-virt-03 login: [  927.033550] print_req_error: I/O error, dev 
nvme0n1, sector 140477384
[  927.033577] print_req_error: I/O error, dev nvme0n1, sector 271251016
[  927.033579] Buffer I/O error on dev nvme0n1, logical block 33906377, 
lost async page write
[  927.033583] Buffer I/O error on dev nvme0n1, logical block 33906378, 
lost async page write
[  927.033584] Buffer I/O error on dev nvme0n1, logical block 33906379, 
lost async page write
[  927.033585] Buffer I/O error on dev nvme0n1, logical block 33906380, 
lost async page write
[  927.033586] Buffer I/O error on dev nvme0n1, logical block 33906381, 
lost async page write
[  927.033586] Buffer I/O error on dev nvme0n1, logical block 33906382, 
lost async page write
[  927.033587] Buffer I/O error on dev nvme0n1, logical block 33906383, 
lost async page write
[  927.033588] Buffer I/O error on dev nvme0n1, logical block 33906384, 
lost async page write
[  927.033591] print_req_error: I/O error, dev nvme0n1, sector 271299456
[  927.033592] Buffer I/O error on dev nvme0n1, logical block 33912432, 
lost async page write
[  927.033593] Buffer I/O error on dev nvme0n1, logical block 33912433, 
lost async page write
[  927.033600] print_req_error: I/O error, dev nvme0n1, sector 271299664
[  927.033606] print_req_error: I/O error, dev nvme0n1, sector 271300200
[  927.033610] print_req_error: I/O error, dev nvme0n1, sector 271198824
[  927.033617] print_req_error: I/O error, dev nvme0n1, sector 271201256
[  927.033621] print_req_error: I/O error, dev nvme0n1, sector 271251224
[  927.033624] print_req_error: I/O error, dev nvme0n1, sector 271251280
[  927.033632] print_req_error: I/O error, dev nvme0n1, sector 271251696
[  957.561764] print_req_error: 243 callbacks suppressed
[  957.567643] print_req_error: I/O error, dev nvme0n1, sector 140682256
[  957.575049] buffer_io_error: 1965 callbacks suppressed
[  957.581006] Buffer I/O error on dev nvme0n1, logical block 17585282, 
lost async page write
[  957.590477] Buffer I/O error on dev nvme0n1, logical block 17585283, 
lost async page write
[  957.599946] Buffer I/O error on dev nvme0n1, logical block 17585284, 
lost async page write
[  957.609406] Buffer I/O error on dev nvme0n1, logical block 17585285, 
lost async page write
[  957.618874] Buffer I/O error on dev nvme0n1, logical block 17585286, 
lost async page write
[  957.628345] print_req_error: I/O error, dev nvme0n1, sector 140692416
[  957.635788] Buffer I/O error on dev nvme0n1, logical block 17586552, 
lost async page write
[  957.645290] Buffer I/O error on dev nvme0n1, logical block 17586553, 
lost async page write
[  957.654790] Buffer I/O error on dev nvme0n1, logical block 17586554, 
lost async page write
[  957.664292] print_req_error: I/O error, dev nvme0n1, sector 140693744
[  957.671767] Buffer I/O error on dev nvme0n1, logical block 17586718, 
lost async page write
[  957.681299] Buffer I/O error on dev nvme0n1, logical block 17586719, 
lost async page write
[  957.690833] print_req_error: I/O error, dev nvme0n1, sector 140697416
[  957.698345] print_req_error: I/O error, dev nvme0n1, sector 140697664
[  957.705855] print_req_error: I/O error, dev nvme0n1, sector 140698576
[  957.713367] print_req_error: I/O error, dev nvme0n1, sector 140699656
[  957.720877] print_req_error: I/O error, dev nvme0n1, sector 140701768
[  957.728390] print_req_error: I/O error, dev nvme0n1, sector 140702728
[  957.735902] print_req_error: I/O error, dev nvme0n1, sector 140705304
[  957.744235] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
[  957.750308] nvme nvme0: nvme_rdma_post_send failed with error code -12
[  957.757941] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
[  957.764030] nvme nvme0: Queueing INV WR for rkey 0x1a1d9f failed (-12)
[  957.771687] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
[  957.777799] nvme nvme0: nvme_rdma_post_send failed with error code -12
[  957.785465] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
[  957.791587] nvme nvme0: Queueing INV WR for rkey 0x1a1da0 failed (-12)
[  957.799262] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
[  957.805391] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
[  957.805396] nvme nvme0: nvme_rdma_post_send failed with error code -12
[  957.819307] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
[  957.819318] nvme nvme0: nvme_rdma_post_send failed with error code -12
[  957.833260] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
[  957.833268] nvme nvme0: Queueing INV WR for rkey 0x1a1da1 failed (-12)
[  957.847263] nvme nvme0: Queueing INV WR for rkey 0x1a1fa1 failed (-12)
[  957.855006] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
[  957.861254] nvme nvme0: nvme_rdma_post_send failed with error code -12
[  957.869004] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
[  957.875192] nvme nvme0: Queueing INV WR for rkey 0x1a1da2 failed (-12)
[  987.962014] print_req_error: 244 callbacks suppressed
[  987.968150] print_req_error: I/O error, dev nvme0n1, sector 140819704
[  987.975829] buffer_io_error: 1826 callbacks suppressed
[  987.982058] Buffer I/O error on dev nvme0n1, logical block 17602463, 
lost async page write
[  987.991803] Buffer I/O error on dev nvme0n1, logical block 17602464, 
lost async page write
[  988.001547] Buffer I/O error on dev nvme0n1, logical block 17602465, 
lost async page write

Target side:
[  875.657497] nvmet: creating controller 1 for subsystem testnqn for 
NQN 
nqn.2014-08.org.nvmexpress:NVMf:uuid:00000000-0000-0000-0000-000000000000.
[  878.243392] nvmet: adding queue 1 to ctrl 1.
[  878.248488] nvmet: adding queue 2 to ctrl 1.
[  878.253483] nvmet: adding queue 3 to ctrl 1.
[  878.258474] nvmet: adding queue 4 to ctrl 1.
[  878.263470] nvmet: adding queue 5 to ctrl 1.
[  878.268458] nvmet: adding queue 6 to ctrl 1.
[  878.273451] nvmet: adding queue 7 to ctrl 1.
[  878.278433] nvmet: adding queue 8 to ctrl 1.
[  878.283413] nvmet: adding queue 9 to ctrl 1.
[  878.288391] nvmet: adding queue 10 to ctrl 1.
[  878.293465] nvmet: adding queue 11 to ctrl 1.
[  878.298541] nvmet: adding queue 12 to ctrl 1.
[  878.303624] nvmet: adding queue 13 to ctrl 1.
[  878.308708] nvmet: adding queue 14 to ctrl 1.
[  878.313789] nvmet: adding queue 15 to ctrl 1.
[  878.318865] nvmet: adding queue 16 to ctrl 1.
[  878.323946] nvmet: adding queue 17 to ctrl 1.
[  878.329017] nvmet: adding queue 18 to ctrl 1.
[  878.334092] nvmet: adding queue 19 to ctrl 1.
[  878.339162] nvmet: adding queue 20 to ctrl 1.
[  878.344233] nvmet: adding queue 21 to ctrl 1.
[  878.349305] nvmet: adding queue 22 to ctrl 1.
[  878.354373] nvmet: adding queue 23 to ctrl 1.
[  878.359445] nvmet: adding queue 24 to ctrl 1.
[  878.364512] nvmet: adding queue 25 to ctrl 1.
[  878.369586] nvmet: adding queue 26 to ctrl 1.
[  878.374658] nvmet: adding queue 27 to ctrl 1.
[  878.379730] nvmet: adding queue 28 to ctrl 1.
[  878.384795] nvmet: adding queue 29 to ctrl 1.
[  878.389868] nvmet: adding queue 30 to ctrl 1.
[  878.394941] nvmet: adding queue 31 to ctrl 1.
[  878.400012] nvmet: adding queue 32 to ctrl 1.
[  878.405080] nvmet: adding queue 33 to ctrl 1.
[  878.410149] nvmet: adding queue 34 to ctrl 1.
[  878.415225] nvmet: adding queue 35 to ctrl 1.
[  878.420295] nvmet: adding queue 36 to ctrl 1.
[  878.425370] nvmet: adding queue 37 to ctrl 1.
[  878.430447] nvmet: adding queue 38 to ctrl 1.
[  878.435519] nvmet: adding queue 39 to ctrl 1.
[  878.440591] nvmet: adding queue 40 to ctrl 1.
[  890.970767] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  890.977684] nvmet: ctrl 1 fatal error occurred!
[  890.983943] nvmet_rdma: freeing queue 0
[  890.988444] nvmet_rdma: freeing queue 1
[  890.992945] nvmet_rdma: freeing queue 2
[  890.997433] nvmet_rdma: freeing queue 3
[  891.001901] nvmet_rdma: freeing queue 4
[  891.006348] nvmet_rdma: freeing queue 5
[  891.010775] nvmet_rdma: freeing queue 6
[  891.015221] nvmet_rdma: freeing queue 7
[  891.019660] nvmet_rdma: freeing queue 8
[  891.024114] nvmet_rdma: freeing queue 9
[  891.028583] nvmet_rdma: freeing queue 10
[  891.033136] nvmet_rdma: freeing queue 11
[  891.037713] nvmet_rdma: freeing queue 12
[  891.042274] nvmet_rdma: freeing queue 13
[  891.046891] nvmet_rdma: freeing queue 14
[  891.051468] nvmet_rdma: freeing queue 15
[  891.056208] nvmet_rdma: freeing queue 16
[  891.060840] nvmet_rdma: freeing queue 17
[  891.065587] nvmet_rdma: freeing queue 18
[  891.070148] nvmet_rdma: freeing queue 19
[  891.075200] nvmet_rdma: freeing queue 20
[  891.079790] nvmet_rdma: freeing queue 21
[  891.102153] nvmet_rdma: freeing queue 22
[  891.106731] nvmet_rdma: freeing queue 23
[  891.111296] nvmet_rdma: freeing queue 24
[  891.116936] nvmet_rdma: freeing queue 25
[  891.121504] nvmet_rdma: freeing queue 26
[  891.126070] nvmet_rdma: freeing queue 27
[  891.130611] nvmet_rdma: freeing queue 28
[  891.135161] nvmet_rdma: freeing queue 29
[  891.140823] nvmet_rdma: freeing queue 30
[  891.145380] nvmet_rdma: freeing queue 31
[  891.149952] nvmet_rdma: freeing queue 32
[  891.154499] nvmet_rdma: freeing queue 33
[  891.159070] nvmet_rdma: freeing queue 34
[  891.163620] nvmet_rdma: freeing queue 35
[  891.168175] nvmet_rdma: freeing queue 36
[  891.173410] nvmet_rdma: freeing queue 37
[  891.177949] nvmet_rdma: freeing queue 38
[  891.182508] nvmet_rdma: freeing queue 39
[  891.187055] nvmet_rdma: freeing queue 40

> -Max.
>
>>
>> Thanks
>>

* Re: kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
  2017-08-25 12:10                     ` Yi Zhang
@ 2017-08-25 22:57                         ` Max Gurtovoy
  -1 siblings, 0 replies; 14+ messages in thread
From: Max Gurtovoy @ 2017-08-25 22:57 UTC (permalink / raw)
  To: Yi Zhang, Leon Romanovsky, Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



On 8/25/2017 3:10 PM, Yi Zhang wrote:
>
>
> On 08/24/2017 08:11 PM, Max Gurtovoy wrote:
>>
>>
>> On 4/25/2017 9:06 PM, Leon Romanovsky wrote:
>>> On Thu, Apr 20, 2017 at 07:21:29PM +0300, Sagi Grimberg wrote:
>>>>
>>>>> [1]
>>>>> [ 5968.515237] DMAR: DRHD: handling fault status reg 2
>>>>> [ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
>>>>> [ 5968.519450] 00000000 00000000 00000000 00000000
>>>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>>>> [ 5968.519452] 00000000 02005104 00000316 a71710e3
>>>>
>>>> Max, Can you decode this for us?
>>>
>>> I'm not Max and maybe he will shed more light on it. I didn't find such
>>> error in our documentation.
>>
>>
>> Sorry for the late response.
>>
>> Yi Zhang,
>> Is it still repro ?
>>
> Hi Max
> The good news is the NULL pointer cannot be reproduced any more with
> 4.13.0-rc6.
>
> But I found bellow error on target and client side during the test.
> Client side:
> rdma-virt-03 login: [  927.033550] print_req_error: I/O error, dev
> nvme0n1, sector 140477384
> [  927.033577] print_req_error: I/O error, dev nvme0n1, sector 271251016
> [  927.033579] Buffer I/O error on dev nvme0n1, logical block 33906377,
> lost async page write
> [  927.033583] Buffer I/O error on dev nvme0n1, logical block 33906378,
> lost async page write
> [  927.033584] Buffer I/O error on dev nvme0n1, logical block 33906379,
> lost async page write
> [  927.033585] Buffer I/O error on dev nvme0n1, logical block 33906380,
> lost async page write
> [  927.033586] Buffer I/O error on dev nvme0n1, logical block 33906381,
> lost async page write
> [  927.033586] Buffer I/O error on dev nvme0n1, logical block 33906382,
> lost async page write
> [  927.033587] Buffer I/O error on dev nvme0n1, logical block 33906383,
> lost async page write
> [  927.033588] Buffer I/O error on dev nvme0n1, logical block 33906384,
> lost async page write
> [  927.033591] print_req_error: I/O error, dev nvme0n1, sector 271299456
> [  927.033592] Buffer I/O error on dev nvme0n1, logical block 33912432,
> lost async page write
> [  927.033593] Buffer I/O error on dev nvme0n1, logical block 33912433,
> lost async page write
> [  927.033600] print_req_error: I/O error, dev nvme0n1, sector 271299664
> [  927.033606] print_req_error: I/O error, dev nvme0n1, sector 271300200
> [  927.033610] print_req_error: I/O error, dev nvme0n1, sector 271198824
> [  927.033617] print_req_error: I/O error, dev nvme0n1, sector 271201256
> [  927.033621] print_req_error: I/O error, dev nvme0n1, sector 271251224
> [  927.033624] print_req_error: I/O error, dev nvme0n1, sector 271251280
> [  927.033632] print_req_error: I/O error, dev nvme0n1, sector 271251696
> [  957.561764] print_req_error: 243 callbacks suppressed
> [  957.567643] print_req_error: I/O error, dev nvme0n1, sector 140682256
> [  957.575049] buffer_io_error: 1965 callbacks suppressed
> [  957.581006] Buffer I/O error on dev nvme0n1, logical block 17585282,
> lost async page write
> [  957.590477] Buffer I/O error on dev nvme0n1, logical block 17585283,
> lost async page write
> [  957.599946] Buffer I/O error on dev nvme0n1, logical block 17585284,
> lost async page write
> [  957.609406] Buffer I/O error on dev nvme0n1, logical block 17585285,
> lost async page write
> [  957.618874] Buffer I/O error on dev nvme0n1, logical block 17585286,
> lost async page write
> [  957.628345] print_req_error: I/O error, dev nvme0n1, sector 140692416
> [  957.635788] Buffer I/O error on dev nvme0n1, logical block 17586552,
> lost async page write
> [  957.645290] Buffer I/O error on dev nvme0n1, logical block 17586553,
> lost async page write
> [  957.654790] Buffer I/O error on dev nvme0n1, logical block 17586554,
> lost async page write
> [  957.664292] print_req_error: I/O error, dev nvme0n1, sector 140693744
> [  957.671767] Buffer I/O error on dev nvme0n1, logical block 17586718,
> lost async page write
> [  957.681299] Buffer I/O error on dev nvme0n1, logical block 17586719,
> lost async page write
> [  957.690833] print_req_error: I/O error, dev nvme0n1, sector 140697416
> [  957.698345] print_req_error: I/O error, dev nvme0n1, sector 140697664
> [  957.705855] print_req_error: I/O error, dev nvme0n1, sector 140698576
> [  957.713367] print_req_error: I/O error, dev nvme0n1, sector 140699656
> [  957.720877] print_req_error: I/O error, dev nvme0n1, sector 140701768
> [  957.728390] print_req_error: I/O error, dev nvme0n1, sector 140702728
> [  957.735902] print_req_error: I/O error, dev nvme0n1, sector 140705304
> [  957.744235] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.750308] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.757941] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.764030] nvme nvme0: Queueing INV WR for rkey 0x1a1d9f failed (-12)
> [  957.771687] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.777799] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.785465] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.791587] nvme nvme0: Queueing INV WR for rkey 0x1a1da0 failed (-12)
> [  957.799262] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.805391] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.805396] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.819307] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.819318] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.833260] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.833268] nvme nvme0: Queueing INV WR for rkey 0x1a1da1 failed (-12)
> [  957.847263] nvme nvme0: Queueing INV WR for rkey 0x1a1fa1 failed (-12)
> [  957.855006] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.861254] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.869004] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.875192] nvme nvme0: Queueing INV WR for rkey 0x1a1da2 failed (-12)
> [  987.962014] print_req_error: 244 callbacks suppressed
> [  987.968150] print_req_error: I/O error, dev nvme0n1, sector 140819704
> [  987.975829] buffer_io_error: 1826 callbacks suppressed
> [  987.982058] Buffer I/O error on dev nvme0n1, logical block 17602463,
> lost async page write
> [  987.991803] Buffer I/O error on dev nvme0n1, logical block 17602464,
> lost async page write
> [  988.001547] Buffer I/O error on dev nvme0n1, logical block 17602465,
> lost async page write

I couldn't reproduce it, but for some reason you got an overflow in the
QP send queue. It looks like something might be wrong with the
calculation (probably the signaling calculation).
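
For reference, the error code -12 in the log above is -ENOMEM, i.e. the post_send call found no free slot in the send queue. Completions are only generated for signaled sends, and send-queue entries can only be reclaimed up to the last completed signaled WQE, so a signaling interval that is too large relative to the queue depth lets unsignaled work pile up until the queue overflows. Below is a minimal, hypothetical user-space sketch (not part of the patch; ilog2_u is just a stand-in for the kernel's ilog2()) that prints the old signaling interval next to the 32-capped interval the patch further down introduces:

#include <stdio.h>

/* Stand-in for the kernel's ilog2(): floor(log2(v)) for v > 0. */
static int ilog2_u(unsigned int v)
{
        int l = -1;

        while (v) {
                v >>= 1;
                l++;
        }
        return l;
}

int main(void)
{
        /* For each queue depth, show how often a send WR gets signaled. */
        for (int queue_size = 32; queue_size <= 1024; queue_size *= 2) {
                int old_interval = 1 << ilog2_u((queue_size + 1) / 2);
                int capped = old_interval < 32 ? old_interval : 32;

                printf("queue_size=%4d  old interval=%4d  capped interval=%2d\n",
                       queue_size, old_interval, capped);
        }
        return 0;
}

With a 128-entry queue, for example, the old code signals only every 64th send, while the capped version signals every 32nd, halving the worst-case backlog of unsignaled WQEs.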

Please supply more details:
1. Link layer?
2. HCA type + FW versions on the target/host sides?
3. Back-to-back (B2B) connection?
(A small libibverbs sketch for gathering items 1 and 2 follows below.)
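
As one way to collect items 1 and 2, here is a small, hypothetical libibverbs sketch (not something from this thread; build with -libverbs) that prints the link layer, firmware version, and maximum supported send-queue depth of each RDMA device:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        int num, i;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list)
                return 1;

        for (i = 0; i < num; i++) {
                struct ibv_context *ctx = ibv_open_device(list[i]);
                struct ibv_device_attr dev_attr;
                struct ibv_port_attr port_attr;

                if (!ctx)
                        continue;

                /* Query device-wide attributes (fw_ver, max_qp_wr) and port 1. */
                if (!ibv_query_device(ctx, &dev_attr) &&
                    !ibv_query_port(ctx, 1, &port_attr))
                        printf("%s: fw %s, link layer %s, max_qp_wr %d\n",
                               ibv_get_device_name(list[i]), dev_attr.fw_ver,
                               port_attr.link_layer == IBV_LINK_LAYER_ETHERNET ?
                                        "Ethernet (RoCE)" : "InfiniBand",
                               dev_attr.max_qp_wr);

                ibv_close_device(ctx);
        }

        ibv_free_device_list(list);
        return 0;
}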

Try this one as a first step:

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 82fcb07..1437306 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,6 +88,7 @@ struct nvme_rdma_queue {
         struct nvme_rdma_qe     *rsp_ring;
         atomic_t                sig_count;
         int                     queue_size;
+       int                     limit_mask;
         size_t                  cmnd_capsule_len;
         struct nvme_rdma_ctrl   *ctrl;
         struct nvme_rdma_device *device;
@@ -521,6 +522,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,

         queue->queue_size = queue_size;
         atomic_set(&queue->sig_count, 0);
+       queue->limit_mask = (min(32, 1 << ilog2((queue->queue_size + 1) / 2))) - 1;

         queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
                         RDMA_PS_TCP, IB_QPT_RC);
@@ -1009,9 +1011,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
   */
  static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
  {
-       int limit = 1 << ilog2((queue->queue_size + 1) / 2);
-
-       return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
+       return (atomic_inc_return(&queue->sig_count) & (queue->limit_mask)) == 0;
  }

  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,




^ permalink raw reply related	[flat|nested] 14+ messages in thread

* kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
@ 2017-08-25 22:57                         ` Max Gurtovoy
  0 siblings, 0 replies; 14+ messages in thread
From: Max Gurtovoy @ 2017-08-25 22:57 UTC (permalink / raw)




On 8/25/2017 3:10 PM, Yi Zhang wrote:
>
>
> On 08/24/2017 08:11 PM, Max Gurtovoy wrote:
>>
>>
>> On 4/25/2017 9:06 PM, Leon Romanovsky wrote:
>>> On Thu, Apr 20, 2017@07:21:29PM +0300, Sagi Grimberg wrote:
>>>>
>>>>> [1]
>>>>> [ 5968.515237] DMAR: DRHD: handling fault status reg 2
>>>>> [ 5968.519449] mlx5_2:dump_cqe:262:(pid 0): dump error cqe
>>>>> [ 5968.519450] 00000000 00000000 00000000 00000000
>>>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>>>> [ 5968.519451] 00000000 00000000 00000000 00000000
>>>>> [ 5968.519452] 00000000 02005104 00000316 a71710e3
>>>>
>>>> Max, can you decode this for us?
>>>
>>> I'm not Max, and maybe he will shed more light on it. I didn't find
>>> such an error in our documentation.
>>
>>
>> Sorry for the late response.
>>
>> Yi Zhang,
>> Is it still reproducible?
>>
> Hi Max
> The good news is that the NULL pointer can no longer be reproduced with
> 4.13.0-rc6.
>
> But I found the below errors on the target and client sides during the test.
> Client side:
> rdma-virt-03 login: [  927.033550] print_req_error: I/O error, dev
> nvme0n1, sector 140477384
> [  927.033577] print_req_error: I/O error, dev nvme0n1, sector 271251016
> [  927.033579] Buffer I/O error on dev nvme0n1, logical block 33906377,
> lost async page write
> [  927.033583] Buffer I/O error on dev nvme0n1, logical block 33906378,
> lost async page write
> [  927.033584] Buffer I/O error on dev nvme0n1, logical block 33906379,
> lost async page write
> [  927.033585] Buffer I/O error on dev nvme0n1, logical block 33906380,
> lost async page write
> [  927.033586] Buffer I/O error on dev nvme0n1, logical block 33906381,
> lost async page write
> [  927.033586] Buffer I/O error on dev nvme0n1, logical block 33906382,
> lost async page write
> [  927.033587] Buffer I/O error on dev nvme0n1, logical block 33906383,
> lost async page write
> [  927.033588] Buffer I/O error on dev nvme0n1, logical block 33906384,
> lost async page write
> [  927.033591] print_req_error: I/O error, dev nvme0n1, sector 271299456
> [  927.033592] Buffer I/O error on dev nvme0n1, logical block 33912432,
> lost async page write
> [  927.033593] Buffer I/O error on dev nvme0n1, logical block 33912433,
> lost async page write
> [  927.033600] print_req_error: I/O error, dev nvme0n1, sector 271299664
> [  927.033606] print_req_error: I/O error, dev nvme0n1, sector 271300200
> [  927.033610] print_req_error: I/O error, dev nvme0n1, sector 271198824
> [  927.033617] print_req_error: I/O error, dev nvme0n1, sector 271201256
> [  927.033621] print_req_error: I/O error, dev nvme0n1, sector 271251224
> [  927.033624] print_req_error: I/O error, dev nvme0n1, sector 271251280
> [  927.033632] print_req_error: I/O error, dev nvme0n1, sector 271251696
> [  957.561764] print_req_error: 243 callbacks suppressed
> [  957.567643] print_req_error: I/O error, dev nvme0n1, sector 140682256
> [  957.575049] buffer_io_error: 1965 callbacks suppressed
> [  957.581006] Buffer I/O error on dev nvme0n1, logical block 17585282,
> lost async page write
> [  957.590477] Buffer I/O error on dev nvme0n1, logical block 17585283,
> lost async page write
> [  957.599946] Buffer I/O error on dev nvme0n1, logical block 17585284,
> lost async page write
> [  957.609406] Buffer I/O error on dev nvme0n1, logical block 17585285,
> lost async page write
> [  957.618874] Buffer I/O error on dev nvme0n1, logical block 17585286,
> lost async page write
> [  957.628345] print_req_error: I/O error, dev nvme0n1, sector 140692416
> [  957.635788] Buffer I/O error on dev nvme0n1, logical block 17586552,
> lost async page write
> [  957.645290] Buffer I/O error on dev nvme0n1, logical block 17586553,
> lost async page write
> [  957.654790] Buffer I/O error on dev nvme0n1, logical block 17586554,
> lost async page write
> [  957.664292] print_req_error: I/O error, dev nvme0n1, sector 140693744
> [  957.671767] Buffer I/O error on dev nvme0n1, logical block 17586718,
> lost async page write
> [  957.681299] Buffer I/O error on dev nvme0n1, logical block 17586719,
> lost async page write
> [  957.690833] print_req_error: I/O error, dev nvme0n1, sector 140697416
> [  957.698345] print_req_error: I/O error, dev nvme0n1, sector 140697664
> [  957.705855] print_req_error: I/O error, dev nvme0n1, sector 140698576
> [  957.713367] print_req_error: I/O error, dev nvme0n1, sector 140699656
> [  957.720877] print_req_error: I/O error, dev nvme0n1, sector 140701768
> [  957.728390] print_req_error: I/O error, dev nvme0n1, sector 140702728
> [  957.735902] print_req_error: I/O error, dev nvme0n1, sector 140705304
> [  957.744235] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.750308] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.757941] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.764030] nvme nvme0: Queueing INV WR for rkey 0x1a1d9f failed (-12)
> [  957.771687] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.777799] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.785465] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.791587] nvme nvme0: Queueing INV WR for rkey 0x1a1da0 failed (-12)
> [  957.799262] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.805391] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.805396] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.819307] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.819318] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.833260] mlx5_2:mlx5_ib_post_send:3846:(pid 1007):
> [  957.833268] nvme nvme0: Queueing INV WR for rkey 0x1a1da1 failed (-12)
> [  957.847263] nvme nvme0: Queueing INV WR for rkey 0x1a1fa1 failed (-12)
> [  957.855006] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.861254] nvme nvme0: nvme_rdma_post_send failed with error code -12
> [  957.869004] mlx5_2:mlx5_ib_post_send:3846:(pid 1254):
> [  957.875192] nvme nvme0: Queueing INV WR for rkey 0x1a1da2 failed (-12)
> [  987.962014] print_req_error: 244 callbacks suppressed
> [  987.968150] print_req_error: I/O error, dev nvme0n1, sector 140819704
> [  987.975829] buffer_io_error: 1826 callbacks suppressed
> [  987.982058] Buffer I/O error on dev nvme0n1, logical block 17602463,
> lost async page write
> [  987.991803] Buffer I/O error on dev nvme0n1, logical block 17602464,
> lost async page write
> [  988.001547] Buffer I/O error on dev nvme0n1, logical block 17602465,
> lost async page write

I couldn't reproduce it, but for some reason you got an overflow in the
QP send queue. It looks like something might be wrong with the
calculation (probably the signaling calculation).

Please supply more details:
1. Link layer?
2. HCA type + FW versions on the target/host sides?
3. Back-to-back (B2B) connection?

Try this one as a first step:

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 82fcb07..1437306 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,6 +88,7 @@ struct nvme_rdma_queue {
         struct nvme_rdma_qe     *rsp_ring;
         atomic_t                sig_count;
         int                     queue_size;
+       int                     limit_mask;
         size_t                  cmnd_capsule_len;
         struct nvme_rdma_ctrl   *ctrl;
         struct nvme_rdma_device *device;
@@ -521,6 +522,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,

         queue->queue_size = queue_size;
         atomic_set(&queue->sig_count, 0);
+       queue->limit_mask = (min(32, 1 << ilog2((queue->queue_size + 1) / 2))) - 1;

         queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
                         RDMA_PS_TCP, IB_QPT_RC);
@@ -1009,9 +1011,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
   */
  static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
  {
-       int limit = 1 << ilog2((queue->queue_size + 1) / 2);
-
-       return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
+       return (atomic_inc_return(&queue->sig_count) & (queue->limit_mask)) == 0;
  }

  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
  2017-08-25 22:57                         ` Max Gurtovoy
@ 2017-08-31  7:15                             ` Yi Zhang
  -1 siblings, 0 replies; 14+ messages in thread
From: Yi Zhang @ 2017-08-31  7:15 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky, Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


> I couldn't reproduce it, but for some reason you got an overflow in the
> QP send queue. It looks like something might be wrong with the
> calculation (probably the signaling calculation).
>
> Please supply more details:
> 1. Link layer?
> 2. HCA type + FW versions on the target/host sides?
> 3. Back-to-back (B2B) connection?
>
> Try this one as a first step:
>
Hi Max
I retested this issue on 4.13.0-rc6/4.13.0-rc7 without your patch and
found that it can no longer be reproduced.
Here is my environment:
link layer: mlx5_roce
HCA:
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Firmware:
[   13.489854] mlx5_core 0000:04:00.0: firmware version: 12.18.1000
[   14.360121] mlx5_core 0000:04:00.1: firmware version: 12.18.1000
[   15.091088] mlx5_core 0000:05:00.0: firmware version: 14.18.1000
[   15.936417] mlx5_core 0000:05:00.1: firmware version: 14.18.1000
The two servers are connected through a switch.

I will let you know and retest with your patch if I reproduce it in the future.

Thanks
Yi

> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 82fcb07..1437306 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -88,6 +88,7 @@ struct nvme_rdma_queue {
>         struct nvme_rdma_qe     *rsp_ring;
>         atomic_t                sig_count;
>         int                     queue_size;
> +       int                     limit_mask;
>         size_t                  cmnd_capsule_len;
>         struct nvme_rdma_ctrl   *ctrl;
>         struct nvme_rdma_device *device;
> @@ -521,6 +522,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>
>         queue->queue_size = queue_size;
>         atomic_set(&queue->sig_count, 0);
> +       queue->limit_mask = (min(32, 1 << ilog2((queue->queue_size + 1) / 2))) - 1;
>
>         queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
>                         RDMA_PS_TCP, IB_QPT_RC);
> @@ -1009,9 +1011,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>   */
>  static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>  {
> -       int limit = 1 << ilog2((queue->queue_size + 1) / 2);
> -
> -       return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
> +       return (atomic_inc_return(&queue->sig_count) & (queue->limit_mask)) == 0;
>  }
>
>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>
>
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


^ permalink raw reply	[flat|nested] 14+ messages in thread

* kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7
@ 2017-08-31  7:15                             ` Yi Zhang
  0 siblings, 0 replies; 14+ messages in thread
From: Yi Zhang @ 2017-08-31  7:15 UTC (permalink / raw)



> I couldn't reproduce it, but for some reason you got an overflow in the
> QP send queue. It looks like something might be wrong with the
> calculation (probably the signaling calculation).
>
> Please supply more details:
> 1. Link layer?
> 2. HCA type + FW versions on the target/host sides?
> 3. Back-to-back (B2B) connection?
>
> Try this one as a first step:
>
Hi Max
I retested this issue on 4.13.0-rc6/4.13.0-rc7 without your patch and
found that it can no longer be reproduced.
Here is my environment:
link layer: mlx5_roce
HCA:
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Firmware:
[   13.489854] mlx5_core 0000:04:00.0: firmware version: 12.18.1000
[   14.360121] mlx5_core 0000:04:00.1: firmware version: 12.18.1000
[   15.091088] mlx5_core 0000:05:00.0: firmware version: 14.18.1000
[   15.936417] mlx5_core 0000:05:00.1: firmware version: 14.18.1000
The two servers are connected through a switch.

I will let you know and retest with your patch if I reproduce it in the future.

Thanks
Yi

> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 82fcb07..1437306 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -88,6 +88,7 @@ struct nvme_rdma_queue {
>         struct nvme_rdma_qe     *rsp_ring;
>         atomic_t                sig_count;
>         int                     queue_size;
> +       int                     limit_mask;
>         size_t                  cmnd_capsule_len;
>         struct nvme_rdma_ctrl   *ctrl;
>         struct nvme_rdma_device *device;
> @@ -521,6 +522,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>
>         queue->queue_size = queue_size;
>         atomic_set(&queue->sig_count, 0);
> +       queue->limit_mask = (min(32, 1 << ilog2((queue->queue_size + 1) / 2))) - 1;
>
>         queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
>                         RDMA_PS_TCP, IB_QPT_RC);
> @@ -1009,9 +1011,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>   */
>  static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>  {
> -       int limit = 1 << ilog2((queue->queue_size + 1) / 2);
> -
> -       return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
> +       return (atomic_inc_return(&queue->sig_count) & (queue->limit_mask)) == 0;
>  }
>
>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>
>
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-08-31  7:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1575599772.13802712.1492592011810.JavaMail.zimbra@redhat.com>
     [not found] ` <1575599772.13802712.1492592011810.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-04-20  6:03   ` kernel NULL pointer during reset_controller operation with IO on 4.11.0-rc7 Yi Zhang
2017-04-20  6:03     ` Yi Zhang
     [not found]     ` <1413097100.14743757.1492668219336.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-04-20 16:21       ` Sagi Grimberg
2017-04-20 16:21         ` Sagi Grimberg
     [not found]         ` <97bb90ec-4337-62f7-f08d-a673975a5637-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-04-25 18:06           ` Leon Romanovsky
2017-04-25 18:06             ` Leon Romanovsky
     [not found]             ` <20170425180630.GU14088-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-08-24 12:11               ` Max Gurtovoy
2017-08-24 12:11                 ` Max Gurtovoy
     [not found]                 ` <cbc43d35-27c9-331d-b345-b09530477d74-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-08-25 12:10                   ` Yi Zhang
2017-08-25 12:10                     ` Yi Zhang
     [not found]                     ` <39bb8b67-4018-09bd-9d7d-a8f8534084a7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-25 22:57                       ` Max Gurtovoy
2017-08-25 22:57                         ` Max Gurtovoy
     [not found]                         ` <7ceef67d-4424-97d5-02f5-7569a1f5a20e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-08-31  7:15                           ` Yi Zhang
2017-08-31  7:15                             ` Yi Zhang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.