* Linux 4.9.8 + NVMe CiB Issue @ 2017-02-15 19:27 Marc Smith 2017-02-15 21:12 ` Keith Busch 0 siblings, 1 reply; 11+ messages in thread From: Marc Smith @ 2017-02-15 19:27 UTC (permalink / raw)

Hi,

I'm testing with a Supermicro SSG-2028R-DN2R40L NVMe CiB (cluster-in-a-box)
solution. The performance is amazing so far, but I experienced an issue
during a performance test while using the fio tool.

Linux 4.9.8
fio 2.14

We have just (8) NVMe drives in the "enclosure", and it contains two server
nodes, but right now we're testing from just one of the nodes.

This is the command we ran:

fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12 \
    --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1 \
    --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1 \
    --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1

After a few seconds, we noticed the performance numbers started dropping and
flaking out. This is what we saw in the kernel logs:

--snip--
[70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
[70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting
[70961.868670] nvme nvme0: I/O 1011 QID 1 timeout, aborting
[70961.868673] nvme nvme0: I/O 1013 QID 1 timeout, aborting
[70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller
[71022.727229] nvme nvme0: I/O 237 QID 0 timeout, reset controller
[71052.589227] nvme nvme0: completing aborted command with status: 0007
[71052.589230] blk_update_request: I/O error, dev nvme0n1, sector 1699051304
[71052.589241] blk_update_request: I/O error, dev nvme0n1, sector 921069792
[71052.589243] blk_update_request: I/O error, dev nvme0n1, sector 503421912
[71052.589247] blk_update_request: I/O error, dev nvme0n1, sector 459191600
[71052.589250] blk_update_request: I/O error, dev nvme0n1, sector 541938152
[71052.589253] blk_update_request: I/O error, dev nvme0n1, sector 454021704
[71052.589255] blk_update_request: I/O error, dev nvme0n1, sector 170843976
[71052.589258] blk_update_request: I/O error, dev nvme0n1, sector 1632333960
[71052.589260] blk_update_request: I/O error, dev nvme0n1, sector 463726632
[71052.589262] blk_update_request: I/O error, dev nvme0n1, sector 1402584824
[... hundreds of further "nvme nvme0: completing aborted command with status: 0007" messages trimmed ...]
[71052.589333] fio[3834]: segfault at 8 ip 000000000046905d sp 00007ffd2d545590 error 4 in fio[400000+9d000]
[... identical segfaults for fio PIDs 3832, 3831, 3836, 3844, 3841, 3830, 3835, 3845, and 3833 trimmed ...]
[71052.589692] nvme nvme0: Abort status: 0x7
[71052.589694] nvme nvme0: Abort status: 0x7
[71052.589695] nvme nvme0: Abort status: 0x7
[71052.589708] nvme nvme0: completing aborted command with status: fffffffc
[71052.589792] nvme nvme0: Abort status: 0xfffc
[71052.590177] ------------[ cut here ]------------
[71052.590188] WARNING: CPU: 4 PID: 3771 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x53/0x5f
[71052.590190] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:04.0/0000:05:00.0/nvme/nvme0/cmb'
[71052.590191] Modules linked in: fcst(O) scst_changer(O) scst_tape(O) scst_vdisk(O) scst_disk(O) ib_srpt(O) iscsi_scst(O) qla2x00tgt(O) scst(O) qla2xxx bonding mlx5_core bna ib_umad rdma_ucm ib_uverbs ib_srp iw_nes iw_cxgb4 cxgb4 iw_cxgb3 ib_qib rdmavt mlx4_ib ib_mthca
[71052.590213] CPU: 4 PID: 3771 Comm: kworker/u113:2 Tainted: G O 4.9.8-esos.prod #1
[71052.590213] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, BIOS 2.0 10/28/2016
[71052.590221] Workqueue: nvme nvme_reset_work
[71052.590223] 0000000000000000 ffffffff81397af8 ffffc9002440fcf0 0000000000000000
[71052.590225] ffffffff81065550 ffff880806aa0000 ffffc9002440fd48 ffff88085bb855a0
[71052.590226] ffff88085b755000 00000000b2000000 ffff880858b90138 ffffffff810655af
[71052.590229] Call Trace:
[71052.590238] [<ffffffff81397af8>] ? dump_stack+0x46/0x59
[71052.590244] [<ffffffff81065550>] ? __warn+0xc8/0xe1
[71052.590245] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e
[71052.590248] [<ffffffff81182ce0>] ? kernfs_path_from_node+0x4e/0x58
[71052.590250] [<ffffffff81184e5c>] ? sysfs_warn_dup+0x53/0x5f
[71052.590253] [<ffffffff81184c39>] ? sysfs_add_file_mode_ns+0xd1/0x14d
[71052.590254] [<ffffffff81184d7c>] ? sysfs_add_file_to_group+0x3c/0x4b
[71052.590256] [<ffffffff815dae5f>] ? nvme_reset_work+0x415/0xb25
[71052.590260] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b
[71052.590262] [<ffffffff81077096>] ? worker_thread+0x26e/0x356
[71052.590264] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0
[71052.590266] [<ffffffff810686e8>] ? do_group_exit+0x39/0x91
[71052.590268] [<ffffffff8107abe2>] ? kthread+0xc2/0xca
[71052.590269] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e
[71052.590275] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30
[71052.590276] ---[ end trace 25be46e93007ecdb ]---
[71052.590278] nvme 0000:05:00.0: failed to add sysfs attribute for CMB
[71112.831251] nvme nvme0: I/O 110 QID 0 timeout, disable controller
[71112.831259] ------------[ cut here ]------------
[71112.831267] WARNING: CPU: 4 PID: 788 at kernel/irq/manage.c:1478 __free_irq+0x93/0x1ed
[71112.831268] Trying to free already-free IRQ 109
[71112.831269] Modules linked in: fcst(O) scst_changer(O) scst_tape(O) scst_vdisk(O) scst_disk(O) ib_srpt(O) iscsi_scst(O) qla2x00tgt(O) scst(O) qla2xxx bonding mlx5_core bna ib_umad rdma_ucm ib_uverbs ib_srp iw_nes iw_cxgb4 cxgb4 iw_cxgb3 ib_qib rdmavt mlx4_ib ib_mthca
[71112.831290] CPU: 4 PID: 788 Comm: kworker/4:1H Tainted: G W O 4.9.8-esos.prod #1
[71112.831291] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, BIOS 2.0 10/28/2016
[71112.831297] Workqueue: kblockd blk_mq_timeout_work
[71112.831299] 0000000000000000 ffffffff81397af8 ffffc900098a7b88 0000000000000000
[71112.831301] ffffffff81065550 ffff880850b26e00 ffffc900098a7be0 000000000000006d
[71112.831303] ffff880850b26e9c ffff88085857ec80 0000000000000246 ffffffff810655af
[71112.831305] Call Trace:
[71112.831311] [<ffffffff81397af8>] ? dump_stack+0x46/0x59
[71112.831315] [<ffffffff81065550>] ? __warn+0xc8/0xe1
[71112.831316] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e
[71112.831319] [<ffffffff8139c224>] ? __radix_tree_lookup+0x2c/0x93
[71112.831320] [<ffffffff8109d5b0>] ? __free_irq+0x93/0x1ed
[71112.831322] [<ffffffff8109d7a5>] ? free_irq+0x61/0x72
[71112.831327] [<ffffffff815d9e2b>] ? nvme_suspend_queue+0x66/0x6b
[71112.831328] [<ffffffff815da001>] ? nvme_dev_disable+0x96/0x30b
[71112.831332] [<ffffffff8148ebd6>] ? dev_warn+0x50/0x58
[71112.831334] [<ffffffff815da368>] ? nvme_timeout+0x59/0x186
[71112.831337] [<ffffffff810900e9>] ? complete+0x2b/0x3a
[71112.831341] [<ffffffff81083ff6>] ? sched_clock_cpu+0xc/0x95
[71112.831343] [<ffffffff8138691f>] ? blk_mq_rq_timed_out+0x27/0x5e
[71112.831345] [<ffffffff8138816f>] ? bt_for_each+0xaf/0xca
[71112.831346] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e
[71112.831347] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e
[71112.831349] [<ffffffff813885df>] ? blk_mq_queue_tag_busy_iter+0x7b/0x88
[71112.831350] [<ffffffff81385677>] ? blk_mq_timeout_work+0x7d/0x102
[71112.831354] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b
[71112.831355] [<ffffffff81077096>] ? worker_thread+0x26e/0x356
[71112.831357] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0
[71112.831358] [<ffffffff8107abe2>] ? kthread+0xc2/0xca
[71112.831360] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e
[71112.831364] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30
[71112.831365] ---[ end trace 25be46e93007ecdc ]---
[71112.831664] nvme nvme0: Removing after probe failure status: -4
[71112.831684] nvme0n1: detected capacity change from 1000204886016 to 0
--snip--

The nvme0 and nvme0n1 devices disappeared from /dev, and on the other server
connected to the same NVMe drives, we see this in the kernel messages:

--snip--
[70533.983041] nvme 0000:05:00.0: Failed status: 0x3, reset controller.
[70533.984225] ------------[ cut here ]------------
[70533.984237] WARNING: CPU: 1 PID: 2691 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x53/0x5f
[70533.984238] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:04.0/0000:05:00.0/nvme/nvme0/cmb'
[... same module list, nvme_reset_work workqueue, and sysfs_warn_dup call trace as on the first node trimmed ...]
[70533.984322] ---[ end trace 7dbcf09b49326265 ]---
[70533.984323] nvme 0000:05:00.0: failed to add sysfs attribute for CMB
[70539.311337] nvme nvme0: Removing after probe failure status: -19
[70539.311471] nvme0n1: detected capacity change from 1000204886016 to 0
--snip--

Is it possible one of our NVMe drives (nvme0) is bad? Or is this something
else? Again, we weren't trying to run any concurrent I/O from the second
server; we were only running fio against the drives on one of the servers.

Any help or advice would be greatly appreciated. I'm new to this list, so if
this is the wrong forum, please let me know.

Thanks, Marc

^ permalink raw reply [flat|nested] 11+ messages in thread
* Linux 4.9.8 + NVMe CiB Issue 2017-02-15 19:27 Linux 4.9.8 + NVMe CiB Issue Marc Smith @ 2017-02-15 21:12 ` Keith Busch 2017-02-16 15:37 ` Marc Smith 0 siblings, 1 reply; 11+ messages in thread From: Keith Busch @ 2017-02-15 21:12 UTC (permalink / raw)

On Wed, Feb 15, 2017 at 02:27:13PM -0500, Marc Smith wrote:
> Hi,
>
> I'm testing with a Supermicro SSG-2028R-DN2R40L NVMe CiB
> (cluster-in-a-box) solution. The performance is amazing so far, but I
> experienced an issue during a performance test while using the fio
> tool.
>
> Linux 4.9.8
> fio 2.14
>
> We have just (8) NVMe drives in the "enclosure", and it contains two
> server nodes, but right now we're just testing from one of the nodes.
>
> This is the command we ran:
> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12
> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1
> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1
> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1
>
> After a few seconds, noticed the performance numbers started dropping,
> and started flaking out. This is what we saw in the kernel logs:

It looks like your controller stopped posting completions to commands.

There is some excessive kernel log spamming going on here, but that fix is
already staged for 4.11 inclusion here:

http://git.infradead.org/nvme.git/commitdiff/7bf7d778620d83f14fcd92d0938fb97c7d78bf19?hp=9a69b0ed6257ae5e71c99bf21ce53f98c558476a

As to why the driver was triggered to abort I/O in the first place, that
appears to be the device not posting completions on time. As far as I can
tell, blk-mq's timeout handling won't mistakenly time out a command on the
initial abort, and the default 30-second timeout should be more than enough
for your workload.

There does appear to be a small window where blk-mq can miss a completion,
though: blk-mq's timeout handler sets the REQ_ATOM_COMPLETE flag while
running the timeout handler, which blocks a natural completion from
occurring while set.
So if a real completion did occur, then that completion is lost, which will
force the subsequent timeout handler to issue a controller reset.

But I don't think that's what's happening here. You are getting timeouts on
admin commands as well, so that really looks like your controller just
stopped responding.

> --snip--
> [70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
> [70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting
> [70961.868670] nvme nvme0: I/O 1011 QID 1 timeout, aborting
> [70961.868673] nvme nvme0: I/O 1013 QID 1 timeout, aborting
> [70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller
> [71022.727229] nvme nvme0: I/O 237 QID 0 timeout, reset controller

^ permalink raw reply [flat|nested] 11+ messages in thread
* Linux 4.9.8 + NVMe CiB Issue 2017-02-15 21:12 ` Keith Busch @ 2017-02-16 15:37 ` Marc Smith 2017-02-20 16:43 ` Marc Smith 0 siblings, 1 reply; 11+ messages in thread From: Marc Smith @ 2017-02-16 15:37 UTC (permalink / raw)

On Wed, Feb 15, 2017 at 4:12 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Wed, Feb 15, 2017 at 02:27:13PM -0500, Marc Smith wrote:
>> Hi,
>>
>> I'm testing with a Supermicro SSG-2028R-DN2R40L NVMe CiB
>> (cluster-in-a-box) solution. The performance is amazing so far, but I
>> experienced an issue during a performance test while using the fio
>> tool.
>>
>> Linux 4.9.8
>> fio 2.14
>>
>> We have just (8) NVMe drives in the "enclosure", and it contains two
>> server nodes, but right now we're just testing from one of the nodes.
>>
>> This is the command we ran:
>> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12
>> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1
>> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1
>> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1
>>
>> After a few seconds, noticed the performance numbers started dropping,
>> and started flaking out. This is what we saw in the kernel logs:
>
> It looks like your controller stopped posting completions to commands.
>
> There is some excessive kernel spamming going on here, though, but that
> fix is already staged for 4.11 inclusion here:
>
> http://git.infradead.org/nvme.git/commitdiff/7bf7d778620d83f14fcd92d0938fb97c7d78bf19?hp=9a69b0ed6257ae5e71c99bf21ce53f98c558476a
>
> As to why the driver was triggered to abort IO in the first place, that
> appears to be the device not posting completions on time. As far as I
> can tell, blk-mq's timeout handling won't mistakenly time out a command
> on the initial abort, and the default 30 second timeout should be more
> than enough for your workload.
>
> There does appear to be a small window where blk-mq can miss a completion,
> though: blk-mq's timeout handler sets the REQ_ATOM_COMPLETE flag
> while running the timeout handler, which blocks a natural completion
> from occurring while set. So if a real completion did occur, then that
> completion is lost, which will force the subsequent timeout handler to
> issue a controller reset.
>
> But I don't think that's what's happening here. You are getting
> timeouts on admin commands as well, so that really looks like your
> controller just stopped responding.

Thanks for the quick reply; so, I rebooted both nodes that are attached to
the NVMe drives, and all (8) NVMe devices appear on both nodes after
reboot, including the nvme0 device that had disappeared before.

I then hammered the nvme0n1 device by itself using fio to try to reproduce
this same issue, but couldn't. I ran fio with huge I/O depths and job
counts for hours, and still couldn't reproduce it.

Then, I ran the same test that caused the original issue, running fio
across all (8) of the NVMe drives:

fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12 \
    --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1 \
    --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1 \
    --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1

Within a minute or two, I noticed similar kernel messages, but with a
different NVMe drive this time; as before, the NVMe drive disappears from
both servers.
Here are the logs:

--snip--
[ 5329.400622] nvme nvme7: I/O 930 QID 1 timeout, aborting
[ 5329.400628] nvme nvme7: I/O 931 QID 1 timeout, aborting
[ 5329.400630] nvme nvme7: I/O 932 QID 1 timeout, aborting
[ 5329.400631] nvme nvme7: I/O 933 QID 1 timeout, aborting
[ 5359.413460] nvme nvme7: I/O 930 QID 1 timeout, reset controller
[ 5389.618277] nvme nvme7: I/O 130 QID 0 timeout, reset controller
[ 5419.927512] nvme nvme7: completing aborted command with status: 0007
[ 5419.927517] blk_update_request: I/O error, dev nvme7n1, sector 635173168
[ 5419.927521] blk_update_request: I/O error, dev nvme7n1, sector 1500458320
[ 5419.927524] blk_update_request: I/O error, dev nvme7n1, sector 1402863672
[ 5419.927526] blk_update_request: I/O error, dev nvme7n1, sector 1930822624
[ 5419.927531] blk_update_request: I/O error, dev nvme7n1, sector 1060940040
[ 5419.927537] blk_update_request: I/O error, dev nvme7n1, sector 697098256
[ 5419.927543] blk_update_request: I/O error, dev nvme7n1, sector 1803921928
[ 5419.927548] blk_update_request: I/O error, dev nvme7n1, sector 231221800
[ 5419.927550] blk_update_request: I/O error, dev nvme7n1, sector 1411290456
[ 5419.927551] blk_update_request: I/O error, dev nvme7n1, sector 1888075000
[ ... hundreds of repeated "nvme nvme7: completing aborted command with status: 0007" lines trimmed ... ]
[ 5419.927873] nvme nvme7: Abort status: 0x7
[ 5419.927874] nvme nvme7: Abort status: 0x7
[ 5419.927875] nvme nvme7: Abort status: 0x7
[ 5419.927896] nvme nvme7: completing aborted command with status: fffffffc
[ 5419.927966] nvme nvme7: Abort status: 0xfffc
[ 5419.928223] ------------[ cut here ]------------
[ 5419.928233] WARNING: CPU: 19 PID: 1438 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x53/0x5f
[ 5419.928234] sysfs: cannot create duplicate filename '/devices/pci0000:80/0000:80:03.0/0000:82:00.0/0000:83:07.0/0000:87:00.0/nvme/nvme7/cmb'
[ 5419.928235] Modules linked in: qla2xxx bonding mlx5_core bna
[ 5419.928241] CPU: 19 PID: 1438 Comm: kworker/u114:1 Not tainted 4.9.8-esos.prod #1
[ 5419.928242] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, BIOS 2.0 10/28/2016
[ 5419.928248] Workqueue: nvme nvme_reset_work
[ 5419.928250] 0000000000000000 ffffffff81397af8 ffffc9000b00fcf0 0000000000000000
[ 5419.928252] ffffffff81065550 ffff88085a139000 ffffc9000b00fd48 ffff880858b21d20
[ 5419.928254] ffff88105b34b000 00000000e6000000 ffff881059993938 ffffffff810655af
[ 5419.928256] Call Trace:
[ 5419.928264] [<ffffffff81397af8>] ? dump_stack+0x46/0x59
[ 5419.928269] [<ffffffff81065550>] ? __warn+0xc8/0xe1
[ 5419.928271] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e
[ 5419.928275] [<ffffffff81182ce0>] ? kernfs_path_from_node+0x4e/0x58
[ 5419.928277] [<ffffffff81184e5c>] ? sysfs_warn_dup+0x53/0x5f
[ 5419.928279] [<ffffffff81184c39>] ? sysfs_add_file_mode_ns+0xd1/0x14d
[ 5419.928281] [<ffffffff81184d7c>] ? sysfs_add_file_to_group+0x3c/0x4b
[ 5419.928282] [<ffffffff815dae5f>] ? nvme_reset_work+0x415/0xb25
[ 5419.928286] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b
[ 5419.928287] [<ffffffff81077096>] ? worker_thread+0x26e/0x356
[ 5419.928289] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0
[ 5419.928292] [<ffffffff810686e8>] ? do_group_exit+0x39/0x91
[ 5419.928295] [<ffffffff8107abe2>] ? kthread+0xc2/0xca
[ 5419.928296] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e
[ 5419.928300] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30
[ 5419.928310] ---[ end trace 9d1818416c7876e0 ]---
[ 5419.928313] nvme 0000:87:00.0: failed to add sysfs attribute for CMB
[ 5420.031107] do_IRQ: 1.66 No irq handler for vector
[ 5480.488746] nvme nvme7: I/O 215 QID 0 timeout, disable controller
[ 5480.488755] ------------[ cut here ]------------
[ 5480.488766] WARNING: CPU: 19 PID: 1233 at kernel/irq/manage.c:1478 __free_irq+0x93/0x1ed
[ 5480.488767] Trying to free already-free IRQ 443
[ 5480.488768] Modules linked in: qla2xxx bonding mlx5_core bna
[ 5480.488775] CPU: 19 PID: 1233 Comm: kworker/19:1H Tainted: G W 4.9.8-esos.prod #1
[ 5480.488776] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, BIOS 2.0 10/28/2016
[ 5480.488785] Workqueue: kblockd blk_mq_timeout_work
[ 5480.488786] 0000000000000000 ffffffff81397af8 ffffc9000be9fb88 0000000000000000
[ 5480.488789] ffffffff81065550 ffff880859f08800 ffffc9000be9fbe0 00000000000001bb
[ 5480.488792] ffff880859f0889c ffff8810584abb00 0000000000000246 ffffffff810655af
[ 5480.488794] Call Trace:
[ 5480.488801] [<ffffffff81397af8>] ? dump_stack+0x46/0x59
[ 5480.488808] [<ffffffff81065550>] ? __warn+0xc8/0xe1
[ 5480.488810] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e
[ 5480.488812] [<ffffffff8139c224>] ? __radix_tree_lookup+0x2c/0x93
[ 5480.488814] [<ffffffff8109d5b0>] ? __free_irq+0x93/0x1ed
[ 5480.488816] [<ffffffff8109d7a5>] ? free_irq+0x61/0x72
[ 5480.488823] [<ffffffff815d9e2b>] ? nvme_suspend_queue+0x66/0x6b
[ 5480.488825] [<ffffffff815da001>] ? nvme_dev_disable+0x96/0x30b
[ 5480.488831] [<ffffffff8148ebd6>] ? dev_warn+0x50/0x58
[ 5480.488833] [<ffffffff815da368>] ? nvme_timeout+0x59/0x186
[ 5480.488835] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e
[ 5480.488836] [<ffffffff8138691f>] ? blk_mq_rq_timed_out+0x27/0x5e
[ 5480.488838] [<ffffffff8138816f>] ? bt_for_each+0xaf/0xca
[ 5480.488840] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e
[ 5480.488841] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e
[ 5480.488842] [<ffffffff813885df>] ? blk_mq_queue_tag_busy_iter+0x7b/0x88
[ 5480.488844] [<ffffffff81385677>] ? blk_mq_timeout_work+0x7d/0x102
[ 5480.488848] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b
[ 5480.488850] [<ffffffff81077096>] ? worker_thread+0x26e/0x356
[ 5480.488851] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0
[ 5480.488853] [<ffffffff8107abe2>] ? kthread+0xc2/0xca
[ 5480.488855] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e
[ 5480.488861] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30
[ 5480.488862] ---[ end trace 9d1818416c7876e1 ]---
[ 5480.489161] nvme nvme7: Removing after probe failure status: -4
[ 5480.489181] nvme7n1: detected capacity change from 1000204886016 to 0
--snip--

While running this test, fio reports ~2.9M IOPS for the 100% random,
100% read 4K workload; when I see it suddenly drop to ~2.5M IOPS, I
check the kernel logs and find the messages above.
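To spot which controller is timing out (and how often) without
scrolling the whole log, the dmesg text can be tallied with a short
awk pipeline. This is a sketch, not part of the original report; the
here-doc holds a few sample lines copied from the logs above, and in
practice you would pipe `dmesg` output into `tally` instead:

```shell
#!/bin/sh
# Count NVMe I/O-timeout events per controller from kernel log text.
tally() {
    awk '/nvme nvme[0-9]+: I\/O .* timeout/ {
            # Find the "nvmeN:" controller field and strip the colon.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^nvme[0-9]+:$/) { sub(":", "", $i); n[$i]++ }
         }
         END { for (c in n) print c, n[c] }' | sort
}

# Sample lines taken from the logs in this thread.
OUT=$(tally <<'EOF'
[ 5329.400622] nvme nvme7: I/O 930 QID 1 timeout, aborting
[ 5329.400628] nvme nvme7: I/O 931 QID 1 timeout, aborting
[ 5359.413460] nvme nvme7: I/O 930 QID 1 timeout, reset controller
[70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
EOF
)
echo "$OUT"
```

For live use, `dmesg | tally` gives a per-controller event count, which
makes it easy to confirm the "different drive each time" pattern.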
Thanks,

Marc

> >> --snip--
> >> [70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
> >> [70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting
> >> [70961.868670] nvme nvme0: I/O 1011 QID 1 timeout, aborting
> >> [70961.868673] nvme nvme0: I/O 1013 QID 1 timeout, aborting
> >> [70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller
> >> [71022.727229] nvme nvme0: I/O 237 QID 0 timeout, reset controller
* Linux 4.9.8 + NVMe CiB Issue
  2017-02-16 15:37 ` Marc Smith
@ 2017-02-20 16:43 ` Marc Smith
  2017-02-21 16:42 ` Marc Smith
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Smith @ 2017-02-20 16:43 UTC (permalink / raw)

Hi,

A little more information... I attempted the same test using a Linux
4.9.8 kernel with a number of debugging options set, but I am unable to
reproduce the NVMe timeout issue with that kernel configuration. With
the debug kernel, I'm only able to obtain about 1.5 million IOPS, which
is expected with all the extra options set.

So it doesn't appear to be an issue with using all eight of the NVMe
devices at once; it appears to be load/IO related. It's a different
NVMe drive each time that gives the timeout error, so it's not a single
device causing this, at least from what I can tell. I also tried the
same test on the second server connected to the NVMe drives and I'm
able to reproduce the failure on that machine as well, so I don't think
it's related to the physical servers.

Anything else I can check? I realize we'll probably never push 3
million IOPS in production, but I'd like to get any stability issues
sorted out before continuing. =)

Thanks for your time.

--Marc

On Thu, Feb 16, 2017 at 10:37 AM, Marc Smith <marc.smith@mcc.edu> wrote:
> On Wed, Feb 15, 2017 at 4:12 PM, Keith Busch <keith.busch@intel.com> wrote:
>> On Wed, Feb 15, 2017 at 02:27:13PM -0500, Marc Smith wrote:
>>> Hi,
>>>
>>> I'm testing with a Supermicro SSG-2028R-DN2R40L NVMe CiB
>>> (cluster-in-a-box) solution. The performance is amazing so far, but I
>>> experienced an issue during a performance test while using the fio
>>> tool.
>>>
>>> Linux 4.9.8
>>> fio 2.14
>>>
>>> We have just (8) NVMe drives in the "enclosure", and it contains two
>>> server nodes, but right now we're just testing from one of the nodes.
>>>
>>> This is the command we ran:
>>> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12
>>> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1
>>> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1
>>> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1
>>>
>>> After a few seconds, we noticed the performance numbers started
>>> dropping, and started flaking out. This is what we saw in the kernel
>>> logs:
>>
>> It looks like your controller stopped posting completions to commands.
>>
>> There is some excessive kernel spamming going on here, though, but that
>> fix is already staged for 4.11 inclusion here:
>>
>> http://git.infradead.org/nvme.git/commitdiff/7bf7d778620d83f14fcd92d0938fb97c7d78bf19?hp=9a69b0ed6257ae5e71c99bf21ce53f98c558476a
>>
>> As to why the driver was triggered to abort IO in the first place, that
>> appears to be the device not posting completions on time. As far as I
>> can tell, blk-mq's timeout handling won't mistakenly time out a command
>> on the initial abort, and the default 30 second timeout should be more
>> than enough for your workload.
>>
>> There does appear to be a small window where blk-mq can miss a completion,
>> though: blk-mq's timeout handler sets the REQ_ATOM_COMPLETE flag
>> while running the timeout handler, which blocks a natural completion
>> from occurring while set. So if a real completion did occur, then that
>> completion is lost, which will force the subsequent timeout handler to
>> issue a controller reset.
>>
>> But I don't think that's what's happening here. You are getting
>> timeouts on admin commands as well, so that really looks like your
>> controller just stopped responding.
>
> Thanks for the quick reply; so, I rebooted both nodes that are
> attached to the NVMe drives, and all (8) NVMe devices appear on both
> nodes after reboot, including the nvme0 device that had disappeared
> before.
> I then hammered the nvme0n1 device, by itself, using fio to
> try and reproduce this same issue, but couldn't do so. I ran fio with
> huge I/O depths and jobs for hours, and still couldn't reproduce this.
>
> Then, I ran the same test again that caused the original issue,
> running fio across all (8) of the NVMe drives:
> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12
> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1
> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1
> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1
>
> Within a minute or two, I noticed similar kernel messages but with a
> different NVMe drive this time, and as before, the NVMe drive
> disappears from both servers. Here are the logs:
>
> --snip--
> [ 5329.400622] nvme nvme7: I/O 930 QID 1 timeout, aborting
> [ 5329.400628] nvme nvme7: I/O 931 QID 1 timeout, aborting
> [ 5329.400630] nvme nvme7: I/O 932 QID 1 timeout, aborting
> [ 5329.400631] nvme nvme7: I/O 933 QID 1 timeout, aborting
> [ 5359.413460] nvme nvme7: I/O 930 QID 1 timeout, reset controller
> [ 5389.618277] nvme nvme7: I/O 130 QID 0 timeout, reset controller
> [ ... quoted log trimmed; identical to the log shown earlier in the thread ... ]
command with status: 0007 > [ 5419.927819] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927820] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927822] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927824] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927826] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927827] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927829] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927831] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927832] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927835] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927839] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927840] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927842] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927844] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927846] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927847] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927849] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927850] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927852] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927854] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927855] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927859] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927861] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927863] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927865] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927867] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927868] nvme nvme7: completing aborted 
command with status: 0007 > [ 5419.927870] nvme nvme7: completing aborted command with status: 0007 > [ 5419.927873] nvme nvme7: Abort status: 0x7 > [ 5419.927874] nvme nvme7: Abort status: 0x7 > [ 5419.927875] nvme nvme7: Abort status: 0x7 > [ 5419.927896] nvme nvme7: completing aborted command with status: fffffffc > [ 5419.927966] nvme nvme7: Abort status: 0xfffc > [ 5419.928223] ------------[ cut here ]------------ > [ 5419.928233] WARNING: CPU: 19 PID: 1438 at fs/sysfs/dir.c:31 > sysfs_warn_dup+0x53/0x5f > [ 5419.928234] sysfs: cannot create duplicate filename > '/devices/pci0000:80/0000:80:03.0/0000:82:00.0/0000:83:07.0/0000:87:00.0/nvme/nvme7/cmb' > [ 5419.928235] Modules linked in: qla2xxx bonding mlx5_core bna > [ 5419.928241] CPU: 19 PID: 1438 Comm: kworker/u114:1 Not tainted > 4.9.8-esos.prod #1 > [ 5419.928242] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, > BIOS 2.0 10/28/2016 > [ 5419.928248] Workqueue: nvme nvme_reset_work > [ 5419.928250] 0000000000000000 ffffffff81397af8 ffffc9000b00fcf0 > 0000000000000000 > [ 5419.928252] ffffffff81065550 ffff88085a139000 ffffc9000b00fd48 > ffff880858b21d20 > [ 5419.928254] ffff88105b34b000 00000000e6000000 ffff881059993938 > ffffffff810655af > [ 5419.928256] Call Trace: > [ 5419.928264] [<ffffffff81397af8>] ? dump_stack+0x46/0x59 > [ 5419.928269] [<ffffffff81065550>] ? __warn+0xc8/0xe1 > [ 5419.928271] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e > [ 5419.928275] [<ffffffff81182ce0>] ? kernfs_path_from_node+0x4e/0x58 > [ 5419.928277] [<ffffffff81184e5c>] ? sysfs_warn_dup+0x53/0x5f > [ 5419.928279] [<ffffffff81184c39>] ? sysfs_add_file_mode_ns+0xd1/0x14d > [ 5419.928281] [<ffffffff81184d7c>] ? sysfs_add_file_to_group+0x3c/0x4b > [ 5419.928282] [<ffffffff815dae5f>] ? nvme_reset_work+0x415/0xb25 > [ 5419.928286] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b > [ 5419.928287] [<ffffffff81077096>] ? worker_thread+0x26e/0x356 > [ 5419.928289] [<ffffffff81076e28>] ? 
rescuer_thread+0x2a0/0x2a0 > [ 5419.928292] [<ffffffff810686e8>] ? do_group_exit+0x39/0x91 > [ 5419.928295] [<ffffffff8107abe2>] ? kthread+0xc2/0xca > [ 5419.928296] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e > [ 5419.928300] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30 > [ 5419.928310] ---[ end trace 9d1818416c7876e0 ]--- > [ 5419.928313] nvme 0000:87:00.0: failed to add sysfs attribute for CMB > [ 5420.031107] do_IRQ: 1.66 No irq handler for vector > [ 5480.488746] nvme nvme7: I/O 215 QID 0 timeout, disable controller > [ 5480.488755] ------------[ cut here ]------------ > [ 5480.488766] WARNING: CPU: 19 PID: 1233 at kernel/irq/manage.c:1478 > __free_irq+0x93/0x1ed > [ 5480.488767] Trying to free already-free IRQ 443 > [ 5480.488768] Modules linked in: qla2xxx bonding mlx5_core bna > [ 5480.488775] CPU: 19 PID: 1233 Comm: kworker/19:1H Tainted: G > W 4.9.8-esos.prod #1 > [ 5480.488776] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, > BIOS 2.0 10/28/2016 > [ 5480.488785] Workqueue: kblockd blk_mq_timeout_work > [ 5480.488786] 0000000000000000 ffffffff81397af8 ffffc9000be9fb88 > 0000000000000000 > [ 5480.488789] ffffffff81065550 ffff880859f08800 ffffc9000be9fbe0 > 00000000000001bb > [ 5480.488792] ffff880859f0889c ffff8810584abb00 0000000000000246 > ffffffff810655af > [ 5480.488794] Call Trace: > [ 5480.488801] [<ffffffff81397af8>] ? dump_stack+0x46/0x59 > [ 5480.488808] [<ffffffff81065550>] ? __warn+0xc8/0xe1 > [ 5480.488810] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e > [ 5480.488812] [<ffffffff8139c224>] ? __radix_tree_lookup+0x2c/0x93 > [ 5480.488814] [<ffffffff8109d5b0>] ? __free_irq+0x93/0x1ed > [ 5480.488816] [<ffffffff8109d7a5>] ? free_irq+0x61/0x72 > [ 5480.488823] [<ffffffff815d9e2b>] ? nvme_suspend_queue+0x66/0x6b > [ 5480.488825] [<ffffffff815da001>] ? nvme_dev_disable+0x96/0x30b > [ 5480.488831] [<ffffffff8148ebd6>] ? dev_warn+0x50/0x58 > [ 5480.488833] [<ffffffff815da368>] ? 
nvme_timeout+0x59/0x186 > [ 5480.488835] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e > [ 5480.488836] [<ffffffff8138691f>] ? blk_mq_rq_timed_out+0x27/0x5e > [ 5480.488838] [<ffffffff8138816f>] ? bt_for_each+0xaf/0xca > [ 5480.488840] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e > [ 5480.488841] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e > [ 5480.488842] [<ffffffff813885df>] ? blk_mq_queue_tag_busy_iter+0x7b/0x88 > [ 5480.488844] [<ffffffff81385677>] ? blk_mq_timeout_work+0x7d/0x102 > [ 5480.488848] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b > [ 5480.488850] [<ffffffff81077096>] ? worker_thread+0x26e/0x356 > [ 5480.488851] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0 > [ 5480.488853] [<ffffffff8107abe2>] ? kthread+0xc2/0xca > [ 5480.488855] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e > [ 5480.488861] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30 > [ 5480.488862] ---[ end trace 9d1818416c7876e1 ]--- > [ 5480.489161] nvme nvme7: Removing after probe failure status: -4 > [ 5480.489181] nvme7n1: detected capacity change from 1000204886016 to 0 > --snip-- > > While running this test, fio reports ~2.9M IOPS of 100% random 4K reads; when that suddenly drops to ~2.5M IOPS, I look at the kernel logs and see the messages above. > > > Thanks, > > Marc > >> >>> --snip-- >>> [70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting >>> [70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting >>> [70961.868670] nvme nvme0: I/O 1011 QID 1 timeout, aborting >>> [70961.868673] nvme nvme0: I/O 1013 QID 1 timeout, aborting >>> [70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller >>> [71022.727229] nvme nvme0: I/O 237 QID 0 timeout, reset controller
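A dmesg flood like the one above is easier to triage once it is collapsed into counts per error class. A minimal sketch (the grep patterns are my own, and it runs against an embedded sample of the messages so it is self-contained; on a live system, pipe `dmesg` in instead):

```shell
# Tally NVMe error classes in a dmesg capture.
# Sample is embedded so the sketch runs anywhere; on a real system use:
#   dmesg | grep -E 'nvme|blk_update_request'
log='[70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
[70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting
[70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller
[71052.589230] blk_update_request: I/O error, dev nvme0n1, sector 1699051304
[71052.589241] blk_update_request: I/O error, dev nvme0n1, sector 921069792'

# Extract the error class from each line and count occurrences.
printf '%s\n' "$log" \
  | grep -oE 'timeout, (aborting|reset controller|disable controller)|I/O error' \
  | sort | uniq -c | sort -rn
```

On the sample above this prints two "timeout, aborting" hits, two "I/O error" hits, and one "timeout, reset controller", which is the shape of escalation seen in the full logs.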
* Linux 4.9.8 + NVMe CiB Issue 2017-02-20 16:43 ` Marc Smith @ 2017-02-21 16:42 ` Marc Smith 2017-03-09 20:25 ` Marc Smith 0 siblings, 1 reply; 11+ messages in thread From: Marc Smith @ 2017-02-21 16:42 UTC (permalink / raw) I thought I had just resolved this, but it choked again... I should have checked this first, but didn't: the firmware version and available updates. Our drives were on 8DV101D0, and we updated them to 8DV101H0. The release notes for firmware 8DV101H0 even list: - Multiple PCIe Link Dynamic Equalization updates - SQ Header Pointer Mismatch fix - Drive Drop issue under heavy IO testing - Drive unresponsive during concurrent IO and NVMe-MI Basic Management Commands - XN022 Assert Fixes - Sequential read performance improvement at end of life from firmware 8DV101D0 I saw the "Drive Drop issue under heavy IO testing" item and was hopeful it was related. The test seemed to run longer before a drive dropped out, but still, after only a few minutes, several of the NVMe drives disappeared. Should we be opening a support request with Intel (hardware), or is linux-nvme the appropriate forum? Thanks, Marc On Mon, Feb 20, 2017@11:43 AM, Marc Smith <marc.smith@mcc.edu> wrote: > Hi, > > A little more information... I attempted the same test using a Linux > 4.9.8 kernel with a number of debugging options set, but I am unable > to reproduce the NVMe timeout issue with that kernel configuration. > With the debug kernel, I'm only able to obtain about 1.5 million IOPS, > which is expected with all the extra options set. So it doesn't appear > it's an issue with using all eight of the NVMe devices at once; it > appears to be load/IO related. It's a different NVMe drive each time > too that gives the timeout error, so it's not a single device causing this, > at least from what I can tell.
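Since the drives were just reflashed, it may be worth confirming that every controller actually reports the new revision before the next run. A hedged sketch: nvme-cli's `nvme id-ctrl` prints the firmware revision as an `fr` field; here the parsing runs against an embedded sample of that output (the model and serial strings are made up) so it works without hardware:

```shell
# On a live system, loop over the controllers, e.g.:
#   for d in /dev/nvme[0-7]; do nvme id-ctrl "$d" | grep '^fr'; done
# Below we parse a saved sample of 'nvme id-ctrl' output instead.
sample='vid     : 0x8086
mn      : INTEL SSDPE2MD020T4
sn      : CVFT000000000000
fr      : 8DV101H0'

# The 'fr' line has the form "fr : <revision>"; awk field 3 is the value.
fw=$(printf '%s\n' "$sample" | awk '$1 == "fr" { print $3 }')
if [ "$fw" = "8DV101H0" ]; then
    echo "firmware OK: $fw"      # prints "firmware OK: 8DV101H0" on the sample
else
    echo "firmware outdated: $fw" >&2
fi
```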
> > I also tried the same test on the second server connected to the NVMe > drives and I'm able to reproduce the failure on that machine as well, > so I don't think it's related to the physical servers. > > Anything else I can check? I realize we'll probably never push 3 > million IOPS in production, but I'd like to get any stability issues > sorted out before continuing. =) > > Thanks for your time. > > > --Marc > > On Thu, Feb 16, 2017@10:37 AM, Marc Smith <marc.smith@mcc.edu> wrote: >> On Wed, Feb 15, 2017@4:12 PM, Keith Busch <keith.busch@intel.com> wrote: >>> On Wed, Feb 15, 2017@02:27:13PM -0500, Marc Smith wrote: >>>> Hi, >>>> >>>> I'm testing with a Supermicro SSG-2028R-DN2R40L NVMe CiB >>>> (cluster-in-a-box) solution. The performance is amazing so far, but I >>>> experienced an issue during a performance test while using the fio >>>> tool. >>>> >>>> Linux 4.9.8 >>>> fio 2.14 >>>> >>>> We have just (8) NVMe drives in the "enclosure", and it contains two >>>> server nodes, but right now we're just testing from one of the nodes. >>>> >>>> This is the command we ran: >>>> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12 >>>> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1 >>>> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1 >>>> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1 >>>> >>>> After a few seconds, noticed the performance numbers started dropping, >>>> and started flaking out. This is what we saw in the kernel logs: >>> >>> It looks like your controller stopped posting completions to commands. >>> >>> There is some excessive kernel log spamming going on here, though; that >>> fix is already staged for 4.11 inclusion here: >>> >>> http://git.infradead.org/nvme.git/commitdiff/7bf7d778620d83f14fcd92d0938fb97c7d78bf19?hp=9a69b0ed6257ae5e71c99bf21ce53f98c558476a >>> >>> As to why the driver was triggered to abort IO in the first place, that >>> appears to be the device not posting completions on time.
As far as I >>> can tell, blk-mq's timeout handling won't mistakenly time out a command >>> on the initial abort, and the default 30-second timeout should be more >>> than enough for your workload. >>> >>> There does appear to be a small window where blk-mq can miss a completion, >>> though: blk-mq's timeout handler sets the REQ_ATOM_COMPLETE flag >>> while running the timeout handler, which blocks a natural completion >>> from occurring while set. So if a real completion did occur, then that >>> completion is lost, which will force the subsequent timeout handler to >>> issue a controller reset. >>> >>> But I don't think that's what's happening here. You are getting >>> timeouts on admin commands as well, so that really looks like your >>> controller just stopped responding. >> >> Thanks for the quick reply; so, I rebooted both nodes that are >> attached to the NVMe drives, and all (8) NVMe devices appear on both >> nodes after reboot, including the nvme0 device that had disappeared >> before. I then hammered the nvme0n1 device, by itself, using fio to >> try and reproduce this same issue, but couldn't do so. I ran fio with >> huge I/O depths and jobs for hours, and still couldn't reproduce this. >> >> Then, I ran the same test again that caused the original issue, >> running fio across all (8) of the NVMe drives: >> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12 >> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1 >> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1 >> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1 >> >> Within a minute or two, I noticed similar kernel messages but with a >> different NVMe drive this time, and as before, the NVMe drive >> disappears from both servers.
Here are the logs: >> >> --snip-- >> [ 5329.400622] nvme nvme7: I/O 930 QID 1 timeout, aborting >> [ 5329.400628] nvme nvme7: I/O 931 QID 1 timeout, aborting >> [ 5329.400630] nvme nvme7: I/O 932 QID 1 timeout, aborting >> [ 5329.400631] nvme nvme7: I/O 933 QID 1 timeout, aborting >> [ 5359.413460] nvme nvme7: I/O 930 QID 1 timeout, reset controller >> [ 5389.618277] nvme nvme7: I/O 130 QID 0 timeout, reset controller >> [ 5419.927512] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927513] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927515] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927517] blk_update_request: I/O error, dev nvme7n1, sector 635173168 >> [ 5419.927519] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927521] blk_update_request: I/O error, dev nvme7n1, sector 1500458320 >> [ 5419.927523] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927524] blk_update_request: I/O error, dev nvme7n1, sector 1402863672 >> [ 5419.927526] blk_update_request: I/O error, dev nvme7n1, sector 1930822624 >> [ 5419.927527] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927529] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927531] blk_update_request: I/O error, dev nvme7n1, sector 1060940040 >> [ 5419.927532] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927533] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927535] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927536] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927537] blk_update_request: I/O error, dev nvme7n1, sector 697098256 >> [ 5419.927538] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927540] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927541] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927542] nvme nvme7: completing aborted 
command with status: 0007 >> [ 5419.927543] blk_update_request: I/O error, dev nvme7n1, sector 1803921928 >> [ 5419.927546] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927548] blk_update_request: I/O error, dev nvme7n1, sector 231221800 >> [ 5419.927549] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927550] blk_update_request: I/O error, dev nvme7n1, sector 1411290456 >> [ 5419.927551] blk_update_request: I/O error, dev nvme7n1, sector 1888075000 >> [... many identical "nvme nvme7: completing aborted command with status: 0007" messages snipped ...] >> [ 5419.927870] nvme nvme7: completing aborted command with status: 0007 >> [ 5419.927873] nvme nvme7: Abort status: 0x7 >> [ 5419.927874] nvme nvme7: Abort status: 0x7 >> [ 5419.927875] nvme nvme7: Abort status: 0x7 >> [ 5419.927896] nvme nvme7: completing aborted command with status: fffffffc >> [ 5419.927966] nvme nvme7: Abort status: 0xfffc >> [ 5419.928223] ------------[ cut here ]------------ >> [ 5419.928233] WARNING: CPU: 19 PID: 1438 at fs/sysfs/dir.c:31 >> sysfs_warn_dup+0x53/0x5f >> [ 
5419.928234] sysfs: cannot create duplicate filename >> '/devices/pci0000:80/0000:80:03.0/0000:82:00.0/0000:83:07.0/0000:87:00.0/nvme/nvme7/cmb' >> [ 5419.928235] Modules linked in: qla2xxx bonding mlx5_core bna >> [ 5419.928241] CPU: 19 PID: 1438 Comm: kworker/u114:1 Not tainted >> 4.9.8-esos.prod #1 >> [ 5419.928242] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, >> BIOS 2.0 10/28/2016 >> [ 5419.928248] Workqueue: nvme nvme_reset_work >> [ 5419.928250] 0000000000000000 ffffffff81397af8 ffffc9000b00fcf0 >> 0000000000000000 >> [ 5419.928252] ffffffff81065550 ffff88085a139000 ffffc9000b00fd48 >> ffff880858b21d20 >> [ 5419.928254] ffff88105b34b000 00000000e6000000 ffff881059993938 >> ffffffff810655af >> [ 5419.928256] Call Trace: >> [ 5419.928264] [<ffffffff81397af8>] ? dump_stack+0x46/0x59 >> [ 5419.928269] [<ffffffff81065550>] ? __warn+0xc8/0xe1 >> [ 5419.928271] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e >> [ 5419.928275] [<ffffffff81182ce0>] ? kernfs_path_from_node+0x4e/0x58 >> [ 5419.928277] [<ffffffff81184e5c>] ? sysfs_warn_dup+0x53/0x5f >> [ 5419.928279] [<ffffffff81184c39>] ? sysfs_add_file_mode_ns+0xd1/0x14d >> [ 5419.928281] [<ffffffff81184d7c>] ? sysfs_add_file_to_group+0x3c/0x4b >> [ 5419.928282] [<ffffffff815dae5f>] ? nvme_reset_work+0x415/0xb25 >> [ 5419.928286] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b >> [ 5419.928287] [<ffffffff81077096>] ? worker_thread+0x26e/0x356 >> [ 5419.928289] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0 >> [ 5419.928292] [<ffffffff810686e8>] ? do_group_exit+0x39/0x91 >> [ 5419.928295] [<ffffffff8107abe2>] ? kthread+0xc2/0xca >> [ 5419.928296] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e >> [ 5419.928300] [<ffffffff81a7ed22>] ? 
ret_from_fork+0x22/0x30 >> [ 5419.928310] ---[ end trace 9d1818416c7876e0 ]--- >> [ 5419.928313] nvme 0000:87:00.0: failed to add sysfs attribute for CMB >> [ 5420.031107] do_IRQ: 1.66 No irq handler for vector >> [ 5480.488746] nvme nvme7: I/O 215 QID 0 timeout, disable controller >> [ 5480.488755] ------------[ cut here ]------------ >> [ 5480.488766] WARNING: CPU: 19 PID: 1233 at kernel/irq/manage.c:1478 >> __free_irq+0x93/0x1ed >> [ 5480.488767] Trying to free already-free IRQ 443 >> [ 5480.488768] Modules linked in: qla2xxx bonding mlx5_core bna >> [ 5480.488775] CPU: 19 PID: 1233 Comm: kworker/19:1H Tainted: G >> W 4.9.8-esos.prod #1 >> [ 5480.488776] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, >> BIOS 2.0 10/28/2016 >> [ 5480.488785] Workqueue: kblockd blk_mq_timeout_work >> [ 5480.488786] 0000000000000000 ffffffff81397af8 ffffc9000be9fb88 >> 0000000000000000 >> [ 5480.488789] ffffffff81065550 ffff880859f08800 ffffc9000be9fbe0 >> 00000000000001bb >> [ 5480.488792] ffff880859f0889c ffff8810584abb00 0000000000000246 >> ffffffff810655af >> [ 5480.488794] Call Trace: >> [ 5480.488801] [<ffffffff81397af8>] ? dump_stack+0x46/0x59 >> [ 5480.488808] [<ffffffff81065550>] ? __warn+0xc8/0xe1 >> [ 5480.488810] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e >> [ 5480.488812] [<ffffffff8139c224>] ? __radix_tree_lookup+0x2c/0x93 >> [ 5480.488814] [<ffffffff8109d5b0>] ? __free_irq+0x93/0x1ed >> [ 5480.488816] [<ffffffff8109d7a5>] ? free_irq+0x61/0x72 >> [ 5480.488823] [<ffffffff815d9e2b>] ? nvme_suspend_queue+0x66/0x6b >> [ 5480.488825] [<ffffffff815da001>] ? nvme_dev_disable+0x96/0x30b >> [ 5480.488831] [<ffffffff8148ebd6>] ? dev_warn+0x50/0x58 >> [ 5480.488833] [<ffffffff815da368>] ? nvme_timeout+0x59/0x186 >> [ 5480.488835] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e >> [ 5480.488836] [<ffffffff8138691f>] ? blk_mq_rq_timed_out+0x27/0x5e >> [ 5480.488838] [<ffffffff8138816f>] ? bt_for_each+0xaf/0xca >> [ 5480.488840] [<ffffffff81386956>] ? 
blk_mq_rq_timed_out+0x5e/0x5e >> [ 5480.488841] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e >> [ 5480.488842] [<ffffffff813885df>] ? blk_mq_queue_tag_busy_iter+0x7b/0x88 >> [ 5480.488844] [<ffffffff81385677>] ? blk_mq_timeout_work+0x7d/0x102 >> [ 5480.488848] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b >> [ 5480.488850] [<ffffffff81077096>] ? worker_thread+0x26e/0x356 >> [ 5480.488851] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0 >> [ 5480.488853] [<ffffffff8107abe2>] ? kthread+0xc2/0xca >> [ 5480.488855] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e >> [ 5480.488861] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30 >> [ 5480.488862] ---[ end trace 9d1818416c7876e1 ]--- >> [ 5480.489161] nvme nvme7: Removing after probe failure status: -4 >> [ 5480.489181] nvme7n1: detected capacity change from 1000204886016 to 0 >> --snip-- >> >> While running this test, I see fio showing 100% random, 100% read 4K >> IOPS numbers of ~2.9M IOPS, and then when I see it suddenly drop to >> ~2.5M IOPS, I then look at the kernel logs and see the above. >> >> >> Thanks, >> >> Marc >> >>> >>>> --snip-- >>>> [70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting >>>> [70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting >>>> [70961.868670] nvme nvme0: I/O 1011 QID 1 timeout, aborting >>>> [70961.868673] nvme nvme0: I/O 1013 QID 1 timeout, aborting >>>> [70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller >>>> [71022.727229] nvme nvme0: I/O 237 QID 0 timeout, reset controller ^ permalink raw reply [flat|nested] 11+ messages in thread
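For anyone triaging a similar failure, the burst of timeout messages makes it easy to see which controller went quiet. A small sketch that tallies timeout events per controller, using lines in the 4.9-era dmesg format quoted above as sample input (the log file path here is illustrative, not from the thread):

```shell
# Sample dmesg lines in the 4.9-era format shown throughout this thread.
cat > /tmp/nvme-timeouts.log <<'EOF'
[70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
[70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting
[ 5329.400622] nvme nvme7: I/O 930 QID 1 timeout, aborting
[ 5359.413460] nvme nvme7: I/O 930 QID 1 timeout, reset controller
EOF

# Tally timeout events per controller: a drive that has stopped posting
# completions shows up as a burst of timeouts against one controller.
grep -oE 'nvme nvme[0-9]+:.*timeout' /tmp/nvme-timeouts.log \
    | awk -F'[ :]' '{ count[$2]++ } END { for (c in count) print c, count[c] }' \
    | sort
```

Against the sample above this prints one line per controller (`nvme0 2`, `nvme7 2`), which makes it obvious when a single drive, rather than the whole fabric, has gone unresponsive.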
* Linux 4.9.8 + NVMe CiB Issue 2017-02-21 16:42 ` Marc Smith @ 2017-03-09 20:25 ` Marc Smith 2017-03-10 0:27 ` Keith Busch 0 siblings, 1 reply; 11+ messages in thread From: Marc Smith @ 2017-03-09 20:25 UTC (permalink / raw) Hi, We opened a support case with Intel (# 02641251) but we aren't making much progress... they looked at log files from /var/log/ on our system and now seem to be blaming warning/error log lines from Corosync/Pacemaker for the issue. =) We're still able to reproduce this issue under high load... the NVMe drives "drop out" randomly, not always the same drive, when pushing I/O across all of our NVMe drives. We don't mind continuing to push on the Intel NVMe hardware support front, but we're looking for confirmation that, based on the kernel log messages posted, this is believed to be an issue with the NVMe drives themselves (e.g., firmware)? Thanks for your time. --Marc On Tue, Feb 21, 2017 at 11:42 AM, Marc Smith <marc.smith@mcc.edu> wrote: > I thought I had just resolved this, but it choked again... I should have > checked this first, but didn't: the firmware version / available > updates. Our drives were on 8DV101D0 and we updated to 8DV101H0. > The release notes for the firmware even list: > > 8DV101H0 - Multiple PCIe Link Dynamic Equalization updates > - SQ Header Pointer Mismatch fix > - Drive Drop issue under heavy IO testing > - Drive unresponsive during concurrent IO and NVMe-MI Basic Management Commands > - XN022 Assert Fixes > - Sequential read performance improvement at end of life from firmware 8DV101D0 > > I saw the "drive drop issue under heavy IO testing" entry and was hopeful it > was related. The test ran longer before a drive dropped out, but still, > after only a few minutes, several of the NVMe drives > disappeared. > > Should we be opening a support request with Intel (hardware), or is > linux-nvme the appropriate forum? 
> > > Thanks, > > Marc > > > > On Mon, Feb 20, 2017 at 11:43 AM, Marc Smith <marc.smith@mcc.edu> wrote: >> Hi, >> >> A little more information... I attempted the same test using a Linux >> 4.9.8 kernel with a number of debugging options set, but I am unable >> to reproduce the NVMe timeout issue with that kernel configuration. >> With the debug kernel, I'm only able to obtain about 1.5 million IOPS, >> which is expected with all the extra options set. So it doesn't appear >> it's an issue with using all eight of the NVMe devices at once; it >> appears to be load/IO related. It's also a different NVMe drive each time >> that gives the timeout error, so it's not a single device causing this, >> at least from what I can tell. >> >> I also tried the same test on the second server connected to the NVMe >> drives and I'm able to reproduce the failure on that machine as well, >> so I don't think it's related to the physical servers. >> >> Anything else I can check? I realize we'll probably never push 3 >> million IOPS in production, but I'd like to get any stability issues >> sorted out before continuing. =) >> >> Thanks for your time. >> >> >> --Marc >> >> On Thu, Feb 16, 2017 at 10:37 AM, Marc Smith <marc.smith@mcc.edu> wrote: >>> On Wed, Feb 15, 2017 at 4:12 PM, Keith Busch <keith.busch@intel.com> wrote: >>>> On Wed, Feb 15, 2017 at 02:27:13PM -0500, Marc Smith wrote: >>>>> Hi, >>>>> >>>>> I'm testing with a Supermicro SSG-2028R-DN2R40L NVMe CiB >>>>> (cluster-in-a-box) solution. The performance is amazing so far, but I >>>>> experienced an issue during a performance test while using the fio >>>>> tool. >>>>> >>>>> Linux 4.9.8 >>>>> fio 2.14 >>>>> >>>>> We have just (8) NVMe drives in the "enclosure", and it contains two >>>>> server nodes, but right now we're just testing from one of the nodes. 
>>>>> >>>>> This is the command we ran: >>>>> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12 >>>>> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1 >>>>> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1 >>>>> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1 >>>>> >>>>> After a few seconds, we noticed the performance numbers started dropping >>>>> and flaking out. This is what we saw in the kernel logs: >>>> >>>> It looks like your controller stopped posting completions to commands. >>>> >>>> There is some excessive kernel log spamming going on here, but that >>>> fix is already staged for 4.11 inclusion here: >>>> >>>> http://git.infradead.org/nvme.git/commitdiff/7bf7d778620d83f14fcd92d0938fb97c7d78bf19?hp=9a69b0ed6257ae5e71c99bf21ce53f98c558476a >>>> >>>> As to why the driver was triggered to abort IO in the first place, that >>>> appears to be the device not posting completions on time. As far as I >>>> can tell, blk-mq's timeout handling won't mistakenly time out a command >>>> on the initial abort, and the default 30 second timeout should be more >>>> than enough for your workload. >>>> >>>> There does appear to be a small window where blk-mq can miss a completion, >>>> though: blk-mq sets the REQ_ATOM_COMPLETE flag >>>> while running the timeout handler, which blocks a natural completion >>>> from occurring while set. So if a real completion did occur, then that >>>> completion is lost, which forces the subsequent timeout handler to >>>> issue a controller reset. >>>> >>>> But I don't think that's what's happening here. You are getting timeouts >>>> on admin commands as well, so that really looks like your >>>> controller just stopped responding. >>> >>> Thanks for the quick reply; so, I rebooted both nodes that are >>> attached to the NVMe drives, and all (8) NVMe devices appear on both >>> nodes after reboot, including the nvme0 device that had disappeared >>> before. 
I then hammered the nvme0n1 device, by itself, using fio to >>> try and reproduce this same issue, but couldn't do so. I ran fio with >>> huge I/O depths and jobs for hours, and still couldn't reproduce this. >>> >>> Then, I ran the same test again that caused the original issue, >>> running fio across all (8) of the NVMe drives: >>> fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=12 >>> --numjobs=16 --name=/dev/nvme0n1 --name=/dev/nvme1n1 >>> --name=/dev/nvme2n1 --name=/dev/nvme3n1 --name=/dev/nvme4n1 >>> --name=/dev/nvme5n1 --name=/dev/nvme6n1 --name=/dev/nvme7n1 >>> >>> Within a minute or two, I noticed similar kernel messages but with a >>> different NVMe drive this time, and as before, the NVMe drive >>> disappears from both servers. Here are the logs: >>> >>> --snip-- >>> [ 5329.400622] nvme nvme7: I/O 930 QID 1 timeout, aborting >>> [ 5329.400628] nvme nvme7: I/O 931 QID 1 timeout, aborting >>> [ 5329.400630] nvme nvme7: I/O 932 QID 1 timeout, aborting >>> [ 5329.400631] nvme nvme7: I/O 933 QID 1 timeout, aborting >>> [ 5359.413460] nvme nvme7: I/O 930 QID 1 timeout, reset controller >>> [ 5389.618277] nvme nvme7: I/O 130 QID 0 timeout, reset controller >>> [ 5419.927512] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927513] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927515] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927517] blk_update_request: I/O error, dev nvme7n1, sector 635173168 >>> [ 5419.927519] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927521] blk_update_request: I/O error, dev nvme7n1, sector 1500458320 >>> [ 5419.927523] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927524] blk_update_request: I/O error, dev nvme7n1, sector 1402863672 >>> [ 5419.927526] blk_update_request: I/O error, dev nvme7n1, sector 1930822624 >>> [ 5419.927527] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927529] nvme 
nvme7: completing aborted command with status: 0007 >>> [ 5419.927531] blk_update_request: I/O error, dev nvme7n1, sector 1060940040 >>> [ 5419.927532] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927533] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927535] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927536] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927537] blk_update_request: I/O error, dev nvme7n1, sector 697098256 >>> [ 5419.927538] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927540] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927541] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927542] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927543] blk_update_request: I/O error, dev nvme7n1, sector 1803921928 >>> [ 5419.927546] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927548] blk_update_request: I/O error, dev nvme7n1, sector 231221800 >>> [ 5419.927549] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927550] blk_update_request: I/O error, dev nvme7n1, sector 1411290456 >>> [ 5419.927551] blk_update_request: I/O error, dev nvme7n1, sector 1888075000 >>> [ 5419.927553] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927554] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927555] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927556] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927557] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927559] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927560] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927561] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927562] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927564] 
nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927565] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927566] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927568] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927570] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927570] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927571] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927572] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927573] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927575] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927576] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927577] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927578] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927579] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927581] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927582] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927583] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927584] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927585] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927586] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927587] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927590] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927592] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927593] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927595] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927596] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927597] nvme nvme7: completing 
aborted command with status: 0007 >>> [ 5419.927598] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927599] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927600] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927601] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927604] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927606] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927607] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927609] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927610] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927611] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927612] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927615] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927616] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927617] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927619] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927620] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927621] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927621] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927624] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927626] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927627] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927628] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927630] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927631] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927634] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927635] nvme nvme7: completing aborted command with 
status: 0007 >>> [ 5419.927636] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927637] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927639] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927640] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927642] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927644] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927645] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927647] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927648] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927649] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927651] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927654] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927655] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927656] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927657] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927658] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927659] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927660] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927663] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927664] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927665] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927666] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927667] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927668] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927671] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927672] nvme nvme7: completing aborted command with status: 0007 >>> [ 
5419.927674] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927675] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927677] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927678] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927681] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927682] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927683] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927687] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927688] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927691] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927693] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927695] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927699] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927701] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927704] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927706] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927709] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927714] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927718] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927724] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927730] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927733] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927737] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927742] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927745] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927749] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927752] nvme 
nvme7: completing aborted command with status: 0007 >>> [ 5419.927753] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927755] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927757] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927758] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927760] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927764] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927765] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927767] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927769] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927771] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927772] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927774] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927776] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927778] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927780] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927781] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927784] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927790] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927792] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927794] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927796] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927797] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927799] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927801] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927802] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927804] nvme nvme7: completing 
aborted command with status: 0007 >>> [ 5419.927806] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927808] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927810] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927815] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927817] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927819] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927820] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927822] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927824] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927826] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927827] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927829] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927831] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927832] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927835] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927839] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927840] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927842] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927844] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927846] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927847] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927849] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927850] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927852] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927854] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927855] nvme nvme7: completing aborted command with 
status: 0007 >>> [ 5419.927859] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927861] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927863] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927865] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927867] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927868] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927870] nvme nvme7: completing aborted command with status: 0007 >>> [ 5419.927873] nvme nvme7: Abort status: 0x7 >>> [ 5419.927874] nvme nvme7: Abort status: 0x7 >>> [ 5419.927875] nvme nvme7: Abort status: 0x7 >>> [ 5419.927896] nvme nvme7: completing aborted command with status: fffffffc >>> [ 5419.927966] nvme nvme7: Abort status: 0xfffc >>> [ 5419.928223] ------------[ cut here ]------------ >>> [ 5419.928233] WARNING: CPU: 19 PID: 1438 at fs/sysfs/dir.c:31 >>> sysfs_warn_dup+0x53/0x5f >>> [ 5419.928234] sysfs: cannot create duplicate filename >>> '/devices/pci0000:80/0000:80:03.0/0000:82:00.0/0000:83:07.0/0000:87:00.0/nvme/nvme7/cmb' >>> [ 5419.928235] Modules linked in: qla2xxx bonding mlx5_core bna >>> [ 5419.928241] CPU: 19 PID: 1438 Comm: kworker/u114:1 Not tainted >>> 4.9.8-esos.prod #1 >>> [ 5419.928242] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, >>> BIOS 2.0 10/28/2016 >>> [ 5419.928248] Workqueue: nvme nvme_reset_work >>> [ 5419.928250] 0000000000000000 ffffffff81397af8 ffffc9000b00fcf0 >>> 0000000000000000 >>> [ 5419.928252] ffffffff81065550 ffff88085a139000 ffffc9000b00fd48 >>> ffff880858b21d20 >>> [ 5419.928254] ffff88105b34b000 00000000e6000000 ffff881059993938 >>> ffffffff810655af >>> [ 5419.928256] Call Trace: >>> [ 5419.928264] [<ffffffff81397af8>] ? dump_stack+0x46/0x59 >>> [ 5419.928269] [<ffffffff81065550>] ? __warn+0xc8/0xe1 >>> [ 5419.928271] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e >>> [ 5419.928275] [<ffffffff81182ce0>] ? 
kernfs_path_from_node+0x4e/0x58 >>> [ 5419.928277] [<ffffffff81184e5c>] ? sysfs_warn_dup+0x53/0x5f >>> [ 5419.928279] [<ffffffff81184c39>] ? sysfs_add_file_mode_ns+0xd1/0x14d >>> [ 5419.928281] [<ffffffff81184d7c>] ? sysfs_add_file_to_group+0x3c/0x4b >>> [ 5419.928282] [<ffffffff815dae5f>] ? nvme_reset_work+0x415/0xb25 >>> [ 5419.928286] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b >>> [ 5419.928287] [<ffffffff81077096>] ? worker_thread+0x26e/0x356 >>> [ 5419.928289] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0 >>> [ 5419.928292] [<ffffffff810686e8>] ? do_group_exit+0x39/0x91 >>> [ 5419.928295] [<ffffffff8107abe2>] ? kthread+0xc2/0xca >>> [ 5419.928296] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e >>> [ 5419.928300] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30 >>> [ 5419.928310] ---[ end trace 9d1818416c7876e0 ]--- >>> [ 5419.928313] nvme 0000:87:00.0: failed to add sysfs attribute for CMB >>> [ 5420.031107] do_IRQ: 1.66 No irq handler for vector >>> [ 5480.488746] nvme nvme7: I/O 215 QID 0 timeout, disable controller >>> [ 5480.488755] ------------[ cut here ]------------ >>> [ 5480.488766] WARNING: CPU: 19 PID: 1233 at kernel/irq/manage.c:1478 >>> __free_irq+0x93/0x1ed >>> [ 5480.488767] Trying to free already-free IRQ 443 >>> [ 5480.488768] Modules linked in: qla2xxx bonding mlx5_core bna >>> [ 5480.488775] CPU: 19 PID: 1233 Comm: kworker/19:1H Tainted: G >>> W 4.9.8-esos.prod #1 >>> [ 5480.488776] Hardware name: Supermicro SSG-2028R-DN2R40L/X10DSN-TS, >>> BIOS 2.0 10/28/2016 >>> [ 5480.488785] Workqueue: kblockd blk_mq_timeout_work >>> [ 5480.488786] 0000000000000000 ffffffff81397af8 ffffc9000be9fb88 >>> 0000000000000000 >>> [ 5480.488789] ffffffff81065550 ffff880859f08800 ffffc9000be9fbe0 >>> 00000000000001bb >>> [ 5480.488792] ffff880859f0889c ffff8810584abb00 0000000000000246 >>> ffffffff810655af >>> [ 5480.488794] Call Trace: >>> [ 5480.488801] [<ffffffff81397af8>] ? dump_stack+0x46/0x59 >>> [ 5480.488808] [<ffffffff81065550>] ? 
__warn+0xc8/0xe1 >>> [ 5480.488810] [<ffffffff810655af>] ? warn_slowpath_fmt+0x46/0x4e >>> [ 5480.488812] [<ffffffff8139c224>] ? __radix_tree_lookup+0x2c/0x93 >>> [ 5480.488814] [<ffffffff8109d5b0>] ? __free_irq+0x93/0x1ed >>> [ 5480.488816] [<ffffffff8109d7a5>] ? free_irq+0x61/0x72 >>> [ 5480.488823] [<ffffffff815d9e2b>] ? nvme_suspend_queue+0x66/0x6b >>> [ 5480.488825] [<ffffffff815da001>] ? nvme_dev_disable+0x96/0x30b >>> [ 5480.488831] [<ffffffff8148ebd6>] ? dev_warn+0x50/0x58 >>> [ 5480.488833] [<ffffffff815da368>] ? nvme_timeout+0x59/0x186 >>> [ 5480.488835] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e >>> [ 5480.488836] [<ffffffff8138691f>] ? blk_mq_rq_timed_out+0x27/0x5e >>> [ 5480.488838] [<ffffffff8138816f>] ? bt_for_each+0xaf/0xca >>> [ 5480.488840] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e >>> [ 5480.488841] [<ffffffff81386956>] ? blk_mq_rq_timed_out+0x5e/0x5e >>> [ 5480.488842] [<ffffffff813885df>] ? blk_mq_queue_tag_busy_iter+0x7b/0x88 >>> [ 5480.488844] [<ffffffff81385677>] ? blk_mq_timeout_work+0x7d/0x102 >>> [ 5480.488848] [<ffffffff81076a58>] ? process_one_work+0x192/0x29b >>> [ 5480.488850] [<ffffffff81077096>] ? worker_thread+0x26e/0x356 >>> [ 5480.488851] [<ffffffff81076e28>] ? rescuer_thread+0x2a0/0x2a0 >>> [ 5480.488853] [<ffffffff8107abe2>] ? kthread+0xc2/0xca >>> [ 5480.488855] [<ffffffff8107ab20>] ? kthread_park+0x4e/0x4e >>> [ 5480.488861] [<ffffffff81a7ed22>] ? ret_from_fork+0x22/0x30 >>> [ 5480.488862] ---[ end trace 9d1818416c7876e1 ]--- >>> [ 5480.489161] nvme nvme7: Removing after probe failure status: -4 >>> [ 5480.489181] nvme7n1: detected capacity change from 1000204886016 to 0 >>> --snip-- >>> >>> While running this test, I see fio showing 100% random, 100% read 4K >>> IOPS numbers of ~2.9M IOPS, and then when I see it suddenly drop to >>> ~2.5M IOPS, I then look at the kernel logs and see the above. 
>>>
>>>
>>> Thanks,
>>>
>>> Marc
>>>
>>>>
>>>>> --snip--
>>>>> [70961.868655] nvme nvme0: I/O 1009 QID 1 timeout, aborting
>>>>> [70961.868666] nvme nvme0: I/O 1010 QID 1 timeout, aborting
>>>>> [70961.868670] nvme nvme0: I/O 1011 QID 1 timeout, aborting
>>>>> [70961.868673] nvme nvme0: I/O 1013 QID 1 timeout, aborting
>>>>> [70992.073974] nvme nvme0: I/O 1009 QID 1 timeout, reset controller
>>>>> [71022.727229] nvme nvme0: I/O 237 QID 0 timeout, reset controller

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Linux 4.9.8 + NVMe CiB Issue
  2017-03-09 20:25         ` Marc Smith
@ 2017-03-10  0:27           ` Keith Busch
  2017-03-31 13:26             ` Marc Smith
  0 siblings, 1 reply; 11+ messages in thread

From: Keith Busch @ 2017-03-10 0:27 UTC (permalink / raw)

On Thu, Mar 09, 2017 at 03:25:01PM -0500, Marc Smith wrote:
> Hi,
>
> We opened a support case with Intel (# 02641251) but we aren't making
> much progress... they looked at log files from /var/log/ on our system
> and now seem to be blaming warning/error log lines from
> Corosync/Pacemaker for the issue. =)
>
> We're still able to reproduce this issue under high load... the NVMe
> drives "drop out" randomly, not always the same drive, when pushing
> I/O across all of our NVMe drives.
>
> We don't mind continuing to push on the Intel NVMe hardware support
> front, but we're looking for confirmation that, based on the kernel
> log messages posted, this is believed to be an issue with the NVMe
> drives themselves (e.g., firmware)?

This genuinely looks like the drives have stopped responding and needs
to be directed to the drive vendor.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Linux 4.9.8 + NVMe CiB Issue
  2017-03-10  0:27           ` Keith Busch
@ 2017-03-31 13:26             ` Marc Smith
  2017-03-31 14:07               ` Keith Busch
  0 siblings, 1 reply; 11+ messages in thread

From: Marc Smith @ 2017-03-31 13:26 UTC (permalink / raw)

Just an update in case anyone else stumbles upon this thread with the
same issue: Intel reviewed the logs from the drives and determined the
NVMe drives are hitting their maximum operating temperature and then
shutting down.

--Marc

On Thu, Mar 9, 2017 at 7:27 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Thu, Mar 09, 2017 at 03:25:01PM -0500, Marc Smith wrote:
>> Hi,
>>
>> We opened a support case with Intel (# 02641251) but we aren't making
>> much progress... they looked at log files from /var/log/ on our system
>> and now seem to be blaming warning/error log lines from
>> Corosync/Pacemaker for the issue. =)
>>
>> We're still able to reproduce this issue under high load... the NVMe
>> drives "drop out" randomly, not always the same drive, when pushing
>> I/O across all of our NVMe drives.
>>
>> We don't mind continuing to push on the Intel NVMe hardware support
>> front, but we're looking for confirmation that, based on the kernel
>> log messages posted, this is believed to be an issue with the NVMe
>> drives themselves (e.g., firmware)?
>
> This genuinely looks like the drives have stopped responding and needs
> to be directed to the drive vendor.

^ permalink raw reply	[flat|nested] 11+ messages in thread
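For anyone else chasing the same symptom: the thermal history Intel read from the drives is also visible from the host via nvme-cli's `nvme smart-log` command. A minimal sketch of pulling the thermal fields out of that text output is below; the sample text is illustrative (field names match typical nvme-cli output, values are made up, not from Marc's drives):

```python
import re

def parse_smart_log(text):
    """Pull the thermal fields out of `nvme smart-log` text output.

    Returns a dict with the composite temperature (deg C) and the
    warning/critical composite temperature time counters (minutes),
    when present in the output.
    """
    patterns = {
        "temperature_c": r"^temperature\s*:\s*(\d+)\s*C",
        "warning_temp_minutes": r"^Warning Temperature Time\s*:\s*(\d+)",
        "critical_temp_minutes": r"^Critical Composite Temperature Time\s*:\s*(\d+)",
    }
    fields = {}
    for key, pat in patterns.items():
        m = re.search(pat, text, re.MULTILINE | re.IGNORECASE)
        if m:
            fields[key] = int(m.group(1))
    return fields

# Illustrative sample resembling `nvme smart-log /dev/nvme0` output;
# in practice you would feed in subprocess output from nvme-cli.
sample = """\
critical_warning                    : 0
temperature                         : 72 C
Warning Temperature Time            : 14
Critical Composite Temperature Time : 2
"""

info = parse_smart_log(sample)
print(info)
# prints: {'temperature_c': 72, 'warning_temp_minutes': 14, 'critical_temp_minutes': 2}
```

Nonzero temperature-time counters are a strong hint the drive has spent time above its thermal thresholds, which matches the shutdown behaviour Intel described.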
* Linux 4.9.8 + NVMe CiB Issue
  2017-03-31 13:26             ` Marc Smith
@ 2017-03-31 14:07               ` Keith Busch
  2017-03-31 14:33                 ` Marc Smith
  0 siblings, 1 reply; 11+ messages in thread

From: Keith Busch @ 2017-03-31 14:07 UTC (permalink / raw)

On Fri, Mar 31, 2017 at 09:26:32AM -0400, Marc Smith wrote:
> NVMe drives are hitting their maximum operating temperature and then
> shutting down.

That's not cool.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Linux 4.9.8 + NVMe CiB Issue
  2017-03-31 14:07               ` Keith Busch
@ 2017-03-31 14:33                 ` Marc Smith
  2017-04-01  7:13                   ` Dennis Mungai
  0 siblings, 1 reply; 11+ messages in thread

From: Marc Smith @ 2017-03-31 14:33 UTC (permalink / raw)

Exactly! =)

Thanks for your help.

--Marc

On Fri, Mar 31, 2017 at 10:07 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Fri, Mar 31, 2017 at 09:26:32AM -0400, Marc Smith wrote:
>> NVMe drives are hitting their maximum operating temperature and then
>> shutting down.
>
> That's not cool.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Linux 4.9.8 + NVMe CiB Issue
  2017-03-31 14:33                 ` Marc Smith
@ 2017-04-01  7:13                   ` Dennis Mungai
  0 siblings, 0 replies; 11+ messages in thread

From: Dennis Mungai @ 2017-04-01 7:13 UTC (permalink / raw)

And to add a few comments on this:

NVMe drives run hot, and one of their safety mechanisms is to first
throttle (which drops I/O performance) in an attempt to lower the
operating temperature, and then, if all else fails, shut down
completely.

The problem the user is encountering here has to do with cooling. Let
me explain: as a systems builder, one thing I've noticed with worrying
concern is how quickly NVMe SSDs heat up significantly under load, yet
so little is done by vendors to include either proper heat sink
assemblies or throttling mechanisms in hardware (I'm looking at you,
Samsung and Toshiba!), compounded by hilarious firmware bugs such as
declaring support for asynchronous TRIM when in fact it's not
supported, resulting in unneeded blacklists.

The solution here, for the end user, is simple:

1. On installation, deploy the NVMe SSDs in such a way that they have
adequate cooling and airflow. If that's not possible, acquire a PCIe
adapter, as its clearance guarantees the airflow that is critical for
cooling.

2. Use a third-party silicone cooling strip on the drive. These act as
good heat sinks that can keep the drive running under prolonged load.
A simple Google search will show you what to get.

And for manufacturers:

1. Implement heat sink shields on these drives. Samsung has done an
excellent job on the 960 Evo and Pro NVMes; even in a closed-up space
such as a gaming laptop, the heat sink assembly keeps the drive below
the thermal throttling point.

2. Firmware: please implement throttling mechanisms that trip at
reasonable temperatures. I've seen a drive out here (hint: Phison
controller based) set to trip at 85 degrees, meaning that it often
tends to shut down first before the thermal throttle kicks in.

These hints might help.

Regards,

Dennis.
On 31 March 2017 at 17:33, Marc Smith <marc.smith@mcc.edu> wrote:
> Exactly! =)
>
> Thanks for your help.
>
> --Marc
>
> On Fri, Mar 31, 2017 at 10:07 AM, Keith Busch <keith.busch@intel.com> wrote:
>> On Fri, Mar 31, 2017 at 09:26:32AM -0400, Marc Smith wrote:
>>> NVMe drives are hitting their maximum operating temperature and then
>>> shutting down.
>>
>> That's not cool.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 11+ messages in thread
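The two-stage protection Dennis describes (throttle first, shut down as a last resort) can be sketched as a toy state function. The threshold values below are illustrative assumptions for demonstration only, not taken from any vendor datasheet; real trip points vary per drive and are reported in the controller's Identify data:

```python
# Toy model of two-stage NVMe thermal protection: throttle first
# (performance drops while the drive cools), shut down completely if
# the temperature keeps climbing. THROTTLE_C and SHUTDOWN_C are
# assumed example values, not from any drive's datasheet.
THROTTLE_C = 70
SHUTDOWN_C = 85

def drive_state(temp_c):
    """Classify a composite temperature reading in degrees Celsius."""
    if temp_c >= SHUTDOWN_C:
        return "shutdown"    # drive drops off the bus; I/O errors follow
    if temp_c >= THROTTLE_C:
        return "throttling"  # IOPS fall while the drive tries to cool
    return "normal"

for t in (55, 72, 86):
    print(t, drive_state(t))
# prints:
# 55 normal
# 72 throttling
# 86 shutdown
```

A drive whose firmware sets the two thresholds too close together, as in the Phison example above, effectively skips the throttling stage and goes straight to shutdown, which is what the fio test in this thread appears to have triggered.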
end of thread, other threads:[~2017-04-01  7:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-15 19:27 Linux 4.9.8 + NVMe CiB Issue Marc Smith
2017-02-15 21:12 ` Keith Busch
2017-02-16 15:37   ` Marc Smith
2017-02-20 16:43     ` Marc Smith
2017-02-21 16:42       ` Marc Smith
2017-03-09 20:25         ` Marc Smith
2017-03-10  0:27           ` Keith Busch
2017-03-31 13:26             ` Marc Smith
2017-03-31 14:07               ` Keith Busch
2017-03-31 14:33                 ` Marc Smith
2017-04-01  7:13                   ` Dennis Mungai