* [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer @ 2021-04-30 16:40 Yi Zhang 2021-05-01 0:55 ` Sagi Grimberg 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-04-30 16:40 UTC (permalink / raw) To: linux-nvme; +Cc: skt-results-master, Bruno Goncalves Hello, Recently CKI reproduced this WARNING and NULL pointer dereference with linux-block/for-next on aarch64; it looks like a regression. I will try to bisect the culprit. blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns) [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36 [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 1879.819087] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd. [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0 [ 1879.838079] nvme nvme0: creating 128 I/O queues. [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1" [ 1880.879731] nvme nvme0: resetting controller [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! [ 1889.946377] nvmet: ctrl 1 fatal error occurred!
[ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" [ 1892.810813] - [ 1892.815427] WARNING: CPU: 30 PID: 5492 at drivers/nvme/target/loop.c:466 nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop] [ 1892.826293] Modules linked in: nvme_loop nvme_fabrics nvme_core nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser vfat libiscsi fat scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs i2c_smbus ib_core crct10dif_ce ghash_ce sha1_ce acpi_ipmi ipmi_ssif ipmi_devintf ipmi_msghandler thunderx2_pmu ip_tables xfs libcrc32c sr_mod cdrom mlx5_core ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm uas mlxfw sha2_ce tls sha256_arm64 usb_storage sg psample gpio_xlp i2c_xlp9xx dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvmet] [ 1892.885150] CPU: 30 PID: 5492 Comm: kworker/u513:5 Not tainted 5.12.0+ #1 [ 1892.891926] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 [ 1892.901654] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop] [ 1892.908519] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) [ 1892.914513] pc : nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop] [ 1892.920508] lr : nvme_loop_reset_ctrl_work+0x40/0xf0 [nvme_loop] [ 1892.926502] sp : fffffe0031b6fd70 [ 1892.929803] x29: fffffe0031b6fd70 x28: 0000000000000000 [ 1892.935105] x27: fffffc081065c0c0 x26: fffffc0807c2c26c [ 1892.940405] x25: 0000000000000000 x24: fffffc084ae24898 [ 1892.945705] x23: 0000000000000000 x22: fffffc09410c1d00 [ 1892.951004] x21: fffffc084ae24890 x20: fffffc084ae244a0 [ 1892.956305] x19: fffffc084ae24000 x18: 0000000000000012 [ 1892.961604] x17: 0000000000000001 x16: 0000000000000019 [ 1892.966904] x15: fffffe0011d7e7e0 x14: fffffc8ba09cedf8 [ 1892.972204] x13: 0000000000000000 x12: 0000000000000003 [ 1892.977504] x11: fffffc8ba09ced40 x10: fffffe0011d7e7e8 [ 1892.982804] x9 : fffffe000ad60c58 
x8 : fffffe8b650f0000 [ 1892.988104] x7 : 0000000000000008 x6 : fffffc0000000000 [ 1892.993403] x5 : 0000000000000000 x4 : ffffffff22f337e0 [ 1892.998703] x3 : fffffc084ae244ac x2 : 0000000000000001 [ 1893.004003] x1 : fffffc084ae244ac x0 : 0000000000000000 [ 1893.009303] Call trace: [ 1893.011737] nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop] [ 1893.017384] process_one_work+0x1d0/0x438 [ 1893.021385] worker_thread+0x1f8/0x4d8 [ 1893.025123] kthread+0x114/0x118 [ 1893.028341] ret_from_fork+0x10/0x18 [ 1893.031907] ---[ end trace 883109425327ab60 ]--- [ 1893.301843] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [ 1893.310620] Mem abort info: [ 1893.313401] ESR = 0x96000006 [ 1893.316442] EC = 0x25: DABT (current EL), IL = 32 bits [ 1893.321741] SET = 0, FnV = 0 [ 1893.324783] EA = 0, S1PTW = 0 [ 1893.327911] Data abort info: [ 1893.330778] ISV = 0, ISS = 0x00000006 [ 1893.334600] CM = 0, WnR = 0 [ 1893.337555] user pgtable: 64k pages, 42-bit VAs, pgdp=0000000a0b750000 [ 1893.344069] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000 [ 1893.354669] Internal error: Oops: 96000006 [#1] SMP [ 1893.359535] Modules linked in: nvme_loop nvme_fabrics nvme_core nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_ ib_ipoib iw_cm intf ipmi_msghandrm_kms_helper s64 usb_storage s.418386] CPU: 0 PID: 12871 Comm: kworker/u513:0 Tainted: G W 5.12.0+ #1 [ 1893.426551] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 [ 1893.436277] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] [ 1893.442892] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--) [ 1893.448886] pc : percpu_ref_kill_and_confirm+0x15c/0x178 [ 1893.454189] lr : nvmet_sq_destroy+0xec/0x1f0 [nvmet] [ 1893.459149] sp : fffffe006466fc70 [ 1893.462449] x29: fffffe006466fc70 x28: 0000000000000000 [ 1893.467750] x27: fffffc8bcee4e5c0 x26: fffffc0807f57a6c [ 1893.473051] x25: 0000000000000000 
x24: fffffc084ae248b8 [ 1893.478351] x23: fffffc0854e600d0 x22: fffffe000ac40600 [ 1893.483650] x21: 0000000000000000 x20: fffffc0854e60090 [ 1893.488950] x19: fffffe0011d7e000 x18: 0000000000000016 [ 1893.494250] x17: 0000000000000001 x16: fffffc084e210598 [ 1893.499550] x15: fffffe0011d7e7e0 x14: fffffc8ba09cee18 [ 1893.504850] x13: 0000000000000000 x12: fffffe006466fc68 [ 1893.510149] x11: 0000000000000040 x10: fffffe001158b2f8 [ 1893.515449] x9 : fffffe000ad608ec x8 : fffffc082000cfc0 [ 1893.520749] x7 : 0000000000000001 x6 : fffffc0000000000 [ 1893.526049] x5 : 0000000000000080 x4 : 0000000000000000 [ 1893.531349] x3 : 0000000000000000 x2 : fffffe0011a4e000 [ 1893.536649] x1 : fffffe0010bdec08 x0 : fffffe0010e95000 [ 1893.541949] Call trace: [ 1893.544383] percpu_ref_kill_and_confirm+0x15c/0x178 [ 1893.549335] nvmet_sq_destroy+0xec/0x1f0 [nvmet] [ 1893.553945] nvme_loop_destroy_io_queues+0x64/0x90 [nvme_loop] [ 1893.559767] nvme_loop_shutdown_ctrl+0x60/0xb8 [nvme_loop] [ 1893.565240] nvme_loop_delete_ctrl_host+0x18/0x20 [nvme_loop] [ 1893.570973] nvme_do_delete_ctrl+0x58/0x6c [nvme_core] [ 1893.576106] nvme_delete_ctrl_work+0x18/0x38 [nvme_core] [ 1893.581411] process_one_work+0x1d0/0x438 [ 1893.585410] worker_thread+0x150/0x4d8 [ 1893.589148] kthread+0x114/0x118 [ 1893.592364] ret_from_fork+0x10/0x18 [ 1893.595929] Code: 39244840 f0003781 d0004d40 91302021 (f9400462) [ 1893.602139] ---[ end trace 883109425327ab61 ]--- [ 1893.606744] Kernel panic - not syncing: Oops: Fatal exception [ 1893.612512] SMP: stopping secondary CPUs [ 1893.616474] Kernel Offset: disabled [ 1893.619949] CPU features: 0x00046002,63000838 [ 1893.624293] Memory Limit: none [ 1893.627362] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]--- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-04-30 16:40 [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer Yi Zhang @ 2021-05-01 0:55 ` Sagi Grimberg 2021-05-01 9:58 ` Yi Zhang 0 siblings, 1 reply; 13+ messages in thread From: Sagi Grimberg @ 2021-05-01 0:55 UTC (permalink / raw) To: Yi Zhang, linux-nvme; +Cc: skt-results-master, Bruno Goncalves > Hello > Recently CKI reproduced this WARNING and NULL pointer with > linux-block/for-next on aarch64, seems it's one regression, I will try > if I can bisect the culprit. > > blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns) > > [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36 > [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > [ 1879.819087] nvmet: creating controller 1 for subsystem > blktests-subsystem-1 for NQN > nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd. > [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0 > [ 1879.838079] nvme nvme0: creating 128 I/O queues. > [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1" > [ 1880.879731] nvme nvme0: resetting controller > [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > [ 1889.946377] nvmet: ctrl 1 fatal error occurred! > [ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" It appears that we now somehow expire the KATO after/during a reset sequence and then race the reset against the removal... a bisection will definitely help. 
> [ 1892.810813] - > [ 1892.815427] WARNING: CPU: 30 PID: 5492 at > drivers/nvme/target/loop.c:466 nvme_loop_reset_ctrl_work+0x48/0xf0 > [nvme_loop] > [ 1892.826293] Modules linked in: nvme_loop nvme_fabrics nvme_core > nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert > iscsi_target_mod target_core_mod ib_iser vfat libiscsi fat > scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib > ib_uverbs i2c_smbus ib_core crct10dif_ce ghash_ce sha1_ce acpi_ipmi > ipmi_ssif ipmi_devintf ipmi_msghandler thunderx2_pmu ip_tables xfs > libcrc32c sr_mod cdrom mlx5_core ast i2c_algo_bit drm_vram_helper > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops > drm_ttm_helper ttm drm uas mlxfw sha2_ce tls sha256_arm64 usb_storage > sg psample gpio_xlp i2c_xlp9xx dm_mirror dm_region_hash dm_log dm_mod > [last unloaded: nvmet] > [ 1892.885150] CPU: 30 PID: 5492 Comm: kworker/u513:5 Not tainted 5.12.0+ #1 > [ 1892.891926] Hardware name: HPE Apollo 70 /C01_APACHE_MB > , BIOS L50_5.13_1.16 07/29/2020 > [ 1892.901654] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop] > [ 1892.908519] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) > [ 1892.914513] pc : nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop] > [ 1892.920508] lr : nvme_loop_reset_ctrl_work+0x40/0xf0 [nvme_loop] > [ 1892.926502] sp : fffffe0031b6fd70 > [ 1892.929803] x29: fffffe0031b6fd70 x28: 0000000000000000 > [ 1892.935105] x27: fffffc081065c0c0 x26: fffffc0807c2c26c > [ 1892.940405] x25: 0000000000000000 x24: fffffc084ae24898 > [ 1892.945705] x23: 0000000000000000 x22: fffffc09410c1d00 > [ 1892.951004] x21: fffffc084ae24890 x20: fffffc084ae244a0 > [ 1892.956305] x19: fffffc084ae24000 x18: 0000000000000012 > [ 1892.961604] x17: 0000000000000001 x16: 0000000000000019 > [ 1892.966904] x15: fffffe0011d7e7e0 x14: fffffc8ba09cedf8 > [ 1892.972204] x13: 0000000000000000 x12: 0000000000000003 > [ 1892.977504] x11: fffffc8ba09ced40 x10: fffffe0011d7e7e8 > [ 1892.982804] x9 : fffffe000ad60c58 
x8 : fffffe8b650f0000 > [ 1892.988104] x7 : 0000000000000008 x6 : fffffc0000000000 > [ 1892.993403] x5 : 0000000000000000 x4 : ffffffff22f337e0 > [ 1892.998703] x3 : fffffc084ae244ac x2 : 0000000000000001 > [ 1893.004003] x1 : fffffc084ae244ac x0 : 0000000000000000 > [ 1893.009303] Call trace: > [ 1893.011737] nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop] > [ 1893.017384] process_one_work+0x1d0/0x438 > [ 1893.021385] worker_thread+0x1f8/0x4d8 > [ 1893.025123] kthread+0x114/0x118 > [ 1893.028341] ret_from_fork+0x10/0x18 > [ 1893.031907] ---[ end trace 883109425327ab60 ]--- > [ 1893.301843] Unable to handle kernel NULL pointer dereference at > virtual address 0000000000000008 > [ 1893.310620] Mem abort info: > [ 1893.313401] ESR = 0x96000006 > [ 1893.316442] EC = 0x25: DABT (current EL), IL = 32 bits > [ 1893.321741] SET = 0, FnV = 0 > [ 1893.324783] EA = 0, S1PTW = 0 > [ 1893.327911] Data abort info: > [ 1893.330778] ISV = 0, ISS = 0x00000006 > [ 1893.334600] CM = 0, WnR = 0 > [ 1893.337555] user pgtable: 64k pages, 42-bit VAs, pgdp=0000000a0b750000 > [ 1893.344069] [0000000000000008] pgd=0000000000000000, > p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000 > [ 1893.354669] Internal error: Oops: 96000006 [#1] SMP > [ 1893.359535] Modules linked in: nvme_loop nvme_fabrics nvme_core > nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_ ib_ipoib iw_cm > intf ipmi_msghandrm_kms_helper s64 usb_storage s.418386] CPU: 0 PID: > 12871 Comm: kworker/u513:0 Tainted: G W 5.12.0+ #1 > [ 1893.426551] Hardware name: HPE Apollo 70 /C01_APACHE_MB > , BIOS L50_5.13_1.16 07/29/2020 > [ 1893.436277] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] > [ 1893.442892] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--) > [ 1893.448886] pc : percpu_ref_kill_and_confirm+0x15c/0x178 > [ 1893.454189] lr : nvmet_sq_destroy+0xec/0x1f0 [nvmet] > [ 1893.459149] sp : fffffe006466fc70 > [ 1893.462449] x29: fffffe006466fc70 x28: 0000000000000000 > [ 1893.467750] x27: 
fffffc8bcee4e5c0 x26: fffffc0807f57a6c > [ 1893.473051] x25: 0000000000000000 x24: fffffc084ae248b8 > [ 1893.478351] x23: fffffc0854e600d0 x22: fffffe000ac40600 > [ 1893.483650] x21: 0000000000000000 x20: fffffc0854e60090 > [ 1893.488950] x19: fffffe0011d7e000 x18: 0000000000000016 > [ 1893.494250] x17: 0000000000000001 x16: fffffc084e210598 > [ 1893.499550] x15: fffffe0011d7e7e0 x14: fffffc8ba09cee18 > [ 1893.504850] x13: 0000000000000000 x12: fffffe006466fc68 > [ 1893.510149] x11: 0000000000000040 x10: fffffe001158b2f8 > [ 1893.515449] x9 : fffffe000ad608ec x8 : fffffc082000cfc0 > [ 1893.520749] x7 : 0000000000000001 x6 : fffffc0000000000 > [ 1893.526049] x5 : 0000000000000080 x4 : 0000000000000000 > [ 1893.531349] x3 : 0000000000000000 x2 : fffffe0011a4e000 > [ 1893.536649] x1 : fffffe0010bdec08 x0 : fffffe0010e95000 > [ 1893.541949] Call trace: > [ 1893.544383] percpu_ref_kill_and_confirm+0x15c/0x178 > [ 1893.549335] nvmet_sq_destroy+0xec/0x1f0 [nvmet] > [ 1893.553945] nvme_loop_destroy_io_queues+0x64/0x90 [nvme_loop] > [ 1893.559767] nvme_loop_shutdown_ctrl+0x60/0xb8 [nvme_loop] > [ 1893.565240] nvme_loop_delete_ctrl_host+0x18/0x20 [nvme_loop] > [ 1893.570973] nvme_do_delete_ctrl+0x58/0x6c [nvme_core] > [ 1893.576106] nvme_delete_ctrl_work+0x18/0x38 [nvme_core] > [ 1893.581411] process_one_work+0x1d0/0x438 > [ 1893.585410] worker_thread+0x150/0x4d8 > [ 1893.589148] kthread+0x114/0x118 > [ 1893.592364] ret_from_fork+0x10/0x18 > [ 1893.595929] Code: 39244840 f0003781 d0004d40 91302021 (f9400462) > [ 1893.602139] ---[ end trace 883109425327ab61 ]--- > [ 1893.606744] Kernel panic - not syncing: Oops: Fatal exception > [ 1893.612512] SMP: stopping secondary CPUs > [ 1893.616474] Kernel Offset: disabled > [ 1893.619949] CPU features: 0x00046002,63000838 > [ 1893.624293] Memory Limit: none > [ 1893.627362] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]--- > > > _______________________________________________ > Linux-nvme mailing list > 
Linux-nvme@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-nvme > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-01 0:55 ` Sagi Grimberg @ 2021-05-01 9:58 ` Yi Zhang 2021-05-07 8:35 ` Yi Zhang 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-05-01 9:58 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, skt-results-master, Bruno Goncalves On Sat, May 1, 2021 at 8:55 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > Hello > > Recently CKI reproduced this WARNING and NULL pointer with > > linux-block/for-next on aarch64, seems it's one regression, I will try > > if I can bisect the culprit. > > > > blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns) > > > > [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36 > > [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > > [ 1879.819087] nvmet: creating controller 1 for subsystem > > blktests-subsystem-1 for NQN > > nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd. > > [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0 > > [ 1879.838079] nvme nvme0: creating 128 I/O queues. > > [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1" > > [ 1880.879731] nvme nvme0: resetting controller > > [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > > [ 1889.946377] nvmet: ctrl 1 fatal error occurred! > > [ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > It appears that we are somehow now expire the kato after/during a reset > sequence and then seem to race reset and remove... > > bisection will help definitely. > Hi Sagi Bisect shows the below commit was the first bad commit. 
commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) Author: Hannes Reinecke <hare@suse.de> Date: Fri Apr 16 13:46:20 2021 +0200 nvme: sanitize KATO setting Thanks Yi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-01 9:58 ` Yi Zhang @ 2021-05-07 8:35 ` Yi Zhang 2021-05-07 19:50 ` Sagi Grimberg 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-05-07 8:35 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, skt-results-master, Bruno Goncalves, hare On Sat, May 1, 2021 at 5:58 PM Yi Zhang <yi.zhang@redhat.com> wrote: > > On Sat, May 1, 2021 at 8:55 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > > > > Hello > > > Recently CKI reproduced this WARNING and NULL pointer with > > > linux-block/for-next on aarch64, seems it's one regression, I will try > > > if I can bisect the culprit. > > > > > > blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns) > > > > > > [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36 > > > [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > > > [ 1879.819087] nvmet: creating controller 1 for subsystem > > > blktests-subsystem-1 for NQN > > > nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd. > > > [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0 > > > [ 1879.838079] nvme nvme0: creating 128 I/O queues. > > > [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1" > > > [ 1880.879731] nvme nvme0: resetting controller > > > [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > > > [ 1889.946377] nvmet: ctrl 1 fatal error occurred! > > > [ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > > > It appears that we are somehow now expire the kato after/during a reset > > sequence and then seem to race reset and remove... > > > > bisection will help definitely. > > > Hi Sagi > > Bisect shows bellow commits was the first bad commit. 
> > commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) > Author: Hannes Reinecke <hare@suse.de> > Date: Fri Apr 16 13:46:20 2021 +0200 > > nvme: sanitize KATO setting > Hi Sagi This issue has been consistently reproduced with blktests on recent linux-block/for-next; do you have a chance to check it? > Thanks > Yi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-07 8:35 ` Yi Zhang @ 2021-05-07 19:50 ` Sagi Grimberg 2021-05-09 8:44 ` Hannes Reinecke 0 siblings, 1 reply; 13+ messages in thread From: Sagi Grimberg @ 2021-05-07 19:50 UTC (permalink / raw) To: Yi Zhang; +Cc: linux-nvme, skt-results-master, Bruno Goncalves, hare >> Hi Sagi >> >> Bisect shows bellow commits was the first bad commit. >> >> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) >> Author: Hannes Reinecke <hare@suse.de> >> Date: Fri Apr 16 13:46:20 2021 +0200 >> >> nvme: sanitize KATO setting >> > > Hi Sagi > This issue has been consistently reproduced with blktests on recent > linux-block/for-next, do you have a chance to check it? I had not; Hannes, can you have a look? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-07 19:50 ` Sagi Grimberg @ 2021-05-09 8:44 ` Hannes Reinecke 2021-05-12 0:32 ` Yi Zhang 0 siblings, 1 reply; 13+ messages in thread From: Hannes Reinecke @ 2021-05-09 8:44 UTC (permalink / raw) To: Sagi Grimberg, Yi Zhang; +Cc: linux-nvme, skt-results-master, Bruno Goncalves On 5/7/21 9:50 PM, Sagi Grimberg wrote: > >>> Hi Sagi >>> >>> Bisect shows bellow commits was the first bad commit. >>> >>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) >>> Author: Hannes Reinecke <hare@suse.de> >>> Date: Fri Apr 16 13:46:20 2021 +0200 >>> >>> nvme: sanitize KATO setting >>> >> >> Hi Sagi >> This issue has been consistently reproduced with blktests on recent >> linux-block/for-next, do you have a chance to check it? > > I had not, Hannes can you have a look? > Can you check if this patch helps?

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 74b3b150e1a5..a48d8219cf90 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct work_struct *work)
 		container_of(work, struct nvme_loop_ctrl, ctrl.reset_work);
 	int ret;
 
+	/* reset_work might run concurrently to delete_work */
+	if (ctrl->state == NVME_CTRL_DELETING)
+		return;
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_loop_shutdown_ctrl(ctrl);

Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-09 8:44 ` Hannes Reinecke @ 2021-05-12 0:32 ` Yi Zhang 2021-05-19 0:36 ` Yi Zhang 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-05-12 0:32 UTC (permalink / raw) To: Hannes Reinecke Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves On Sun, May 9, 2021 at 4:44 PM Hannes Reinecke <hare@suse.de> wrote: > > On 5/7/21 9:50 PM, Sagi Grimberg wrote: > > > >>> Hi Sagi > >>> > >>> Bisect shows bellow commits was the first bad commit. > >>> > >>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) > >>> Author: Hannes Reinecke <hare@suse.de> > >>> Date: Fri Apr 16 13:46:20 2021 +0200 > >>> > >>> nvme: sanitize KATO setting > >>> > >> > >> Hi Sagi > >> This issue has been consistently reproduced with blktests on recent > >> linux-block/for-next, do you have a chance to check it? > > > > I had not, Hannes can you have a look? > > > Can you check if this patch helps? > > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c > index 74b3b150e1a5..a48d8219cf90 100644 > --- a/drivers/nvme/target/loop.c > +++ b/drivers/nvme/target/loop.c > @@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct > work_struct *work) > container_of(work, struct nvme_loop_ctrl, ctrl.reset_work); > int ret; > > + /* reset_work might run concurrently to delete_work */ > + if (ctrl->state == NVME_CTRL_DELETING) > + return; > nvme_stop_ctrl(&ctrl->ctrl); > nvme_loop_shutdown_ctrl(ctrl); > Hi Hannes The issue can still be reproduced with this patch; here is the full log: <12>[ 213.737602] run blktests nvme/022 at 2021-05-11 20:29:23 <6>[ 213.857826] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 <6>[ 213.879592] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:7bc1767264664586a5ebf3ec1cdca2b9. <6>[ 213.893281] nvme nvme0: creating 128 I/O queues. 
<6>[ 213.910709] nvme nvme0: new ctrl: "blktests-subsystem-1" <4>[ 214.943818] nvme nvme0: resetting controller <3>[ 224.004540] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! <3>[ 224.010452] nvmet: ctrl 1 fatal error occurred! <6>[ 224.014997] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" <4>[ 227.734741] ------------[ cut here ]------------ <4>[ 227.739348] WARNING: CPU: 110 PID: 1648 at drivers/nvme/target/loop.c:469 nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] <4>[ 227.750386] Modules linked in: nvme_loop nvme_fabrics nvmet nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc coresight_etm4x i2c_smbus coresight_tmc coresight_replicator coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas usb_storage aes_neon_bs <4>[ 227.809227] CPU: 110 PID: 1648 Comm: kworker/u513:2 Not tainted 5.13.0-rc1.fix+ #1 <4>[ 227.816784] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 <4>[ 227.826511] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop] <4>[ 227.833376] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) <4>[ 227.839370] pc : nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] <4>[ 227.845451] lr : nvme_loop_reset_ctrl_work+0x54/0x110 [nvme_loop] <4>[ 227.851532] sp : ffff800020c43d80 <4>[ 227.854833] x29: ffff800020c43d80 x28: 0000000000000000 x27: ffff00080d1b73c0 <4>[ 227.861959] x26: ffff00080d1fb76c x25: 0000000000000000 x24: ffff0008126be8c0 <4>[ 227.869082] x23: 0000000000000000 x22: ffff000826eafb00 x21: ffff0008126be4b0 <4>[ 227.876205] x20: ffff0008126be000 x19: ffff0008126be8b8 x18: 
0000000000000020 <4>[ 227.883328] x17: 00000000f911e638 x16: 0000000032ecb495 x15: 000000000000000f <4>[ 227.890452] x14: 0000000000000003 x13: 0000000000000000 x12: ffff008b822d32d0 <4>[ 227.897575] x11: ffff008b822d3230 x10: 0000000000000001 x9 : ffff800009058e74 <4>[ 227.904699] x8 : ffff008b7cfeaab8 x7 : fffffffffffc0000 x6 : ffff000000000000 <4>[ 227.911822] x5 : 000000008040002a x4 : 0000000000000000 x3 : ffff0008126be4bc <4>[ 227.918946] x2 : 0000000000000001 x1 : ffff0008126be4bc x0 : 0000000000000000 <4>[ 227.926069] Call trace: <4>[ 227.928503] nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] <4>[ 227.934237] process_one_work+0x1f0/0x4ac <4>[ 227.938239] worker_thread+0x22c/0x40c <4>[ 227.941976] kthread+0x11c/0x120 <4>[ 227.945193] ret_from_fork+0x10/0x18 <4>[ 227.948759] ---[ end trace 57e2d1ace7d39024 ]--- <1>[ 228.355440] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 <1>[ 228.364215] Mem abort info: <1>[ 228.366996] ESR = 0x96000004 <1>[ 228.370037] EC = 0x25: DABT (current EL), IL = 32 bits <1>[ 228.375336] SET = 0, FnV = 0 <1>[ 228.378377] EA = 0, S1PTW = 0 <1>[ 228.381505] Data abort info: <1>[ 228.384372] ISV = 0, ISS = 0x00000004 <1>[ 228.388194] CM = 0, WnR = 0 <1>[ 228.391149] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000897413000 <1>[ 228.397576] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000 <0>[ 228.404354] Internal error: Oops: 96000004 [#1] SMP <4>[ 228.409220] Modules linked in: nvme_loop nvme_fabrics nvmet nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc coresight_etm4x i2c_smbus coresight_tmc coresight_replicator coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas usb_storage aes_neon_bs <4>[ 228.468051] CPU: 5 PID: 2015 Comm: kworker/u513:7 Tainted: G W 5.13.0-rc1.fix+ #1 <4>[ 228.476822] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 <4>[ 228.486548] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] <4>[ 228.493166] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--) <4>[ 228.499160] pc : percpu_ref_kill_and_confirm+0x94/0xb0 <4>[ 228.504288] lr : percpu_ref_kill_and_confirm+0x20/0xb0 <4>[ 228.509414] sp : ffff800020f63c90 <4>[ 228.512715] x29: ffff800020f63c90 x28: 0000000000000000 x27: ffff0008214969c0 <4>[ 228.519840] x26: ffff000809dc8b6c x25: 0000000000000000 x24: ffff00084f1d00d0 <4>[ 228.526963] x23: ffff00084f1d00b0 x22: ffff0008126be130 x21: ffff80000939a590 <4>[ 228.534087] x20: 0000000000000000 x19: ffff00084f1d0090 x18: 0000000000000020 <4>[ 228.541210] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000013 <4>[ 228.548334] x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000040 <4>[ 228.555457] x11: ffff000800401918 x10: ffff00080040191a x9 : ffff8000090589ac <4>[ 228.562580] x8 : ffff008b7cfeaab8 x7 : 0000000000000018 x6 : ffff000000000000 <4>[ 228.569703] x5 : 0000000000000000 x4 : ffff008b85321400 x3 : 0000000000000001 <4>[ 228.576827] x2 : 0000000000000000 x1 : ffff800011035d20 x0 : ffff80001147f000 <4>[ 228.583952] Call trace: <4>[ 228.586385] percpu_ref_kill_and_confirm+0x94/0xb0 <4>[ 228.591163] nvmet_sq_destroy+0xf0/0x194 [nvmet] <4>[ 228.595777] nvme_loop_destroy_io_queues+0x6c/0xa0 [nvme_loop] <4>[ 228.601598] nvme_loop_shutdown_ctrl+0x64/0xc0 [nvme_loop] <4>[ 228.607071] nvme_loop_delete_ctrl_host+0x20/0x30 [nvme_loop] <4>[ 228.612805] nvme_do_delete_ctrl+0x5c/0x74 [nvme_core] <4>[ 228.617939] nvme_delete_ctrl_work+0x20/0x44 [nvme_core] <4>[ 228.623246] process_one_work+0x1f0/0x4ac <4>[ 228.627244] worker_thread+0x184/0x40c <4>[ 
228.630981] kthread+0x11c/0x120 <4>[ 228.634197] ret_from_fork+0x10/0x18 <0>[ 228.637763] Code: b0004841 3905dc03 91348021 f0006a80 (f9400442) <4>[ 228.643843] ---[ end trace 57e2d1ace7d39025 ]--- ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-12 0:32 ` Yi Zhang @ 2021-05-19 0:36 ` Yi Zhang 2021-05-20 6:19 ` Hannes Reinecke 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-05-19 0:36 UTC (permalink / raw) To: Hannes Reinecke Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves Hi Hannes Gentle ping here. On Wed, May 12, 2021 at 8:32 AM Yi Zhang <yi.zhang@redhat.com> wrote: > > On Sun, May 9, 2021 at 4:44 PM Hannes Reinecke <hare@suse.de> wrote: > > > > On 5/7/21 9:50 PM, Sagi Grimberg wrote: > > > > > >>> Hi Sagi > > >>> > > >>> Bisect shows bellow commits was the first bad commit. > > >>> > > >>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) > > >>> Author: Hannes Reinecke <hare@suse.de> > > >>> Date: Fri Apr 16 13:46:20 2021 +0200 > > >>> > > >>> nvme: sanitize KATO setting > > >>> > > >> > > >> Hi Sagi > > >> This issue has been consistently reproduced with blktests on recent > > >> linux-block/for-next, do you have a chance to check it? > > > > > > I had not, Hannes can you have a look? > > > > > Can you check if this patch helps? 
> > > > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c > > index 74b3b150e1a5..a48d8219cf90 100644 > > --- a/drivers/nvme/target/loop.c > > +++ b/drivers/nvme/target/loop.c > > @@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct > > work_struct *work) > > container_of(work, struct nvme_loop_ctrl, ctrl.reset_work); > > int ret; > > > > + /* reset_work might run concurrently to delete_work */ > > + if (ctrl->state == NVME_CTRL_DELETING) > > + return; > > nvme_stop_ctrl(&ctrl->ctrl); > > nvme_loop_shutdown_ctrl(ctrl); > > > > Hi Hannes > The issue still can be reproduced with this patch, here is the full log: > > <12>[ 213.737602] run blktests nvme/022 at 2021-05-11 20:29:23 > <6>[ 213.857826] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > <6>[ 213.879592] nvmet: creating controller 1 for subsystem > blktests-subsystem-1 for NQN > nqn.2014-08.org.nvmexpress:uuid:7bc1767264664586a5ebf3ec1cdca2b9. > <6>[ 213.893281] nvme nvme0: creating 128 I/O queues. > <6>[ 213.910709] nvme nvme0: new ctrl: "blktests-subsystem-1" > <4>[ 214.943818] nvme nvme0: resetting controller > <3>[ 224.004540] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > <3>[ 224.010452] nvmet: ctrl 1 fatal error occurred! 
> <6>[ 224.014997] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > <4>[ 227.734741] ------------[ cut here ]------------ > <4>[ 227.739348] WARNING: CPU: 110 PID: 1648 at > drivers/nvme/target/loop.c:469 nvme_loop_reset_ctrl_work+0x5c/0x110 > [nvme_loop] > <4>[ 227.750386] Modules linked in: nvme_loop nvme_fabrics nvmet > nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod > target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib > rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc > coresight_etm4x i2c_smbus coresight_tmc coresight_replicator > coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw > ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu > vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec > drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas > usb_storage aes_neon_bs > <4>[ 227.809227] CPU: 110 PID: 1648 Comm: kworker/u513:2 Not tainted > 5.13.0-rc1.fix+ #1 > <4>[ 227.816784] Hardware name: HPE Apollo 70 > /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 > <4>[ 227.826511] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop] > <4>[ 227.833376] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) > <4>[ 227.839370] pc : nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] > <4>[ 227.845451] lr : nvme_loop_reset_ctrl_work+0x54/0x110 [nvme_loop] > <4>[ 227.851532] sp : ffff800020c43d80 > <4>[ 227.854833] x29: ffff800020c43d80 x28: 0000000000000000 x27: > ffff00080d1b73c0 > <4>[ 227.861959] x26: ffff00080d1fb76c x25: 0000000000000000 x24: > ffff0008126be8c0 > <4>[ 227.869082] x23: 0000000000000000 x22: ffff000826eafb00 x21: > ffff0008126be4b0 > <4>[ 227.876205] x20: ffff0008126be000 x19: ffff0008126be8b8 x18: > 0000000000000020 > <4>[ 227.883328] x17: 00000000f911e638 x16: 0000000032ecb495 x15: > 000000000000000f > <4>[ 227.890452] x14: 0000000000000003 x13: 0000000000000000 x12: > 
ffff008b822d32d0 > <4>[ 227.897575] x11: ffff008b822d3230 x10: 0000000000000001 x9 : > ffff800009058e74 > <4>[ 227.904699] x8 : ffff008b7cfeaab8 x7 : fffffffffffc0000 x6 : > ffff000000000000 > <4>[ 227.911822] x5 : 000000008040002a x4 : 0000000000000000 x3 : > ffff0008126be4bc > <4>[ 227.918946] x2 : 0000000000000001 x1 : ffff0008126be4bc x0 : > 0000000000000000 > <4>[ 227.926069] Call trace: > <4>[ 227.928503] nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] > <4>[ 227.934237] process_one_work+0x1f0/0x4ac > <4>[ 227.938239] worker_thread+0x22c/0x40c > <4>[ 227.941976] kthread+0x11c/0x120 > <4>[ 227.945193] ret_from_fork+0x10/0x18 > <4>[ 227.948759] ---[ end trace 57e2d1ace7d39024 ]--- > <1>[ 228.355440] Unable to handle kernel NULL pointer dereference at > virtual address 0000000000000008 > <1>[ 228.364215] Mem abort info: > <1>[ 228.366996] ESR = 0x96000004 > <1>[ 228.370037] EC = 0x25: DABT (current EL), IL = 32 bits > <1>[ 228.375336] SET = 0, FnV = 0 > <1>[ 228.378377] EA = 0, S1PTW = 0 > <1>[ 228.381505] Data abort info: > <1>[ 228.384372] ISV = 0, ISS = 0x00000004 > <1>[ 228.388194] CM = 0, WnR = 0 > <1>[ 228.391149] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000897413000 > <1>[ 228.397576] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000 > <0>[ 228.404354] Internal error: Oops: 96000004 [#1] SMP > <4>[ 228.409220] Modules linked in: nvme_loop nvme_fabrics nvmet > nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod > target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib > rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc > coresight_etm4x i2c_smbus coresight_tmc coresight_replicator > coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw > ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu > vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec > drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp 
i2c_xlp9xx uas > usb_storage aes_neon_bs > <4>[ 228.468051] CPU: 5 PID: 2015 Comm: kworker/u513:7 Tainted: G > W 5.13.0-rc1.fix+ #1 > <4>[ 228.476822] Hardware name: HPE Apollo 70 > /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 > <4>[ 228.486548] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] > <4>[ 228.493166] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--) > <4>[ 228.499160] pc : percpu_ref_kill_and_confirm+0x94/0xb0 > <4>[ 228.504288] lr : percpu_ref_kill_and_confirm+0x20/0xb0 > <4>[ 228.509414] sp : ffff800020f63c90 > <4>[ 228.512715] x29: ffff800020f63c90 x28: 0000000000000000 x27: > ffff0008214969c0 > <4>[ 228.519840] x26: ffff000809dc8b6c x25: 0000000000000000 x24: > ffff00084f1d00d0 > <4>[ 228.526963] x23: ffff00084f1d00b0 x22: ffff0008126be130 x21: > ffff80000939a590 > <4>[ 228.534087] x20: 0000000000000000 x19: ffff00084f1d0090 x18: > 0000000000000020 > <4>[ 228.541210] x17: 0000000000000000 x16: 0000000000000000 x15: > 0000000000000013 > <4>[ 228.548334] x14: 0000000000000003 x13: 0000000000000000 x12: > 0000000000000040 > <4>[ 228.555457] x11: ffff000800401918 x10: ffff00080040191a x9 : > ffff8000090589ac > <4>[ 228.562580] x8 : ffff008b7cfeaab8 x7 : 0000000000000018 x6 : > ffff000000000000 > <4>[ 228.569703] x5 : 0000000000000000 x4 : ffff008b85321400 x3 : > 0000000000000001 > <4>[ 228.576827] x2 : 0000000000000000 x1 : ffff800011035d20 x0 : > ffff80001147f000 > <4>[ 228.583952] Call trace: > <4>[ 228.586385] percpu_ref_kill_and_confirm+0x94/0xb0 > <4>[ 228.591163] nvmet_sq_destroy+0xf0/0x194 [nvmet] > <4>[ 228.595777] nvme_loop_destroy_io_queues+0x6c/0xa0 [nvme_loop] > <4>[ 228.601598] nvme_loop_shutdown_ctrl+0x64/0xc0 [nvme_loop] > <4>[ 228.607071] nvme_loop_delete_ctrl_host+0x20/0x30 [nvme_loop] > <4>[ 228.612805] nvme_do_delete_ctrl+0x5c/0x74 [nvme_core] > <4>[ 228.617939] nvme_delete_ctrl_work+0x20/0x44 [nvme_core] > <4>[ 228.623246] process_one_work+0x1f0/0x4ac > <4>[ 228.627244] worker_thread+0x184/0x40c > <4>[ 228.630981] 
kthread+0x11c/0x120 > <4>[ 228.634197] ret_from_fork+0x10/0x18 > <0>[ 228.637763] Code: b0004841 3905dc03 91348021 f0006a80 (f9400442) > <4>[ 228.643843] ---[ end trace 57e2d1ace7d39025 ]--- -- Best Regards, Yi Zhang _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-19 0:36 ` Yi Zhang @ 2021-05-20 6:19 ` Hannes Reinecke 2021-05-21 0:38 ` Yi Zhang 0 siblings, 1 reply; 13+ messages in thread From: Hannes Reinecke @ 2021-05-20 6:19 UTC (permalink / raw) To: Yi Zhang; +Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves On 5/19/21 2:36 AM, Yi Zhang wrote: > Hi Hannes > > Gentle ping here. > > On Wed, May 12, 2021 at 8:32 AM Yi Zhang <yi.zhang@redhat.com> wrote: >> >> On Sun, May 9, 2021 at 4:44 PM Hannes Reinecke <hare@suse.de> wrote: >>> >>> On 5/7/21 9:50 PM, Sagi Grimberg wrote: >>>> >>>>>> Hi Sagi >>>>>> >>>>>> Bisect shows the commit below was the first bad commit. >>>>>> >>>>>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad) >>>>>> Author: Hannes Reinecke <hare@suse.de> >>>>>> Date: Fri Apr 16 13:46:20 2021 +0200 >>>>>> >>>>>> nvme: sanitize KATO setting >>>>>> >>>>> >>>>> Hi Sagi >>>>> This issue has been consistently reproduced with blktests on recent >>>>> linux-block/for-next, do you have a chance to check it? >>>> >>>> I had not, Hannes can you have a look? >>>> >>> Can you check if this patch helps? 
>>> >>> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c >>> index 74b3b150e1a5..a48d8219cf90 100644 >>> --- a/drivers/nvme/target/loop.c >>> +++ b/drivers/nvme/target/loop.c >>> @@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct >>> work_struct *work) >>> container_of(work, struct nvme_loop_ctrl, ctrl.reset_work); >>> int ret; >>> >>> + /* reset_work might run concurrently to delete_work */ >>> + if (ctrl->state == NVME_CTRL_DELETING) >>> + return; >>> nvme_stop_ctrl(&ctrl->ctrl); >>> nvme_loop_shutdown_ctrl(ctrl); >>> >> >> Hi Hannes >> The issue still can be reproduced with this patch, here is the full log: >> >> <12>[ 213.737602] run blktests nvme/022 at 2021-05-11 20:29:23 >> <6>[ 213.857826] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >> <6>[ 213.879592] nvmet: creating controller 1 for subsystem >> blktests-subsystem-1 for NQN >> nqn.2014-08.org.nvmexpress:uuid:7bc1767264664586a5ebf3ec1cdca2b9. >> <6>[ 213.893281] nvme nvme0: creating 128 I/O queues. >> <6>[ 213.910709] nvme nvme0: new ctrl: "blktests-subsystem-1" >> <4>[ 214.943818] nvme nvme0: resetting controller >> <3>[ 224.004540] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! >> <3>[ 224.010452] nvmet: ctrl 1 fatal error occurred! 
>> <6>[ 224.014997] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" >> <4>[ 227.734741] ------------[ cut here ]------------ >> <4>[ 227.739348] WARNING: CPU: 110 PID: 1648 at >> drivers/nvme/target/loop.c:469 nvme_loop_reset_ctrl_work+0x5c/0x110 >> [nvme_loop] >> <4>[ 227.750386] Modules linked in: nvme_loop nvme_fabrics nvmet >> nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod >> target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib >> rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc >> coresight_etm4x i2c_smbus coresight_tmc coresight_replicator >> coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw >> ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu >> vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper >> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec >> drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas >> usb_storage aes_neon_bs >> <4>[ 227.809227] CPU: 110 PID: 1648 Comm: kworker/u513:2 Not tainted >> 5.13.0-rc1.fix+ #1 >> <4>[ 227.816784] Hardware name: HPE Apollo 70 >> /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 >> <4>[ 227.826511] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop] >> <4>[ 227.833376] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) >> <4>[ 227.839370] pc : nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] >> <4>[ 227.845451] lr : nvme_loop_reset_ctrl_work+0x54/0x110 [nvme_loop] >> <4>[ 227.851532] sp : ffff800020c43d80 >> <4>[ 227.854833] x29: ffff800020c43d80 x28: 0000000000000000 x27: >> ffff00080d1b73c0 >> <4>[ 227.861959] x26: ffff00080d1fb76c x25: 0000000000000000 x24: >> ffff0008126be8c0 >> <4>[ 227.869082] x23: 0000000000000000 x22: ffff000826eafb00 x21: >> ffff0008126be4b0 >> <4>[ 227.876205] x20: ffff0008126be000 x19: ffff0008126be8b8 x18: >> 0000000000000020 >> <4>[ 227.883328] x17: 00000000f911e638 x16: 0000000032ecb495 x15: >> 000000000000000f >> <4>[ 227.890452] x14: 
0000000000000003 x13: 0000000000000000 x12: >> ffff008b822d32d0 >> <4>[ 227.897575] x11: ffff008b822d3230 x10: 0000000000000001 x9 : >> ffff800009058e74 >> <4>[ 227.904699] x8 : ffff008b7cfeaab8 x7 : fffffffffffc0000 x6 : >> ffff000000000000 >> <4>[ 227.911822] x5 : 000000008040002a x4 : 0000000000000000 x3 : >> ffff0008126be4bc >> <4>[ 227.918946] x2 : 0000000000000001 x1 : ffff0008126be4bc x0 : >> 0000000000000000 >> <4>[ 227.926069] Call trace: >> <4>[ 227.928503] nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop] >> <4>[ 227.934237] process_one_work+0x1f0/0x4ac >> <4>[ 227.938239] worker_thread+0x22c/0x40c >> <4>[ 227.941976] kthread+0x11c/0x120 >> <4>[ 227.945193] ret_from_fork+0x10/0x18 >> <4>[ 227.948759] ---[ end trace 57e2d1ace7d39024 ]--- >> <1>[ 228.355440] Unable to handle kernel NULL pointer dereference at >> virtual address 0000000000000008 >> <1>[ 228.364215] Mem abort info: >> <1>[ 228.366996] ESR = 0x96000004 >> <1>[ 228.370037] EC = 0x25: DABT (current EL), IL = 32 bits >> <1>[ 228.375336] SET = 0, FnV = 0 >> <1>[ 228.378377] EA = 0, S1PTW = 0 >> <1>[ 228.381505] Data abort info: >> <1>[ 228.384372] ISV = 0, ISS = 0x00000004 >> <1>[ 228.388194] CM = 0, WnR = 0 >> <1>[ 228.391149] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000897413000 >> <1>[ 228.397576] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000 >> <0>[ 228.404354] Internal error: Oops: 96000004 [#1] SMP >> <4>[ 228.409220] Modules linked in: nvme_loop nvme_fabrics nvmet >> nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod >> target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib >> rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc >> coresight_etm4x i2c_smbus coresight_tmc coresight_replicator >> coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw >> ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu >> vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper >> drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops cec >> drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas >> usb_storage aes_neon_bs >> <4>[ 228.468051] CPU: 5 PID: 2015 Comm: kworker/u513:7 Tainted: G >> W 5.13.0-rc1.fix+ #1 >> <4>[ 228.476822] Hardware name: HPE Apollo 70 >> /C01_APACHE_MB , BIOS L50_5.13_1.16 07/29/2020 >> <4>[ 228.486548] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] >> <4>[ 228.493166] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--) >> <4>[ 228.499160] pc : percpu_ref_kill_and_confirm+0x94/0xb0 >> <4>[ 228.504288] lr : percpu_ref_kill_and_confirm+0x20/0xb0 >> <4>[ 228.509414] sp : ffff800020f63c90 >> <4>[ 228.512715] x29: ffff800020f63c90 x28: 0000000000000000 x27: >> ffff0008214969c0 >> <4>[ 228.519840] x26: ffff000809dc8b6c x25: 0000000000000000 x24: >> ffff00084f1d00d0 >> <4>[ 228.526963] x23: ffff00084f1d00b0 x22: ffff0008126be130 x21: >> ffff80000939a590 >> <4>[ 228.534087] x20: 0000000000000000 x19: ffff00084f1d0090 x18: >> 0000000000000020 >> <4>[ 228.541210] x17: 0000000000000000 x16: 0000000000000000 x15: >> 0000000000000013 >> <4>[ 228.548334] x14: 0000000000000003 x13: 0000000000000000 x12: >> 0000000000000040 >> <4>[ 228.555457] x11: ffff000800401918 x10: ffff00080040191a x9 : >> ffff8000090589ac >> <4>[ 228.562580] x8 : ffff008b7cfeaab8 x7 : 0000000000000018 x6 : >> ffff000000000000 >> <4>[ 228.569703] x5 : 0000000000000000 x4 : ffff008b85321400 x3 : >> 0000000000000001 >> <4>[ 228.576827] x2 : 0000000000000000 x1 : ffff800011035d20 x0 : >> ffff80001147f000 >> <4>[ 228.583952] Call trace: >> <4>[ 228.586385] percpu_ref_kill_and_confirm+0x94/0xb0 >> <4>[ 228.591163] nvmet_sq_destroy+0xf0/0x194 [nvmet] >> <4>[ 228.595777] nvme_loop_destroy_io_queues+0x6c/0xa0 [nvme_loop] >> <4>[ 228.601598] nvme_loop_shutdown_ctrl+0x64/0xc0 [nvme_loop] >> <4>[ 228.607071] nvme_loop_delete_ctrl_host+0x20/0x30 [nvme_loop] >> <4>[ 228.612805] nvme_do_delete_ctrl+0x5c/0x74 [nvme_core] >> <4>[ 228.617939] 
nvme_delete_ctrl_work+0x20/0x44 [nvme_core] >> <4>[ 228.623246] process_one_work+0x1f0/0x4ac >> <4>[ 228.627244] worker_thread+0x184/0x40c >> <4>[ 228.630981] kthread+0x11c/0x120 >> <4>[ 228.634197] ret_from_fork+0x10/0x18 >> <0>[ 228.637763] Code: b0004841 3905dc03 91348021 f0006a80 (f9400442) >> <4>[ 228.643843] ---[ end trace 57e2d1ace7d39025 ]--- > > > What about this? diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index 74b3b150e1a5..9838a7d27bc1 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -263,7 +263,8 @@ static const struct blk_mq_ops nvme_loop_admin_mq_ops = { static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl) { - clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); + if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags)) + return; nvmet_sq_destroy(&ctrl->queues[0].nvme_sq); blk_cleanup_queue(ctrl->ctrl.admin_q); blk_cleanup_queue(ctrl->ctrl.fabrics_q); @@ -299,6 +300,7 @@ static void nvme_loop_destroy_io_queues(struct nvme_loop_ctrl *ctrl) clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[i].flags); nvmet_sq_destroy(&ctrl->queues[i].nvme_sq); } + ctrl->ctrl.queue_count = 1; } static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl) @@ -405,6 +407,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl) return 0; out_cleanup_queue: + clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); blk_cleanup_queue(ctrl->ctrl.admin_q); out_cleanup_fabrics_q: blk_cleanup_queue(ctrl->ctrl.fabrics_q); @@ -462,8 +465,10 @@ static void nvme_loop_reset_ctrl_work(struct work_struct *work) nvme_loop_shutdown_ctrl(ctrl); if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) { - /* state change failure should never happen */ - WARN_ON_ONCE(1); + if (ctrl->ctrl.state != NVME_CTRL_DELETING && + ctrl->ctrl.state != NVME_CTRL_DELETING_NOIO) + /* state change failure for non-deleted ctrl? 
*/ + WARN_ON_ONCE(1); return; } (I'll be sending out a formal patchset once it's confirmed). Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-20 6:19 ` Hannes Reinecke @ 2021-05-21 0:38 ` Yi Zhang 2021-05-21 18:19 ` Sagi Grimberg 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-05-21 0:38 UTC (permalink / raw) To: Hannes Reinecke Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves > > > > What about this? Hi Hannes With this patch, no WARNING/NULL pointer this time, but I still see the 'keep-alive timer expired' and reset failure issues; here is the full log: # ./check nvme/022 nvme/022 (test NVMe reset command on NVMeOF file-backed ns) [failed] runtime 10.646s ... 11.087s --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400 +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20 20:24:27.874250466 -0400 @@ -1,4 +1,5 @@ Running nvme/022 91fdba0d-f87b-4c25-b80f-db7be1418b9e uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e +ERROR: reset failed Test complete # cat results/nodev/nvme/022.full Reset: Network dropped connection on reset NQN:blktests-subsystem-1 disconnected 1 controller(s) [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16 [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [37353.161765] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. [37353.175796] nvme nvme0: creating 128 I/O queues. [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1" [37354.216686] nvme nvme0: resetting controller [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! [37363.276521] nvmet: ctrl 1 fatal error occurred! [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" # ./check nvme/021 nvme/021 (test NVMe list command on NVMeOF file-backed ns) [passed] runtime 10.958s ... 
11.382s # dmesg [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26 [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [38142.956621] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. [38142.970524] nvme nvme0: creating 128 I/O queues. [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1" [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! [38153.036018] nvmet: ctrl 1 fatal error occurred! > > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c > index 74b3b150e1a5..9838a7d27bc1 100644 > --- a/drivers/nvme/target/loop.c > +++ b/drivers/nvme/target/loop.c > @@ -263,7 +263,8 @@ static const struct blk_mq_ops > nvme_loop_admin_mq_ops = { > > static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl) > { > - clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); > + if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags)) > + return; > nvmet_sq_destroy(&ctrl->queues[0].nvme_sq); > blk_cleanup_queue(ctrl->ctrl.admin_q); > blk_cleanup_queue(ctrl->ctrl.fabrics_q); > @@ -299,6 +300,7 @@ static void nvme_loop_destroy_io_queues(struct > nvme_loop_ctrl *ctrl) > clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[i].flags); > nvmet_sq_destroy(&ctrl->queues[i].nvme_sq); > } > + ctrl->ctrl.queue_count = 1; > } > > static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl) > @@ -405,6 +407,7 @@ static int nvme_loop_configure_admin_queue(struct > nvme_loop_ctrl *ctrl) > return 0; > > out_cleanup_queue: > + clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); > blk_cleanup_queue(ctrl->ctrl.admin_q); > out_cleanup_fabrics_q: > blk_cleanup_queue(ctrl->ctrl.fabrics_q); > @@ -462,8 +465,10 @@ static void nvme_loop_reset_ctrl_work(struct > work_struct *work) > nvme_loop_shutdown_ctrl(ctrl); > > if (!nvme_change_ctrl_state(&ctrl->ctrl, 
NVME_CTRL_CONNECTING)) { > - /* state change failure should never happen */ > - WARN_ON_ONCE(1); > + if (ctrl->ctrl.state != NVME_CTRL_DELETING && > + ctrl->ctrl.state != NVME_CTRL_DELETING_NOIO) > + /* state change failure for non-deleted ctrl? */ > + WARN_ON_ONCE(1); > return; > } > > > (I'll be sending out a formal patchset once it's confirmed). > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer > -- Best Regards, Yi Zhang _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-21 0:38 ` Yi Zhang @ 2021-05-21 18:19 ` Sagi Grimberg 2021-05-22 0:12 ` Yi Zhang 0 siblings, 1 reply; 13+ messages in thread From: Sagi Grimberg @ 2021-05-21 18:19 UTC (permalink / raw) To: Yi Zhang, Hannes Reinecke; +Cc: linux-nvme, skt-results-master, Bruno Goncalves >>> What about this? > > Hi Hannes > With this patch, no WARNNING/NULL pointer this time, but still have > 'keep-alive timer expired' and reset failure issue, here is the full > log: > > # ./check nvme/022 > nvme/022 (test NVMe reset command on NVMeOF file-backed ns) [failed] > runtime 10.646s ... 11.087s > --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400 > +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20 > 20:24:27.874250466 -0400 > @@ -1,4 +1,5 @@ > Running nvme/022 > 91fdba0d-f87b-4c25-b80f-db7be1418b9e > uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e > +ERROR: reset failed > Test complete > # cat results/nodev/nvme/022.full > Reset: Network dropped connection on reset > NQN:blktests-subsystem-1 disconnected 1 controller(s) > > [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16 > [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > [37353.161765] nvmet: creating controller 1 for subsystem > blktests-subsystem-1 for NQN > nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. > [37353.175796] nvme nvme0: creating 128 I/O queues. > [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1" > [37354.216686] nvme nvme0: resetting controller > [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > [37363.276521] nvmet: ctrl 1 fatal error occurred! > [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > # ./check nvme/021 > nvme/021 (test NVMe list command on NVMeOF file-backed ns) [passed] > runtime 10.958s ... 
11.382s > # dmesg > [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26 > [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > [38142.956621] nvmet: creating controller 1 for subsystem > blktests-subsystem-1 for NQN > nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. > [38142.970524] nvme nvme0: creating 128 I/O queues. > [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1" > [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > [38153.036018] nvmet: ctrl 1 fatal error occurred! I think that the main reason is that there are 128 queues that are being created, and during that time the keep-alive timer ends up expiring as it is shorter (used to be 15 seconds, now 5 by default). nvmet only stops the keep-alive timer when the controller is freed, which is pretty late in the sequence. The problem is that it needs to be this way because if we shut it down sooner a host can die in the middle of a teardown sequence and we still need to detect that and clean up ourselves. But maybe we can mod the keep-alive timer for every queue we delete, just in case the host is not deleting fast enough? Ming, does this solve the issue you are seeing? -- diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 1853db38b682..f0715e9a4a9c 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq) percpu_ref_exit(&sq->ref); if (ctrl) { + ctrl->cmd_seen = true; nvmet_ctrl_put(ctrl); sq->ctrl = NULL; /* allows reusing the queue later */ } -- We probably need to rename cmd_seen to extend_tbkas (extend traffic based keep-alive). _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-21 18:19 ` Sagi Grimberg @ 2021-05-22 0:12 ` Yi Zhang 2021-05-22 14:59 ` Hannes Reinecke 0 siblings, 1 reply; 13+ messages in thread From: Yi Zhang @ 2021-05-22 0:12 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme, skt-results-master, Bruno Goncalves On Sat, May 22, 2021 at 2:19 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > >>> What about this? > > > > Hi Hannes > > With this patch, no WARNNING/NULL pointer this time, but still have > > 'keep-alive timer expired' and reset failure issue, here is the full > > log: > > > > # ./check nvme/022 > > nvme/022 (test NVMe reset command on NVMeOF file-backed ns) [failed] > > runtime 10.646s ... 11.087s > > --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400 > > +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20 > > 20:24:27.874250466 -0400 > > @@ -1,4 +1,5 @@ > > Running nvme/022 > > 91fdba0d-f87b-4c25-b80f-db7be1418b9e > > uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e > > +ERROR: reset failed > > Test complete > > # cat results/nodev/nvme/022.full > > Reset: Network dropped connection on reset > > NQN:blktests-subsystem-1 disconnected 1 controller(s) > > > > [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16 > > [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > > [37353.161765] nvmet: creating controller 1 for subsystem > > blktests-subsystem-1 for NQN > > nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. > > [37353.175796] nvme nvme0: creating 128 I/O queues. > > [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1" > > [37354.216686] nvme nvme0: resetting controller > > [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > > [37363.276521] nvmet: ctrl 1 fatal error occurred! 
> > [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > > > # ./check nvme/021 > > nvme/021 (test NVMe list command on NVMeOF file-backed ns) [passed] > > runtime 10.958s ... 11.382s > > # dmesg > > [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26 > > [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > > [38142.956621] nvmet: creating controller 1 for subsystem > > blktests-subsystem-1 for NQN > > nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. > > [38142.970524] nvme nvme0: creating 128 I/O queues. > > [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1" > > [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! > > [38153.036018] nvmet: ctrl 1 fatal error occurred! > > I think that the main reason is that there are 128 queues that are being > created, and during that time the keep alive timer ends up expiring as > it is shorter (used to be 15 seconds, now 5 by default). > > nvmet only stops the keep-alive timer when the controller is freed, > which is pretty late in the sequence.. The problem is that it needs to > be this way because if we shut it down sooner a host can die in the > middle of a teardown sequence and we still need to detect that and > cleanup ourselves. But maybe we can mod the keep-alive timer for > every queue we delete, just in the case the host is not deleting > fast enough? > > Ming, does this solve the issue you are seeing? Hi Sagi The issue was fixed by this patch. 
:) > -- > diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c > index 1853db38b682..f0715e9a4a9c 100644 > --- a/drivers/nvme/target/core.c > +++ b/drivers/nvme/target/core.c > @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq) > percpu_ref_exit(&sq->ref); > > if (ctrl) { > + ctrl->cmd_seen = true; > nvmet_ctrl_put(ctrl); > sq->ctrl = NULL; /* allows reusing the queue later */ > } > -- > > We probably need to rename cmd_seen to extend_tbkas (extend traffic > based keep-alive). > -- Best Regards, Yi Zhang _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer 2021-05-22 0:12 ` Yi Zhang @ 2021-05-22 14:59 ` Hannes Reinecke 0 siblings, 0 replies; 13+ messages in thread From: Hannes Reinecke @ 2021-05-22 14:59 UTC (permalink / raw) To: Yi Zhang, Sagi Grimberg; +Cc: linux-nvme, skt-results-master, Bruno Goncalves On 5/22/21 2:12 AM, Yi Zhang wrote: > On Sat, May 22, 2021 at 2:19 AM Sagi Grimberg <sagi@grimberg.me> wrote: >> >> >>>>> What about this? >>> >>> Hi Hannes >>> With this patch, no WARNNING/NULL pointer this time, but still have >>> 'keep-alive timer expired' and reset failure issue, here is the full >>> log: >>> >>> # ./check nvme/022 >>> nvme/022 (test NVMe reset command on NVMeOF file-backed ns) [failed] >>> runtime 10.646s ... 11.087s >>> --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400 >>> +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20 >>> 20:24:27.874250466 -0400 >>> @@ -1,4 +1,5 @@ >>> Running nvme/022 >>> 91fdba0d-f87b-4c25-b80f-db7be1418b9e >>> uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e >>> +ERROR: reset failed >>> Test complete >>> # cat results/nodev/nvme/022.full >>> Reset: Network dropped connection on reset >>> NQN:blktests-subsystem-1 disconnected 1 controller(s) >>> >>> [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16 >>> [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >>> [37353.161765] nvmet: creating controller 1 for subsystem >>> blktests-subsystem-1 for NQN >>> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. >>> [37353.175796] nvme nvme0: creating 128 I/O queues. >>> [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1" >>> [37354.216686] nvme nvme0: resetting controller >>> [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! >>> [37363.276521] nvmet: ctrl 1 fatal error occurred! 
>>> [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" >>> >>> # ./check nvme/021 >>> nvme/021 (test NVMe list command on NVMeOF file-backed ns) [passed] >>> runtime 10.958s ... 11.382s >>> # dmesg >>> [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26 >>> [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >>> [38142.956621] nvmet: creating controller 1 for subsystem >>> blktests-subsystem-1 for NQN >>> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904. >>> [38142.970524] nvme nvme0: creating 128 I/O queues. >>> [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1" >>> [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" >>> [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired! >>> [38153.036018] nvmet: ctrl 1 fatal error occurred! >> >> I think that the main reason is that there are 128 queues that are being >> created, and during that time the keep alive timer ends up expiring as >> it is shorter (used to be 15 seconds, now 5 by default). >> >> nvmet only stops the keep-alive timer when the controller is freed, >> which is pretty late in the sequence.. The problem is that it needs to >> be this way because if we shut it down sooner a host can die in the >> middle of a teardown sequence and we still need to detect that and >> cleanup ourselves. But maybe we can mod the keep-alive timer for >> every queue we delete, just in the case the host is not deleting >> fast enough? >> >> Ming, does this solve the issue you are seeing? > > Hi Sagi > The issue was fixed by this patch. 
:) > >> -- >> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c >> index 1853db38b682..f0715e9a4a9c 100644 >> --- a/drivers/nvme/target/core.c >> +++ b/drivers/nvme/target/core.c >> @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq) >> percpu_ref_exit(&sq->ref); >> >> if (ctrl) { >> + ctrl->cmd_seen = true; >> nvmet_ctrl_put(ctrl); >> sq->ctrl = NULL; /* allows reusing the queue later */ >> } >> -- >> >> We probably need to rename cmd_seen to extend_tbkas (extend traffic >> based keep-alive). >> > > Thanks for the confirmation. I'll send a proper patchset. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-05-22 14:59 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-30 16:40 [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer Yi Zhang 2021-05-01 0:55 ` Sagi Grimberg 2021-05-01 9:58 ` Yi Zhang 2021-05-07 8:35 ` Yi Zhang 2021-05-07 19:50 ` Sagi Grimberg 2021-05-09 8:44 ` Hannes Reinecke 2021-05-12 0:32 ` Yi Zhang 2021-05-19 0:36 ` Yi Zhang 2021-05-20 6:19 ` Hannes Reinecke 2021-05-21 0:38 ` Yi Zhang 2021-05-21 18:19 ` Sagi Grimberg 2021-05-22 0:12 ` Yi Zhang 2021-05-22 14:59 ` Hannes Reinecke