All of lore.kernel.org
* [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
@ 2021-04-30 16:40 Yi Zhang
  2021-05-01  0:55 ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-04-30 16:40 UTC (permalink / raw)
  To: linux-nvme; +Cc: skt-results-master, Bruno Goncalves

Hello
Recently CKI reproduced this WARNING and NULL pointer dereference with
linux-block/for-next on aarch64. It looks like a regression; I will try
to bisect the culprit.

blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns)

[ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36
[ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 1879.819087] nvmet: creating controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd.
[ 1879.833081] nvmet: unhandled identify cns 6 on qid 0
[ 1879.838079] nvme nvme0: creating 128 I/O queues.
[ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1"
[ 1880.879731] nvme nvme0: resetting controller
[ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 1889.946377] nvmet: ctrl 1 fatal error occurred!
[ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 1892.810813] ------------[ cut here ]------------
[ 1892.815427] WARNING: CPU: 30 PID: 5492 at
drivers/nvme/target/loop.c:466 nvme_loop_reset_ctrl_work+0x48/0xf0
[nvme_loop]
[ 1892.826293] Modules linked in: nvme_loop nvme_fabrics nvme_core
nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert
iscsi_target_mod target_core_mod ib_iser vfat libiscsi fat
scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
ib_uverbs i2c_smbus ib_core crct10dif_ce ghash_ce sha1_ce acpi_ipmi
ipmi_ssif ipmi_devintf ipmi_msghandler thunderx2_pmu ip_tables xfs
libcrc32c sr_mod cdrom mlx5_core ast i2c_algo_bit drm_vram_helper
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
drm_ttm_helper ttm drm uas mlxfw sha2_ce tls sha256_arm64 usb_storage
sg psample gpio_xlp i2c_xlp9xx dm_mirror dm_region_hash dm_log dm_mod
[last unloaded: nvmet]
[ 1892.885150] CPU: 30 PID: 5492 Comm: kworker/u513:5 Not tainted 5.12.0+ #1
[ 1892.891926] Hardware name: HPE Apollo 70             /C01_APACHE_MB
        , BIOS L50_5.13_1.16 07/29/2020
[ 1892.901654] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
[ 1892.908519] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1892.914513] pc : nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop]
[ 1892.920508] lr : nvme_loop_reset_ctrl_work+0x40/0xf0 [nvme_loop]
[ 1892.926502] sp : fffffe0031b6fd70
[ 1892.929803] x29: fffffe0031b6fd70 x28: 0000000000000000
[ 1892.935105] x27: fffffc081065c0c0 x26: fffffc0807c2c26c
[ 1892.940405] x25: 0000000000000000 x24: fffffc084ae24898
[ 1892.945705] x23: 0000000000000000 x22: fffffc09410c1d00
[ 1892.951004] x21: fffffc084ae24890 x20: fffffc084ae244a0
[ 1892.956305] x19: fffffc084ae24000 x18: 0000000000000012
[ 1892.961604] x17: 0000000000000001 x16: 0000000000000019
[ 1892.966904] x15: fffffe0011d7e7e0 x14: fffffc8ba09cedf8
[ 1892.972204] x13: 0000000000000000 x12: 0000000000000003
[ 1892.977504] x11: fffffc8ba09ced40 x10: fffffe0011d7e7e8
[ 1892.982804] x9 : fffffe000ad60c58 x8 : fffffe8b650f0000
[ 1892.988104] x7 : 0000000000000008 x6 : fffffc0000000000
[ 1892.993403] x5 : 0000000000000000 x4 : ffffffff22f337e0
[ 1892.998703] x3 : fffffc084ae244ac x2 : 0000000000000001
[ 1893.004003] x1 : fffffc084ae244ac x0 : 0000000000000000
[ 1893.009303] Call trace:
[ 1893.011737]  nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop]
[ 1893.017384]  process_one_work+0x1d0/0x438
[ 1893.021385]  worker_thread+0x1f8/0x4d8
[ 1893.025123]  kthread+0x114/0x118
[ 1893.028341]  ret_from_fork+0x10/0x18
[ 1893.031907] ---[ end trace 883109425327ab60 ]---
[ 1893.301843] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000008
[ 1893.310620] Mem abort info:
[ 1893.313401]   ESR = 0x96000006
[ 1893.316442]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1893.321741]   SET = 0, FnV = 0
[ 1893.324783]   EA = 0, S1PTW = 0
[ 1893.327911] Data abort info:
[ 1893.330778]   ISV = 0, ISS = 0x00000006
[ 1893.334600]   CM = 0, WnR = 0
[ 1893.337555] user pgtable: 64k pages, 42-bit VAs, pgdp=0000000a0b750000
[ 1893.344069] [0000000000000008] pgd=0000000000000000,
p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000
[ 1893.354669] Internal error: Oops: 96000006 [#1] SMP
[ 1893.359535] Modules linked in: nvme_loop nvme_fabrics nvme_core
nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_ ib_ipoib iw_cm
intf ipmi_msghandrm_kms_helper s64 usb_storage s.418386] CPU: 0 PID:
12871 Comm: kworker/u513:0 Tainted: G        W         5.12.0+ #1
[ 1893.426551] Hardware name: HPE Apollo 70             /C01_APACHE_MB
        , BIOS L50_5.13_1.16 07/29/2020
[ 1893.436277] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
[ 1893.442892] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--)
[ 1893.448886] pc : percpu_ref_kill_and_confirm+0x15c/0x178
[ 1893.454189] lr : nvmet_sq_destroy+0xec/0x1f0 [nvmet]
[ 1893.459149] sp : fffffe006466fc70
[ 1893.462449] x29: fffffe006466fc70 x28: 0000000000000000
[ 1893.467750] x27: fffffc8bcee4e5c0 x26: fffffc0807f57a6c
[ 1893.473051] x25: 0000000000000000 x24: fffffc084ae248b8
[ 1893.478351] x23: fffffc0854e600d0 x22: fffffe000ac40600
[ 1893.483650] x21: 0000000000000000 x20: fffffc0854e60090
[ 1893.488950] x19: fffffe0011d7e000 x18: 0000000000000016
[ 1893.494250] x17: 0000000000000001 x16: fffffc084e210598
[ 1893.499550] x15: fffffe0011d7e7e0 x14: fffffc8ba09cee18
[ 1893.504850] x13: 0000000000000000 x12: fffffe006466fc68
[ 1893.510149] x11: 0000000000000040 x10: fffffe001158b2f8
[ 1893.515449] x9 : fffffe000ad608ec x8 : fffffc082000cfc0
[ 1893.520749] x7 : 0000000000000001 x6 : fffffc0000000000
[ 1893.526049] x5 : 0000000000000080 x4 : 0000000000000000
[ 1893.531349] x3 : 0000000000000000 x2 : fffffe0011a4e000
[ 1893.536649] x1 : fffffe0010bdec08 x0 : fffffe0010e95000
[ 1893.541949] Call trace:
[ 1893.544383]  percpu_ref_kill_and_confirm+0x15c/0x178
[ 1893.549335]  nvmet_sq_destroy+0xec/0x1f0 [nvmet]
[ 1893.553945]  nvme_loop_destroy_io_queues+0x64/0x90 [nvme_loop]
[ 1893.559767]  nvme_loop_shutdown_ctrl+0x60/0xb8 [nvme_loop]
[ 1893.565240]  nvme_loop_delete_ctrl_host+0x18/0x20 [nvme_loop]
[ 1893.570973]  nvme_do_delete_ctrl+0x58/0x6c [nvme_core]
[ 1893.576106]  nvme_delete_ctrl_work+0x18/0x38 [nvme_core]
[ 1893.581411]  process_one_work+0x1d0/0x438
[ 1893.585410]  worker_thread+0x150/0x4d8
[ 1893.589148]  kthread+0x114/0x118
[ 1893.592364]  ret_from_fork+0x10/0x18
[ 1893.595929] Code: 39244840 f0003781 d0004d40 91302021 (f9400462)
[ 1893.602139] ---[ end trace 883109425327ab61 ]---
[ 1893.606744] Kernel panic - not syncing: Oops: Fatal exception
[ 1893.612512] SMP: stopping secondary CPUs
[ 1893.616474] Kernel Offset: disabled
[ 1893.619949] CPU features: 0x00046002,63000838
[ 1893.624293] Memory Limit: none
[ 1893.627362] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-04-30 16:40 [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer Yi Zhang
@ 2021-05-01  0:55 ` Sagi Grimberg
  2021-05-01  9:58   ` Yi Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2021-05-01  0:55 UTC (permalink / raw)
  To: Yi Zhang, linux-nvme; +Cc: skt-results-master, Bruno Goncalves


> Hello
> Recently CKI reproduced this WARNING and NULL pointer dereference with
> linux-block/for-next on aarch64. It looks like a regression; I will try
> to bisect the culprit.
> 
> blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns)
> 
> [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36
> [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [ 1879.819087] nvmet: creating controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd.
> [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0
> [ 1879.838079] nvme nvme0: creating 128 I/O queues.
> [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1"
> [ 1880.879731] nvme nvme0: resetting controller
> [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [ 1889.946377] nvmet: ctrl 1 fatal error occurred!
> [ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"

It appears that we somehow now expire the KATO after/during a reset
sequence, and then the reset and remove paths race with each other...

A bisection will definitely help.

> [ 1892.810813] ------------[ cut here ]------------
> [ 1892.815427] WARNING: CPU: 30 PID: 5492 at
> drivers/nvme/target/loop.c:466 nvme_loop_reset_ctrl_work+0x48/0xf0
> [nvme_loop]
> [ 1892.826293] Modules linked in: nvme_loop nvme_fabrics nvme_core
> nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert
> iscsi_target_mod target_core_mod ib_iser vfat libiscsi fat
> scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
> ib_uverbs i2c_smbus ib_core crct10dif_ce ghash_ce sha1_ce acpi_ipmi
> ipmi_ssif ipmi_devintf ipmi_msghandler thunderx2_pmu ip_tables xfs
> libcrc32c sr_mod cdrom mlx5_core ast i2c_algo_bit drm_vram_helper
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
> drm_ttm_helper ttm drm uas mlxfw sha2_ce tls sha256_arm64 usb_storage
> sg psample gpio_xlp i2c_xlp9xx dm_mirror dm_region_hash dm_log dm_mod
> [last unloaded: nvmet]
> [ 1892.885150] CPU: 30 PID: 5492 Comm: kworker/u513:5 Not tainted 5.12.0+ #1
> [ 1892.891926] Hardware name: HPE Apollo 70             /C01_APACHE_MB
>          , BIOS L50_5.13_1.16 07/29/2020
> [ 1892.901654] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
> [ 1892.908519] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
> [ 1892.914513] pc : nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop]
> [ 1892.920508] lr : nvme_loop_reset_ctrl_work+0x40/0xf0 [nvme_loop]
> [ 1892.926502] sp : fffffe0031b6fd70
> [ 1892.929803] x29: fffffe0031b6fd70 x28: 0000000000000000
> [ 1892.935105] x27: fffffc081065c0c0 x26: fffffc0807c2c26c
> [ 1892.940405] x25: 0000000000000000 x24: fffffc084ae24898
> [ 1892.945705] x23: 0000000000000000 x22: fffffc09410c1d00
> [ 1892.951004] x21: fffffc084ae24890 x20: fffffc084ae244a0
> [ 1892.956305] x19: fffffc084ae24000 x18: 0000000000000012
> [ 1892.961604] x17: 0000000000000001 x16: 0000000000000019
> [ 1892.966904] x15: fffffe0011d7e7e0 x14: fffffc8ba09cedf8
> [ 1892.972204] x13: 0000000000000000 x12: 0000000000000003
> [ 1892.977504] x11: fffffc8ba09ced40 x10: fffffe0011d7e7e8
> [ 1892.982804] x9 : fffffe000ad60c58 x8 : fffffe8b650f0000
> [ 1892.988104] x7 : 0000000000000008 x6 : fffffc0000000000
> [ 1892.993403] x5 : 0000000000000000 x4 : ffffffff22f337e0
> [ 1892.998703] x3 : fffffc084ae244ac x2 : 0000000000000001
> [ 1893.004003] x1 : fffffc084ae244ac x0 : 0000000000000000
> [ 1893.009303] Call trace:
> [ 1893.011737]  nvme_loop_reset_ctrl_work+0x48/0xf0 [nvme_loop]
> [ 1893.017384]  process_one_work+0x1d0/0x438
> [ 1893.021385]  worker_thread+0x1f8/0x4d8
> [ 1893.025123]  kthread+0x114/0x118
> [ 1893.028341]  ret_from_fork+0x10/0x18
> [ 1893.031907] ---[ end trace 883109425327ab60 ]---
> [ 1893.301843] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000008
> [ 1893.310620] Mem abort info:
> [ 1893.313401]   ESR = 0x96000006
> [ 1893.316442]   EC = 0x25: DABT (current EL), IL = 32 bits
> [ 1893.321741]   SET = 0, FnV = 0
> [ 1893.324783]   EA = 0, S1PTW = 0
> [ 1893.327911] Data abort info:
> [ 1893.330778]   ISV = 0, ISS = 0x00000006
> [ 1893.334600]   CM = 0, WnR = 0
> [ 1893.337555] user pgtable: 64k pages, 42-bit VAs, pgdp=0000000a0b750000
> [ 1893.344069] [0000000000000008] pgd=0000000000000000,
> p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000
> [ 1893.354669] Internal error: Oops: 96000006 [#1] SMP
> [ 1893.359535] Modules linked in: nvme_loop nvme_fabrics nvme_core
> nvmet loop rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_ ib_ipoib iw_cm
> intf ipmi_msghandrm_kms_helper s64 usb_storage s.418386] CPU: 0 PID:
> 12871 Comm: kworker/u513:0 Tainted: G        W         5.12.0+ #1
> [ 1893.426551] Hardware name: HPE Apollo 70             /C01_APACHE_MB
>          , BIOS L50_5.13_1.16 07/29/2020
> [ 1893.436277] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
> [ 1893.442892] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--)
> [ 1893.448886] pc : percpu_ref_kill_and_confirm+0x15c/0x178
> [ 1893.454189] lr : nvmet_sq_destroy+0xec/0x1f0 [nvmet]
> [ 1893.459149] sp : fffffe006466fc70
> [ 1893.462449] x29: fffffe006466fc70 x28: 0000000000000000
> [ 1893.467750] x27: fffffc8bcee4e5c0 x26: fffffc0807f57a6c
> [ 1893.473051] x25: 0000000000000000 x24: fffffc084ae248b8
> [ 1893.478351] x23: fffffc0854e600d0 x22: fffffe000ac40600
> [ 1893.483650] x21: 0000000000000000 x20: fffffc0854e60090
> [ 1893.488950] x19: fffffe0011d7e000 x18: 0000000000000016
> [ 1893.494250] x17: 0000000000000001 x16: fffffc084e210598
> [ 1893.499550] x15: fffffe0011d7e7e0 x14: fffffc8ba09cee18
> [ 1893.504850] x13: 0000000000000000 x12: fffffe006466fc68
> [ 1893.510149] x11: 0000000000000040 x10: fffffe001158b2f8
> [ 1893.515449] x9 : fffffe000ad608ec x8 : fffffc082000cfc0
> [ 1893.520749] x7 : 0000000000000001 x6 : fffffc0000000000
> [ 1893.526049] x5 : 0000000000000080 x4 : 0000000000000000
> [ 1893.531349] x3 : 0000000000000000 x2 : fffffe0011a4e000
> [ 1893.536649] x1 : fffffe0010bdec08 x0 : fffffe0010e95000
> [ 1893.541949] Call trace:
> [ 1893.544383]  percpu_ref_kill_and_confirm+0x15c/0x178
> [ 1893.549335]  nvmet_sq_destroy+0xec/0x1f0 [nvmet]
> [ 1893.553945]  nvme_loop_destroy_io_queues+0x64/0x90 [nvme_loop]
> [ 1893.559767]  nvme_loop_shutdown_ctrl+0x60/0xb8 [nvme_loop]
> [ 1893.565240]  nvme_loop_delete_ctrl_host+0x18/0x20 [nvme_loop]
> [ 1893.570973]  nvme_do_delete_ctrl+0x58/0x6c [nvme_core]
> [ 1893.576106]  nvme_delete_ctrl_work+0x18/0x38 [nvme_core]
> [ 1893.581411]  process_one_work+0x1d0/0x438
> [ 1893.585410]  worker_thread+0x150/0x4d8
> [ 1893.589148]  kthread+0x114/0x118
> [ 1893.592364]  ret_from_fork+0x10/0x18
> [ 1893.595929] Code: 39244840 f0003781 d0004d40 91302021 (f9400462)
> [ 1893.602139] ---[ end trace 883109425327ab61 ]---
> [ 1893.606744] Kernel panic - not syncing: Oops: Fatal exception
> [ 1893.612512] SMP: stopping secondary CPUs
> [ 1893.616474] Kernel Offset: disabled
> [ 1893.619949] CPU features: 0x00046002,63000838
> [ 1893.624293] Memory Limit: none
> [ 1893.627362] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---



* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-01  0:55 ` Sagi Grimberg
@ 2021-05-01  9:58   ` Yi Zhang
  2021-05-07  8:35     ` Yi Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-05-01  9:58 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, skt-results-master, Bruno Goncalves

On Sat, May 1, 2021 at 8:55 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Hello
> > Recently CKI reproduced this WARNING and NULL pointer dereference with
> > linux-block/for-next on aarch64. It looks like a regression; I will try
> > to bisect the culprit.
> >
> > blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns)
> >
> > [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36
> > [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [ 1879.819087] nvmet: creating controller 1 for subsystem
> > blktests-subsystem-1 for NQN
> > nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd.
> > [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0
> > [ 1879.838079] nvme nvme0: creating 128 I/O queues.
> > [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > [ 1880.879731] nvme nvme0: resetting controller
> > [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> > [ 1889.946377] nvmet: ctrl 1 fatal error occurred!
> > [ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>
> It appears that we somehow now expire the KATO after/during a reset
> sequence, and then the reset and remove paths race with each other...
>
> A bisection will definitely help.
>
Hi Sagi

Bisection shows the commit below was the first bad commit.

commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
Author: Hannes Reinecke <hare@suse.de>
Date:   Fri Apr 16 13:46:20 2021 +0200

    nvme: sanitize KATO setting

Thanks
Yi




* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-01  9:58   ` Yi Zhang
@ 2021-05-07  8:35     ` Yi Zhang
  2021-05-07 19:50       ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-05-07  8:35 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, skt-results-master, Bruno Goncalves, hare

On Sat, May 1, 2021 at 5:58 PM Yi Zhang <yi.zhang@redhat.com> wrote:
>
> On Sat, May 1, 2021 at 8:55 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> > > Hello
> > > Recently CKI reproduced this WARNING and NULL pointer dereference with
> > > linux-block/for-next on aarch64. It looks like a regression; I will try
> > > to bisect the culprit.
> > >
> > > blktests: nvme/022 (test NVMe reset command on NVMeOF file-backed ns)
> > >
> > > [ 1879.759978] run blktests nvme/022 at 2021-04-30 12:30:36
> > > [ 1879.804283] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > > [ 1879.819087] nvmet: creating controller 1 for subsystem
> > > blktests-subsystem-1 for NQN
> > > nqn.2014-08.org.nvmexpress:uuid:0da758a0-4d84-4133-82dd-9801235b55cd.
> > > [ 1879.833081] nvmet: unhandled identify cns 6 on qid 0
> > > [ 1879.838079] nvme nvme0: creating 128 I/O queues.
> > > [ 1879.852353] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > > [ 1880.879731] nvme nvme0: resetting controller
> > > [ 1889.940458] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> > > [ 1889.946377] nvmet: ctrl 1 fatal error occurred!
> > > [ 1889.950928] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> >
> > It appears that we somehow now expire the KATO after/during a reset
> > sequence, and then the reset and remove paths race with each other...
> >
> > A bisection will definitely help.
> >
> Hi Sagi
>
> Bisection shows the commit below was the first bad commit.
>
> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
> Author: Hannes Reinecke <hare@suse.de>
> Date:   Fri Apr 16 13:46:20 2021 +0200
>
>     nvme: sanitize KATO setting
>

Hi Sagi
This issue has been consistently reproduced with blktests on recent
linux-block/for-next; have you had a chance to check it?


> Thanks
> Yi




* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-07  8:35     ` Yi Zhang
@ 2021-05-07 19:50       ` Sagi Grimberg
  2021-05-09  8:44         ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2021-05-07 19:50 UTC (permalink / raw)
  To: Yi Zhang; +Cc: linux-nvme, skt-results-master, Bruno Goncalves, hare


>> Hi Sagi
>>
>> Bisection shows the commit below was the first bad commit.
>>
>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
>> Author: Hannes Reinecke <hare@suse.de>
>> Date:   Fri Apr 16 13:46:20 2021 +0200
>>
>>      nvme: sanitize KATO setting
>>
> 
> Hi Sagi
> This issue has been consistently reproduced with blktests on recent
> linux-block/for-next; have you had a chance to check it?

I have not. Hannes, can you have a look?



* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-07 19:50       ` Sagi Grimberg
@ 2021-05-09  8:44         ` Hannes Reinecke
  2021-05-12  0:32           ` Yi Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2021-05-09  8:44 UTC (permalink / raw)
  To: Sagi Grimberg, Yi Zhang; +Cc: linux-nvme, skt-results-master, Bruno Goncalves

On 5/7/21 9:50 PM, Sagi Grimberg wrote:
> 
>>> Hi Sagi
>>>
>>> Bisection shows the commit below was the first bad commit.
>>>
>>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
>>> Author: Hannes Reinecke <hare@suse.de>
>>> Date:   Fri Apr 16 13:46:20 2021 +0200
>>>
>>>      nvme: sanitize KATO setting
>>>
>>
>> Hi Sagi
>> This issue has been consistently reproduced with blktests on recent
>> linux-block/for-next; have you had a chance to check it?
> 
> I have not. Hannes, can you have a look?
> 
Can you check if this patch helps?

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 74b3b150e1a5..a48d8219cf90 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct work_struct *work)
                 container_of(work, struct nvme_loop_ctrl, ctrl.reset_work);
         int ret;

+       /* reset_work might run concurrently to delete_work */
+       if (ctrl->state == NVME_CTRL_DELETING)
+               return;
         nvme_stop_ctrl(&ctrl->ctrl);
         nvme_loop_shutdown_ctrl(ctrl);


Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-09  8:44         ` Hannes Reinecke
@ 2021-05-12  0:32           ` Yi Zhang
  2021-05-19  0:36             ` Yi Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-05-12  0:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves

On Sun, May 9, 2021 at 4:44 PM Hannes Reinecke <hare@suse.de> wrote:
>
> On 5/7/21 9:50 PM, Sagi Grimberg wrote:
> >
> >>> Hi Sagi
> >>>
> >>> Bisection shows the commit below was the first bad commit.
> >>>
> >>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
> >>> Author: Hannes Reinecke <hare@suse.de>
> >>> Date:   Fri Apr 16 13:46:20 2021 +0200
> >>>
> >>>      nvme: sanitize KATO setting
> >>>
> >>
> >> Hi Sagi
> >> This issue has been consistently reproduced with blktests on recent
> >> linux-block/for-next; have you had a chance to check it?
> >
> > I have not. Hannes, can you have a look?
> >
> Can you check if this patch helps?
>
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 74b3b150e1a5..a48d8219cf90 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct
> work_struct *work)
>                  container_of(work, struct nvme_loop_ctrl, ctrl.reset_work);
>          int ret;
>
> +       /* reset_work might run concurrently to delete_work */
> +       if (ctrl->state == NVME_CTRL_DELETING)
> +               return;
>          nvme_stop_ctrl(&ctrl->ctrl);
>          nvme_loop_shutdown_ctrl(ctrl);
>

Hi Hannes
The issue can still be reproduced with this patch; here is the full log:

<12>[  213.737602] run blktests nvme/022 at 2021-05-11 20:29:23
<6>[  213.857826] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
<6>[  213.879592] nvmet: creating controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:7bc1767264664586a5ebf3ec1cdca2b9.
<6>[  213.893281] nvme nvme0: creating 128 I/O queues.
<6>[  213.910709] nvme nvme0: new ctrl: "blktests-subsystem-1"
<4>[  214.943818] nvme nvme0: resetting controller
<3>[  224.004540] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
<3>[  224.010452] nvmet: ctrl 1 fatal error occurred!
<6>[  224.014997] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
<4>[  227.734741] ------------[ cut here ]------------
<4>[  227.739348] WARNING: CPU: 110 PID: 1648 at
drivers/nvme/target/loop.c:469 nvme_loop_reset_ctrl_work+0x5c/0x110
[nvme_loop]
<4>[  227.750386] Modules linked in: nvme_loop nvme_fabrics nvmet
nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod
target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib
rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc
coresight_etm4x i2c_smbus coresight_tmc coresight_replicator
coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw
ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu
vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec
drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas
usb_storage aes_neon_bs
<4>[  227.809227] CPU: 110 PID: 1648 Comm: kworker/u513:2 Not tainted
5.13.0-rc1.fix+ #1
<4>[  227.816784] Hardware name: HPE Apollo 70
/C01_APACHE_MB         , BIOS L50_5.13_1.16 07/29/2020
<4>[  227.826511] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
<4>[  227.833376] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
<4>[  227.839370] pc : nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop]
<4>[  227.845451] lr : nvme_loop_reset_ctrl_work+0x54/0x110 [nvme_loop]
<4>[  227.851532] sp : ffff800020c43d80
<4>[  227.854833] x29: ffff800020c43d80 x28: 0000000000000000 x27:
ffff00080d1b73c0
<4>[  227.861959] x26: ffff00080d1fb76c x25: 0000000000000000 x24:
ffff0008126be8c0
<4>[  227.869082] x23: 0000000000000000 x22: ffff000826eafb00 x21:
ffff0008126be4b0
<4>[  227.876205] x20: ffff0008126be000 x19: ffff0008126be8b8 x18:
0000000000000020
<4>[  227.883328] x17: 00000000f911e638 x16: 0000000032ecb495 x15:
000000000000000f
<4>[  227.890452] x14: 0000000000000003 x13: 0000000000000000 x12:
ffff008b822d32d0
<4>[  227.897575] x11: ffff008b822d3230 x10: 0000000000000001 x9 :
ffff800009058e74
<4>[  227.904699] x8 : ffff008b7cfeaab8 x7 : fffffffffffc0000 x6 :
ffff000000000000
<4>[  227.911822] x5 : 000000008040002a x4 : 0000000000000000 x3 :
ffff0008126be4bc
<4>[  227.918946] x2 : 0000000000000001 x1 : ffff0008126be4bc x0 :
0000000000000000
<4>[  227.926069] Call trace:
<4>[  227.928503]  nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop]
<4>[  227.934237]  process_one_work+0x1f0/0x4ac
<4>[  227.938239]  worker_thread+0x22c/0x40c
<4>[  227.941976]  kthread+0x11c/0x120
<4>[  227.945193]  ret_from_fork+0x10/0x18
<4>[  227.948759] ---[ end trace 57e2d1ace7d39024 ]---
<1>[  228.355440] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000008
<1>[  228.364215] Mem abort info:
<1>[  228.366996]   ESR = 0x96000004
<1>[  228.370037]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[  228.375336]   SET = 0, FnV = 0
<1>[  228.378377]   EA = 0, S1PTW = 0
<1>[  228.381505] Data abort info:
<1>[  228.384372]   ISV = 0, ISS = 0x00000004
<1>[  228.388194]   CM = 0, WnR = 0
<1>[  228.391149] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000897413000
<1>[  228.397576] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000
<0>[  228.404354] Internal error: Oops: 96000004 [#1] SMP
<4>[  228.409220] Modules linked in: nvme_loop nvme_fabrics nvmet
nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod
target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib
rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc
coresight_etm4x i2c_smbus coresight_tmc coresight_replicator
coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw
ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu
vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec
drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas
usb_storage aes_neon_bs
<4>[  228.468051] CPU: 5 PID: 2015 Comm: kworker/u513:7 Tainted: G
   W         5.13.0-rc1.fix+ #1
<4>[  228.476822] Hardware name: HPE Apollo 70
/C01_APACHE_MB         , BIOS L50_5.13_1.16 07/29/2020
<4>[  228.486548] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
<4>[  228.493166] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--)
<4>[  228.499160] pc : percpu_ref_kill_and_confirm+0x94/0xb0
<4>[  228.504288] lr : percpu_ref_kill_and_confirm+0x20/0xb0
<4>[  228.509414] sp : ffff800020f63c90
<4>[  228.512715] x29: ffff800020f63c90 x28: 0000000000000000 x27:
ffff0008214969c0
<4>[  228.519840] x26: ffff000809dc8b6c x25: 0000000000000000 x24:
ffff00084f1d00d0
<4>[  228.526963] x23: ffff00084f1d00b0 x22: ffff0008126be130 x21:
ffff80000939a590
<4>[  228.534087] x20: 0000000000000000 x19: ffff00084f1d0090 x18:
0000000000000020
<4>[  228.541210] x17: 0000000000000000 x16: 0000000000000000 x15:
0000000000000013
<4>[  228.548334] x14: 0000000000000003 x13: 0000000000000000 x12:
0000000000000040
<4>[  228.555457] x11: ffff000800401918 x10: ffff00080040191a x9 :
ffff8000090589ac
<4>[  228.562580] x8 : ffff008b7cfeaab8 x7 : 0000000000000018 x6 :
ffff000000000000
<4>[  228.569703] x5 : 0000000000000000 x4 : ffff008b85321400 x3 :
0000000000000001
<4>[  228.576827] x2 : 0000000000000000 x1 : ffff800011035d20 x0 :
ffff80001147f000
<4>[  228.583952] Call trace:
<4>[  228.586385]  percpu_ref_kill_and_confirm+0x94/0xb0
<4>[  228.591163]  nvmet_sq_destroy+0xf0/0x194 [nvmet]
<4>[  228.595777]  nvme_loop_destroy_io_queues+0x6c/0xa0 [nvme_loop]
<4>[  228.601598]  nvme_loop_shutdown_ctrl+0x64/0xc0 [nvme_loop]
<4>[  228.607071]  nvme_loop_delete_ctrl_host+0x20/0x30 [nvme_loop]
<4>[  228.612805]  nvme_do_delete_ctrl+0x5c/0x74 [nvme_core]
<4>[  228.617939]  nvme_delete_ctrl_work+0x20/0x44 [nvme_core]
<4>[  228.623246]  process_one_work+0x1f0/0x4ac
<4>[  228.627244]  worker_thread+0x184/0x40c
<4>[  228.630981]  kthread+0x11c/0x120
<4>[  228.634197]  ret_from_fork+0x10/0x18
<0>[  228.637763] Code: b0004841 3905dc03 91348021 f0006a80 (f9400442)
<4>[  228.643843] ---[ end trace 57e2d1ace7d39025 ]---




* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-12  0:32           ` Yi Zhang
@ 2021-05-19  0:36             ` Yi Zhang
  2021-05-20  6:19               ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-05-19  0:36 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves

Hi Hannes

Gentle ping here.

On Wed, May 12, 2021 at 8:32 AM Yi Zhang <yi.zhang@redhat.com> wrote:
>
> On Sun, May 9, 2021 at 4:44 PM Hannes Reinecke <hare@suse.de> wrote:
> >
> > On 5/7/21 9:50 PM, Sagi Grimberg wrote:
> > >
> > >>> Hi Sagi
> > >>>
> > >>> Bisection shows the commit below was the first bad commit.
> > >>>
> > >>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
> > >>> Author: Hannes Reinecke <hare@suse.de>
> > >>> Date:   Fri Apr 16 13:46:20 2021 +0200
> > >>>
> > >>>      nvme: sanitize KATO setting
> > >>>
> > >>
> > >> Hi Sagi
> > >> This issue has been consistently reproduced with blktests on recent
> > >> linux-block/for-next; have you had a chance to check it?
> > >
> > > I had not. Hannes, can you have a look?
> > >
> > Can you check if this patch helps?
> >
> > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> > index 74b3b150e1a5..a48d8219cf90 100644
> > --- a/drivers/nvme/target/loop.c
> > +++ b/drivers/nvme/target/loop.c
> > @@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct
> > work_struct *work)
> >                  container_of(work, struct nvme_loop_ctrl, ctrl.reset_work);
> >          int ret;
> >
> > +       /* reset_work might run concurrently to delete_work */
> > +       if (ctrl->state == NVME_CTRL_DELETING)
> > +               return;
> >          nvme_stop_ctrl(&ctrl->ctrl);
> >          nvme_loop_shutdown_ctrl(ctrl);
> >
>
> Hi Hannes
> The issue can still be reproduced with this patch; here is the full log:
>
> <12>[  213.737602] run blktests nvme/022 at 2021-05-11 20:29:23
> <6>[  213.857826] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> <6>[  213.879592] nvmet: creating controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:7bc1767264664586a5ebf3ec1cdca2b9.
> <6>[  213.893281] nvme nvme0: creating 128 I/O queues.
> <6>[  213.910709] nvme nvme0: new ctrl: "blktests-subsystem-1"
> <4>[  214.943818] nvme nvme0: resetting controller
> <3>[  224.004540] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> <3>[  224.010452] nvmet: ctrl 1 fatal error occurred!
> <6>[  224.014997] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> <4>[  227.734741] ------------[ cut here ]------------
> <4>[  227.739348] WARNING: CPU: 110 PID: 1648 at
> drivers/nvme/target/loop.c:469 nvme_loop_reset_ctrl_work+0x5c/0x110
> [nvme_loop]
> <4>[  227.750386] Modules linked in: nvme_loop nvme_fabrics nvmet
> nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod
> target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib
> rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc
> coresight_etm4x i2c_smbus coresight_tmc coresight_replicator
> coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw
> ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu
> vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec
> drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas
> usb_storage aes_neon_bs
> <4>[  227.809227] CPU: 110 PID: 1648 Comm: kworker/u513:2 Not tainted
> 5.13.0-rc1.fix+ #1
> <4>[  227.816784] Hardware name: HPE Apollo 70
> /C01_APACHE_MB         , BIOS L50_5.13_1.16 07/29/2020
> <4>[  227.826511] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
> <4>[  227.833376] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
> <4>[  227.839370] pc : nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop]
> <4>[  227.845451] lr : nvme_loop_reset_ctrl_work+0x54/0x110 [nvme_loop]
> <4>[  227.851532] sp : ffff800020c43d80
> <4>[  227.854833] x29: ffff800020c43d80 x28: 0000000000000000 x27:
> ffff00080d1b73c0
> <4>[  227.861959] x26: ffff00080d1fb76c x25: 0000000000000000 x24:
> ffff0008126be8c0
> <4>[  227.869082] x23: 0000000000000000 x22: ffff000826eafb00 x21:
> ffff0008126be4b0
> <4>[  227.876205] x20: ffff0008126be000 x19: ffff0008126be8b8 x18:
> 0000000000000020
> <4>[  227.883328] x17: 00000000f911e638 x16: 0000000032ecb495 x15:
> 000000000000000f
> <4>[  227.890452] x14: 0000000000000003 x13: 0000000000000000 x12:
> ffff008b822d32d0
> <4>[  227.897575] x11: ffff008b822d3230 x10: 0000000000000001 x9 :
> ffff800009058e74
> <4>[  227.904699] x8 : ffff008b7cfeaab8 x7 : fffffffffffc0000 x6 :
> ffff000000000000
> <4>[  227.911822] x5 : 000000008040002a x4 : 0000000000000000 x3 :
> ffff0008126be4bc
> <4>[  227.918946] x2 : 0000000000000001 x1 : ffff0008126be4bc x0 :
> 0000000000000000
> <4>[  227.926069] Call trace:
> <4>[  227.928503]  nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop]
> <4>[  227.934237]  process_one_work+0x1f0/0x4ac
> <4>[  227.938239]  worker_thread+0x22c/0x40c
> <4>[  227.941976]  kthread+0x11c/0x120
> <4>[  227.945193]  ret_from_fork+0x10/0x18
> <4>[  227.948759] ---[ end trace 57e2d1ace7d39024 ]---
> <1>[  228.355440] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000008
> <1>[  228.364215] Mem abort info:
> <1>[  228.366996]   ESR = 0x96000004
> <1>[  228.370037]   EC = 0x25: DABT (current EL), IL = 32 bits
> <1>[  228.375336]   SET = 0, FnV = 0
> <1>[  228.378377]   EA = 0, S1PTW = 0
> <1>[  228.381505] Data abort info:
> <1>[  228.384372]   ISV = 0, ISS = 0x00000004
> <1>[  228.388194]   CM = 0, WnR = 0
> <1>[  228.391149] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000897413000
> <1>[  228.397576] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000
> <0>[  228.404354] Internal error: Oops: 96000004 [#1] SMP
> <4>[  228.409220] Modules linked in: nvme_loop nvme_fabrics nvmet
> nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod
> target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib
> rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc
> coresight_etm4x i2c_smbus coresight_tmc coresight_replicator
> coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw
> ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu
> vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec
> drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas
> usb_storage aes_neon_bs
> <4>[  228.468051] CPU: 5 PID: 2015 Comm: kworker/u513:7 Tainted: G
>    W         5.13.0-rc1.fix+ #1
> <4>[  228.476822] Hardware name: HPE Apollo 70
> /C01_APACHE_MB         , BIOS L50_5.13_1.16 07/29/2020
> <4>[  228.486548] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
> <4>[  228.493166] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--)
> <4>[  228.499160] pc : percpu_ref_kill_and_confirm+0x94/0xb0
> <4>[  228.504288] lr : percpu_ref_kill_and_confirm+0x20/0xb0
> <4>[  228.509414] sp : ffff800020f63c90
> <4>[  228.512715] x29: ffff800020f63c90 x28: 0000000000000000 x27:
> ffff0008214969c0
> <4>[  228.519840] x26: ffff000809dc8b6c x25: 0000000000000000 x24:
> ffff00084f1d00d0
> <4>[  228.526963] x23: ffff00084f1d00b0 x22: ffff0008126be130 x21:
> ffff80000939a590
> <4>[  228.534087] x20: 0000000000000000 x19: ffff00084f1d0090 x18:
> 0000000000000020
> <4>[  228.541210] x17: 0000000000000000 x16: 0000000000000000 x15:
> 0000000000000013
> <4>[  228.548334] x14: 0000000000000003 x13: 0000000000000000 x12:
> 0000000000000040
> <4>[  228.555457] x11: ffff000800401918 x10: ffff00080040191a x9 :
> ffff8000090589ac
> <4>[  228.562580] x8 : ffff008b7cfeaab8 x7 : 0000000000000018 x6 :
> ffff000000000000
> <4>[  228.569703] x5 : 0000000000000000 x4 : ffff008b85321400 x3 :
> 0000000000000001
> <4>[  228.576827] x2 : 0000000000000000 x1 : ffff800011035d20 x0 :
> ffff80001147f000
> <4>[  228.583952] Call trace:
> <4>[  228.586385]  percpu_ref_kill_and_confirm+0x94/0xb0
> <4>[  228.591163]  nvmet_sq_destroy+0xf0/0x194 [nvmet]
> <4>[  228.595777]  nvme_loop_destroy_io_queues+0x6c/0xa0 [nvme_loop]
> <4>[  228.601598]  nvme_loop_shutdown_ctrl+0x64/0xc0 [nvme_loop]
> <4>[  228.607071]  nvme_loop_delete_ctrl_host+0x20/0x30 [nvme_loop]
> <4>[  228.612805]  nvme_do_delete_ctrl+0x5c/0x74 [nvme_core]
> <4>[  228.617939]  nvme_delete_ctrl_work+0x20/0x44 [nvme_core]
> <4>[  228.623246]  process_one_work+0x1f0/0x4ac
> <4>[  228.627244]  worker_thread+0x184/0x40c
> <4>[  228.630981]  kthread+0x11c/0x120
> <4>[  228.634197]  ret_from_fork+0x10/0x18
> <0>[  228.637763] Code: b0004841 3905dc03 91348021 f0006a80 (f9400442)
> <4>[  228.643843] ---[ end trace 57e2d1ace7d39025 ]---



-- 
Best Regards,
  Yi Zhang




* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-19  0:36             ` Yi Zhang
@ 2021-05-20  6:19               ` Hannes Reinecke
  2021-05-21  0:38                 ` Yi Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2021-05-20  6:19 UTC (permalink / raw)
  To: Yi Zhang; +Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves

On 5/19/21 2:36 AM, Yi Zhang wrote:
> Hi Hannes
> 
> Gentle ping here.
> 
> On Wed, May 12, 2021 at 8:32 AM Yi Zhang <yi.zhang@redhat.com> wrote:
>>
>> On Sun, May 9, 2021 at 4:44 PM Hannes Reinecke <hare@suse.de> wrote:
>>>
>>> On 5/7/21 9:50 PM, Sagi Grimberg wrote:
>>>>
>>>>>> Hi Sagi
>>>>>>
>>>>>> Bisect shows the below commit was the first bad commit.
>>>>>>
>>>>>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (refs/bisect/bad)
>>>>>> Author: Hannes Reinecke <hare@suse.de>
>>>>>> Date:   Fri Apr 16 13:46:20 2021 +0200
>>>>>>
>>>>>>       nvme: sanitize KATO setting
>>>>>>
>>>>>
>>>>> Hi Sagi
>>>>> This issue has been consistently reproduced with blktests on recent
>>>>> linux-block/for-next; have you had a chance to check it?
>>>>
>>>> I had not. Hannes, can you have a look?
>>>>
>>> Can you check if this patch helps?
>>>
>>> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
>>> index 74b3b150e1a5..a48d8219cf90 100644
>>> --- a/drivers/nvme/target/loop.c
>>> +++ b/drivers/nvme/target/loop.c
>>> @@ -458,6 +458,9 @@ static void nvme_loop_reset_ctrl_work(struct
>>> work_struct *work)
>>>                   container_of(work, struct nvme_loop_ctrl, ctrl.reset_work);
>>>           int ret;
>>>
>>> +       /* reset_work might run concurrently to delete_work */
>>> +       if (ctrl->state == NVME_CTRL_DELETING)
>>> +               return;
>>>           nvme_stop_ctrl(&ctrl->ctrl);
>>>           nvme_loop_shutdown_ctrl(ctrl);
>>>
>>
>> Hi Hannes
>> The issue can still be reproduced with this patch; here is the full log:
>>
>> <12>[  213.737602] run blktests nvme/022 at 2021-05-11 20:29:23
>> <6>[  213.857826] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> <6>[  213.879592] nvmet: creating controller 1 for subsystem
>> blktests-subsystem-1 for NQN
>> nqn.2014-08.org.nvmexpress:uuid:7bc1767264664586a5ebf3ec1cdca2b9.
>> <6>[  213.893281] nvme nvme0: creating 128 I/O queues.
>> <6>[  213.910709] nvme nvme0: new ctrl: "blktests-subsystem-1"
>> <4>[  214.943818] nvme nvme0: resetting controller
>> <3>[  224.004540] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
>> <3>[  224.010452] nvmet: ctrl 1 fatal error occurred!
>> <6>[  224.014997] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>> <4>[  227.734741] ------------[ cut here ]------------
>> <4>[  227.739348] WARNING: CPU: 110 PID: 1648 at
>> drivers/nvme/target/loop.c:469 nvme_loop_reset_ctrl_work+0x5c/0x110
>> [nvme_loop]
>> <4>[  227.750386] Modules linked in: nvme_loop nvme_fabrics nvmet
>> nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod
>> target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib
>> rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc
>> coresight_etm4x i2c_smbus coresight_tmc coresight_replicator
>> coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw
>> ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu
>> vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper
>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec
>> drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas
>> usb_storage aes_neon_bs
>> <4>[  227.809227] CPU: 110 PID: 1648 Comm: kworker/u513:2 Not tainted
>> 5.13.0-rc1.fix+ #1
>> <4>[  227.816784] Hardware name: HPE Apollo 70
>> /C01_APACHE_MB         , BIOS L50_5.13_1.16 07/29/2020
>> <4>[  227.826511] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
>> <4>[  227.833376] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
>> <4>[  227.839370] pc : nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop]
>> <4>[  227.845451] lr : nvme_loop_reset_ctrl_work+0x54/0x110 [nvme_loop]
>> <4>[  227.851532] sp : ffff800020c43d80
>> <4>[  227.854833] x29: ffff800020c43d80 x28: 0000000000000000 x27:
>> ffff00080d1b73c0
>> <4>[  227.861959] x26: ffff00080d1fb76c x25: 0000000000000000 x24:
>> ffff0008126be8c0
>> <4>[  227.869082] x23: 0000000000000000 x22: ffff000826eafb00 x21:
>> ffff0008126be4b0
>> <4>[  227.876205] x20: ffff0008126be000 x19: ffff0008126be8b8 x18:
>> 0000000000000020
>> <4>[  227.883328] x17: 00000000f911e638 x16: 0000000032ecb495 x15:
>> 000000000000000f
>> <4>[  227.890452] x14: 0000000000000003 x13: 0000000000000000 x12:
>> ffff008b822d32d0
>> <4>[  227.897575] x11: ffff008b822d3230 x10: 0000000000000001 x9 :
>> ffff800009058e74
>> <4>[  227.904699] x8 : ffff008b7cfeaab8 x7 : fffffffffffc0000 x6 :
>> ffff000000000000
>> <4>[  227.911822] x5 : 000000008040002a x4 : 0000000000000000 x3 :
>> ffff0008126be4bc
>> <4>[  227.918946] x2 : 0000000000000001 x1 : ffff0008126be4bc x0 :
>> 0000000000000000
>> <4>[  227.926069] Call trace:
>> <4>[  227.928503]  nvme_loop_reset_ctrl_work+0x5c/0x110 [nvme_loop]
>> <4>[  227.934237]  process_one_work+0x1f0/0x4ac
>> <4>[  227.938239]  worker_thread+0x22c/0x40c
>> <4>[  227.941976]  kthread+0x11c/0x120
>> <4>[  227.945193]  ret_from_fork+0x10/0x18
>> <4>[  227.948759] ---[ end trace 57e2d1ace7d39024 ]---
>> <1>[  228.355440] Unable to handle kernel NULL pointer dereference at
>> virtual address 0000000000000008
>> <1>[  228.364215] Mem abort info:
>> <1>[  228.366996]   ESR = 0x96000004
>> <1>[  228.370037]   EC = 0x25: DABT (current EL), IL = 32 bits
>> <1>[  228.375336]   SET = 0, FnV = 0
>> <1>[  228.378377]   EA = 0, S1PTW = 0
>> <1>[  228.381505] Data abort info:
>> <1>[  228.384372]   ISV = 0, ISS = 0x00000004
>> <1>[  228.388194]   CM = 0, WnR = 0
>> <1>[  228.391149] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000897413000
>> <1>[  228.397576] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000
>> <0>[  228.404354] Internal error: Oops: 96000004 [#1] SMP
>> <4>[  228.409220] Modules linked in: nvme_loop nvme_fabrics nvmet
>> nvme_core rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod
>> target_core_mod ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib
>> rdma_cm iw_cm ib_cm rfkill mlx5_ib ib_uverbs ib_core sunrpc
>> coresight_etm4x i2c_smbus coresight_tmc coresight_replicator
>> coresight_tpiu mlx5_core joydev acpi_ipmi psample ipmi_ssif mlxfw
>> ipmi_devintf ipmi_msghandler coresight_funnel coresight thunderx2_pmu
>> vfat fat fuse zram ip_tables xfs ast i2c_algo_bit drm_vram_helper
>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec
>> drm_ttm_helper ttm drm crct10dif_ce ghash_ce gpio_xlp i2c_xlp9xx uas
>> usb_storage aes_neon_bs
>> <4>[  228.468051] CPU: 5 PID: 2015 Comm: kworker/u513:7 Tainted: G
>>     W         5.13.0-rc1.fix+ #1
>> <4>[  228.476822] Hardware name: HPE Apollo 70
>> /C01_APACHE_MB         , BIOS L50_5.13_1.16 07/29/2020
>> <4>[  228.486548] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
>> <4>[  228.493166] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO BTYPE=--)
>> <4>[  228.499160] pc : percpu_ref_kill_and_confirm+0x94/0xb0
>> <4>[  228.504288] lr : percpu_ref_kill_and_confirm+0x20/0xb0
>> <4>[  228.509414] sp : ffff800020f63c90
>> <4>[  228.512715] x29: ffff800020f63c90 x28: 0000000000000000 x27:
>> ffff0008214969c0
>> <4>[  228.519840] x26: ffff000809dc8b6c x25: 0000000000000000 x24:
>> ffff00084f1d00d0
>> <4>[  228.526963] x23: ffff00084f1d00b0 x22: ffff0008126be130 x21:
>> ffff80000939a590
>> <4>[  228.534087] x20: 0000000000000000 x19: ffff00084f1d0090 x18:
>> 0000000000000020
>> <4>[  228.541210] x17: 0000000000000000 x16: 0000000000000000 x15:
>> 0000000000000013
>> <4>[  228.548334] x14: 0000000000000003 x13: 0000000000000000 x12:
>> 0000000000000040
>> <4>[  228.555457] x11: ffff000800401918 x10: ffff00080040191a x9 :
>> ffff8000090589ac
>> <4>[  228.562580] x8 : ffff008b7cfeaab8 x7 : 0000000000000018 x6 :
>> ffff000000000000
>> <4>[  228.569703] x5 : 0000000000000000 x4 : ffff008b85321400 x3 :
>> 0000000000000001
>> <4>[  228.576827] x2 : 0000000000000000 x1 : ffff800011035d20 x0 :
>> ffff80001147f000
>> <4>[  228.583952] Call trace:
>> <4>[  228.586385]  percpu_ref_kill_and_confirm+0x94/0xb0
>> <4>[  228.591163]  nvmet_sq_destroy+0xf0/0x194 [nvmet]
>> <4>[  228.595777]  nvme_loop_destroy_io_queues+0x6c/0xa0 [nvme_loop]
>> <4>[  228.601598]  nvme_loop_shutdown_ctrl+0x64/0xc0 [nvme_loop]
>> <4>[  228.607071]  nvme_loop_delete_ctrl_host+0x20/0x30 [nvme_loop]
>> <4>[  228.612805]  nvme_do_delete_ctrl+0x5c/0x74 [nvme_core]
>> <4>[  228.617939]  nvme_delete_ctrl_work+0x20/0x44 [nvme_core]
>> <4>[  228.623246]  process_one_work+0x1f0/0x4ac
>> <4>[  228.627244]  worker_thread+0x184/0x40c
>> <4>[  228.630981]  kthread+0x11c/0x120
>> <4>[  228.634197]  ret_from_fork+0x10/0x18
>> <0>[  228.637763] Code: b0004841 3905dc03 91348021 f0006a80 (f9400442)
>> <4>[  228.643843] ---[ end trace 57e2d1ace7d39025 ]---
> 
> 
> What about this?

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 74b3b150e1a5..9838a7d27bc1 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -263,7 +263,8 @@ static const struct blk_mq_ops 
nvme_loop_admin_mq_ops = {

  static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl)
  {
-       clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags);
+       if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags))
+               return;
         nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
         blk_cleanup_queue(ctrl->ctrl.admin_q);
         blk_cleanup_queue(ctrl->ctrl.fabrics_q);
@@ -299,6 +300,7 @@ static void nvme_loop_destroy_io_queues(struct 
nvme_loop_ctrl *ctrl)
                 clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[i].flags);
                 nvmet_sq_destroy(&ctrl->queues[i].nvme_sq);
         }
+       ctrl->ctrl.queue_count = 1;
  }

  static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl)
@@ -405,6 +407,7 @@ static int nvme_loop_configure_admin_queue(struct 
nvme_loop_ctrl *ctrl)
         return 0;

  out_cleanup_queue:
+       clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags);
         blk_cleanup_queue(ctrl->ctrl.admin_q);
  out_cleanup_fabrics_q:
         blk_cleanup_queue(ctrl->ctrl.fabrics_q);
@@ -462,8 +465,10 @@ static void nvme_loop_reset_ctrl_work(struct 
work_struct *work)
         nvme_loop_shutdown_ctrl(ctrl);

         if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
-               /* state change failure should never happen */
-               WARN_ON_ONCE(1);
+               if (ctrl->ctrl.state != NVME_CTRL_DELETING &&
+                   ctrl->ctrl.state != NVME_CTRL_DELETING_NOIO)
+                       /* state change failure for non-deleted ctrl? */
+                       WARN_ON_ONCE(1);
                 return;
         }


(I'll be sending out a formal patchset once it's confirmed).

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-20  6:19               ` Hannes Reinecke
@ 2021-05-21  0:38                 ` Yi Zhang
  2021-05-21 18:19                   ` Sagi Grimberg
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-05-21  0:38 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme, skt-results-master, Bruno Goncalves

> >
> > What about this?

Hi Hannes
With this patch there is no WARNING/NULL pointer this time, but the
'keep-alive timer expired' and reset failure issues remain; here is the
full log:

# ./check nvme/022
nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
    runtime  10.646s  ...  11.087s
    --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
    +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20
20:24:27.874250466 -0400
    @@ -1,4 +1,5 @@
     Running nvme/022
     91fdba0d-f87b-4c25-b80f-db7be1418b9e
     uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
    +ERROR: reset failed
     Test complete
# cat results/nodev/nvme/022.full
Reset: Network dropped connection on reset
NQN:blktests-subsystem-1 disconnected 1 controller(s)

[37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
[37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[37353.161765] nvmet: creating controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
[37353.175796] nvme nvme0: creating 128 I/O queues.
[37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
[37354.216686] nvme nvme0: resetting controller
[37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[37363.276521] nvmet: ctrl 1 fatal error occurred!
[37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"

# ./check nvme/021
nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
    runtime  10.958s  ...  11.382s
# dmesg
[38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
[38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[38142.956621] nvmet: creating controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
[38142.970524] nvme nvme0: creating 128 I/O queues.
[38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
[38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[38153.036018] nvmet: ctrl 1 fatal error occurred!


>
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 74b3b150e1a5..9838a7d27bc1 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -263,7 +263,8 @@ static const struct blk_mq_ops
> nvme_loop_admin_mq_ops = {
>
>   static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl)
>   {
> -       clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags);
> +       if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags))
> +               return;
>          nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
>          blk_cleanup_queue(ctrl->ctrl.admin_q);
>          blk_cleanup_queue(ctrl->ctrl.fabrics_q);
> @@ -299,6 +300,7 @@ static void nvme_loop_destroy_io_queues(struct
> nvme_loop_ctrl *ctrl)
>                  clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[i].flags);
>                  nvmet_sq_destroy(&ctrl->queues[i].nvme_sq);
>          }
> +       ctrl->ctrl.queue_count = 1;
>   }
>
>   static int nvme_loop_init_io_queues(struct nvme_loop_ctrl *ctrl)
> @@ -405,6 +407,7 @@ static int nvme_loop_configure_admin_queue(struct
> nvme_loop_ctrl *ctrl)
>          return 0;
>
>   out_cleanup_queue:
> +       clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags);
>          blk_cleanup_queue(ctrl->ctrl.admin_q);
>   out_cleanup_fabrics_q:
>          blk_cleanup_queue(ctrl->ctrl.fabrics_q);
> @@ -462,8 +465,10 @@ static void nvme_loop_reset_ctrl_work(struct
> work_struct *work)
>          nvme_loop_shutdown_ctrl(ctrl);
>
>          if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> -               /* state change failure should never happen */
> -               WARN_ON_ONCE(1);
> +               if (ctrl->ctrl.state != NVME_CTRL_DELETING &&
> +                   ctrl->ctrl.state != NVME_CTRL_DELETING_NOIO)
> +                       /* state change failure for non-deleted ctrl? */
> +                       WARN_ON_ONCE(1);
>                  return;
>          }
>
>
> (I'll be sending out a formal patchset once it's confirmed).
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
>


-- 
Best Regards,
  Yi Zhang




* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-21  0:38                 ` Yi Zhang
@ 2021-05-21 18:19                   ` Sagi Grimberg
  2021-05-22  0:12                     ` Yi Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2021-05-21 18:19 UTC (permalink / raw)
  To: Yi Zhang, Hannes Reinecke; +Cc: linux-nvme, skt-results-master, Bruno Goncalves


>>> What about this?
> 
> Hi Hannes
> With this patch there is no WARNING/NULL pointer this time, but the
> 'keep-alive timer expired' and reset failure issues remain; here is
> the full log:
> 
> # ./check nvme/022
> nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
>      runtime  10.646s  ...  11.087s
>      --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
>      +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20
> 20:24:27.874250466 -0400
>      @@ -1,4 +1,5 @@
>       Running nvme/022
>       91fdba0d-f87b-4c25-b80f-db7be1418b9e
>       uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
>      +ERROR: reset failed
>       Test complete
> # cat results/nodev/nvme/022.full
> Reset: Network dropped connection on reset
> NQN:blktests-subsystem-1 disconnected 1 controller(s)
> 
> [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
> [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [37353.161765] nvmet: creating controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> [37353.175796] nvme nvme0: creating 128 I/O queues.
> [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
> [37354.216686] nvme nvme0: resetting controller
> [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [37363.276521] nvmet: ctrl 1 fatal error occurred!
> [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> 
> # ./check nvme/021
> nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
>      runtime  10.958s  ...  11.382s
> # dmesg
> [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
> [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [38142.956621] nvmet: creating controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> [38142.970524] nvme nvme0: creating 128 I/O queues.
> [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
> [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [38153.036018] nvmet: ctrl 1 fatal error occurred!

I think the main reason is that 128 queues are being created, and during
that time the keep-alive timer ends up expiring, as it is now shorter
(it used to be 15 seconds, now 5 by default).

nvmet only stops the keep-alive timer when the controller is freed,
which is pretty late in the sequence. The problem is that it needs to
be this way: if we shut it down sooner, a host can die in the middle of
a teardown sequence and we still need to detect that and clean up
ourselves. But maybe we can mod the keep-alive timer for every queue we
delete, just in case the host is not deleting fast enough?

Ming, does this solve the issue you are seeing?
--
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 1853db38b682..f0715e9a4a9c 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
         percpu_ref_exit(&sq->ref);

         if (ctrl) {
+               ctrl->cmd_seen = true;
                 nvmet_ctrl_put(ctrl);
                 sq->ctrl = NULL; /* allows reusing the queue later */
         }
--

We probably need to rename cmd_seen to extend_tbkas (extend traffic
based keep-alive).



* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-21 18:19                   ` Sagi Grimberg
@ 2021-05-22  0:12                     ` Yi Zhang
  2021-05-22 14:59                       ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Zhang @ 2021-05-22  0:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, linux-nvme, skt-results-master, Bruno Goncalves

On Sat, May 22, 2021 at 2:19 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >>> What about this?
> >
> > Hi Hannes
> > With this patch there is no WARNING/NULL pointer this time, but the
> > 'keep-alive timer expired' and reset failure issues remain; here is
> > the full log:
> >
> > # ./check nvme/022
> > nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
> >      runtime  10.646s  ...  11.087s
> >      --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
> >      +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20
> > 20:24:27.874250466 -0400
> >      @@ -1,4 +1,5 @@
> >       Running nvme/022
> >       91fdba0d-f87b-4c25-b80f-db7be1418b9e
> >       uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
> >      +ERROR: reset failed
> >       Test complete
> > # cat results/nodev/nvme/022.full
> > Reset: Network dropped connection on reset
> > NQN:blktests-subsystem-1 disconnected 1 controller(s)
> >
> > [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
> > [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [37353.161765] nvmet: creating controller 1 for subsystem
> > blktests-subsystem-1 for NQN
> > nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> > [37353.175796] nvme nvme0: creating 128 I/O queues.
> > [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > [37354.216686] nvme nvme0: resetting controller
> > [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> > [37363.276521] nvmet: ctrl 1 fatal error occurred!
> > [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> >
> > # ./check nvme/021
> > nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
> >      runtime  10.958s  ...  11.382s
> > # dmesg
> > [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
> > [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [38142.956621] nvmet: creating controller 1 for subsystem
> > blktests-subsystem-1 for NQN
> > nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> > [38142.970524] nvme nvme0: creating 128 I/O queues.
> > [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> > [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> > [38153.036018] nvmet: ctrl 1 fatal error occurred!
>
> I think the main reason is that 128 queues are being created, and during
> that time the keep-alive timer ends up expiring, as it is now shorter
> (it used to be 15 seconds, now 5 by default).
>
> nvmet only stops the keep-alive timer when the controller is freed,
> which is pretty late in the sequence. The problem is that it needs to
> be this way: if we shut it down sooner, a host can die in the middle of
> a teardown sequence and we still need to detect that and clean up
> ourselves. But maybe we can mod the keep-alive timer for every queue we
> delete, just in case the host is not deleting fast enough?
>
> Ming, does this solve the issue you are seeing?

Hi Sagi
The issue was fixed by this patch. :)

> --
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index 1853db38b682..f0715e9a4a9c 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
>          percpu_ref_exit(&sq->ref);
>
>          if (ctrl) {
> +               ctrl->cmd_seen = true;
>                  nvmet_ctrl_put(ctrl);
>                  sq->ctrl = NULL; /* allows reusing the queue later */
>          }
> --
>
> We probably need to rename cmd_seen to extend_tbkas (extend
> traffic-based keep-alive).
>


-- 
Best Regards,
  Yi Zhang


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer
  2021-05-22  0:12                     ` Yi Zhang
@ 2021-05-22 14:59                       ` Hannes Reinecke
  0 siblings, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2021-05-22 14:59 UTC (permalink / raw)
  To: Yi Zhang, Sagi Grimberg; +Cc: linux-nvme, skt-results-master, Bruno Goncalves

On 5/22/21 2:12 AM, Yi Zhang wrote:
> On Sat, May 22, 2021 at 2:19 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>>>> What about this?
>>>
>>> Hi Hannes
>>> With this patch, no WARNING/NULL pointer this time, but we still have
>>> the 'keep-alive timer expired' and reset failure issue; here is the
>>> full log:
>>>
>>> # ./check nvme/022
>>> nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
>>>       runtime  10.646s  ...  11.087s
>>>       --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
>>>       +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20
>>> 20:24:27.874250466 -0400
>>>       @@ -1,4 +1,5 @@
>>>        Running nvme/022
>>>        91fdba0d-f87b-4c25-b80f-db7be1418b9e
>>>        uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
>>>       +ERROR: reset failed
>>>        Test complete
>>> # cat results/nodev/nvme/022.full
>>> Reset: Network dropped connection on reset
>>> NQN:blktests-subsystem-1 disconnected 1 controller(s)
>>>
>>> [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
>>> [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [37353.161765] nvmet: creating controller 1 for subsystem
>>> blktests-subsystem-1 for NQN
>>> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
>>> [37353.175796] nvme nvme0: creating 128 I/O queues.
>>> [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
>>> [37354.216686] nvme nvme0: resetting controller
>>> [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
>>> [37363.276521] nvmet: ctrl 1 fatal error occurred!
>>> [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>>>
>>> # ./check nvme/021
>>> nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
>>>       runtime  10.958s  ...  11.382s
>>> # dmesg
>>> [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
>>> [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [38142.956621] nvmet: creating controller 1 for subsystem
>>> blktests-subsystem-1 for NQN
>>> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
>>> [38142.970524] nvme nvme0: creating 128 I/O queues.
>>> [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
>>> [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>>> [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
>>> [38153.036018] nvmet: ctrl 1 fatal error occurred!
>>
>> I think that the main reason is that there are 128 queues being
>> created, and during that time the keep-alive timer ends up expiring as
>> it is shorter (used to be 15 seconds, now 5 by default).
>>
>> nvmet only stops the keep-alive timer when the controller is freed,
>> which is pretty late in the sequence. The problem is that it needs to
>> be this way because if we shut it down sooner a host can die in the
>> middle of a teardown sequence and we still need to detect that and
>> clean up ourselves. But maybe we can mod the keep-alive timer for
>> every queue we delete, just in case the host is not deleting
>> fast enough?
>>
>> Ming, does this solve the issue you are seeing?
> 
> Hi Sagi
> The issue was fixed by this patch. :)
> 
>> --
>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>> index 1853db38b682..f0715e9a4a9c 100644
>> --- a/drivers/nvme/target/core.c
>> +++ b/drivers/nvme/target/core.c
>> @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
>>           percpu_ref_exit(&sq->ref);
>>
>>           if (ctrl) {
>> +               ctrl->cmd_seen = true;
>>                   nvmet_ctrl_put(ctrl);
>>                   sq->ctrl = NULL; /* allows reusing the queue later */
>>           }
>> --
>>
>> We probably need to rename cmd_seen to extend_tbkas (extend
>> traffic-based keep-alive).
>>
> 
> 
Thanks for the confirmation.

I'll send a proper patchset.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



end of thread

Thread overview: 13+ messages
2021-04-30 16:40 [bug report] blktests nvme/022 lead kernel WARNING and NULL pointer Yi Zhang
2021-05-01  0:55 ` Sagi Grimberg
2021-05-01  9:58   ` Yi Zhang
2021-05-07  8:35     ` Yi Zhang
2021-05-07 19:50       ` Sagi Grimberg
2021-05-09  8:44         ` Hannes Reinecke
2021-05-12  0:32           ` Yi Zhang
2021-05-19  0:36             ` Yi Zhang
2021-05-20  6:19               ` Hannes Reinecke
2021-05-21  0:38                 ` Yi Zhang
2021-05-21 18:19                   ` Sagi Grimberg
2021-05-22  0:12                     ` Yi Zhang
2021-05-22 14:59                       ` Hannes Reinecke
