All of lore.kernel.org
 help / color / mirror / Atom feed
* 4.0.0-rc4 NVMe NULL pointer dereference and hang
@ 2015-03-23  2:20 Steven Noonan
  0 siblings, 0 replies; 3+ messages in thread
From: Steven Noonan @ 2015-03-23  2:20 UTC (permalink / raw)
  To: Linux Kernel mailing List, Matthew Wilcox

This happens on boot, and then eventually results in an RCU stall.

[    8.047533] nvme 0000:05:00.0: Device not ready; aborting initialisation

Note that the above is expected with this hardware (long story).
Although 3.19.x prints the above and then continues gracefully, 4.0-rc
breaks immediately after the above message:

[    8.054306] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[    8.062155] IP: [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
[    8.069043] PGD 0
[    8.071067] Oops: 0002 [#1] SMP
[    8.074332] Modules linked in: ahci libahci libata ehci_pci
ehci_hcd scsi_mod usbcore usb_common nvme i915 intel_gtt i2c_algo_bit
video drm_kms_helper drm i2c_core e
1000e ptp pps_core ipmi_poweroff ipmi_msghandler button
[    8.094244] CPU: 4 PID: 632 Comm: kworker/u288:1 Not tainted
4.0.0-rc4-00347-gb87444a2 #5
[    8.109878] Workqueue: nvme nvme_reset_workfn [nvme]
[    8.114852] task: ffff881f98271d70 ti: ffff881f982b8000 task.ti:
ffff881f982b8000
[    8.122321] RIP: 0010:[<ffffffffa025e614>]  [<ffffffffa025e614>]
nvme_dev_list_remove+0x24/0xa0 [nvme]
[    8.131624] RSP: 0000:ffff881f982bbd18  EFLAGS: 00010246
[    8.136930] RAX: 0000000000000000 RBX: ffff883f63f84800 RCX: ffff88bf66e6a418
[    8.144052] RDX: 0000000000000000 RSI: 0000000000000120 RDI: ffffffffa0269848
[    8.151171] RBP: ffff881f982bbd28 R08: ffff881f982b8000 R09: 0000000000000001
[    8.158288] R10: 0000000000000086 R11: 0000000000000020 R12: ffff883f63f84800
[    8.165411] R13: ffff88bf66e6a400 R14: ffff88df627ff900 R15: 0000000000001000
[    8.172530] FS:  0000000000000000(0000) GS:ffff883f7f880000(0000)
knlGS:0000000000000000
[    8.180600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.186337] CR2: 0000000000000008 CR3: 000001007ea0c000 CR4: 00000000001406e0
[    8.193458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    8.200574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    8.207693] Stack:
[    8.209705]  ffff881f982bbd28 ffff883f63f84978 ffff881f982bbdc8
ffffffffa026005e
[    8.217150]  ffff883f7f894300 00000000ffff8de0 ffff881f982bbd98
ffffffff810a65e1
[    8.224600]  ffff881f982bbdd8 ffffffff810a9943 ffff881f982bbd98
ffff881f982bbdd0
[    8.232049] Call Trace:
[    8.234500]  [<ffffffffa026005e>] nvme_dev_shutdown+0x1e/0x430 [nvme]
[    8.240943]  [<ffffffff810a65e1>] ? put_prev_entity+0x31/0x350
[    8.246772]  [<ffffffff810a9943>] ? pick_next_task_fair+0x103/0x4e0
[    8.253046]  [<ffffffff81012605>] ? __switch_to+0x175/0x5c0
[    8.258607]  [<ffffffffa0262a8e>] nvme_reset_failed_dev+0x1e/0x100 [nvme]
[    8.265378]  [<ffffffffa025e1cf>] nvme_reset_workfn+0xf/0x20 [nvme]
[    8.271649]  [<ffffffff810872fe>] process_one_work+0x14e/0x400
[    8.277472]  [<ffffffff8108822b>] worker_thread+0x5b/0x530
[    8.282943]  [<ffffffff810881d0>] ? rescuer_thread+0x3a0/0x3a0
[    8.288778]  [<ffffffff8108d359>] kthread+0xc9/0xe0
[    8.293649]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
[    8.299322]  [<ffffffff81541158>] ret_from_fork+0x58/0x90
[    8.304711]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
[    8.310357] Code: 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5
53 48 89 fb 48 c7 c7 48 98 26 a0 48 83 ec 08 e8 c3 23 2e e1 48 8b 13
48 8b 43 08 <48> 89 42 08 48 8
9 10 48 89 1b 48 81 3d 77 ae 00 00 a0 94 26 a0
[    8.330295] RIP  [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
[    8.337258]  RSP <ffff881f982bbd18>
[    8.340739] CR2: 0000000000000008
[    8.344056] ---[ end trace 70831a936042aa41 ]---
[    8.348727] BUG: unable to handle kernel paging request at ffffffffffffffd8
[    8.355708] IP: [<ffffffff8108d801>] kthread_data+0x11/0x20
[    8.361286] PGD 1007ea0f067 PUD 1007ea11067 PMD 0
[    8.366120] Oops: 0000 [#2]
[    8.412683] SMP
[    8.414727] Modules linked in: ahci libahci libata ehci_pci
ehci_hcd scsi_mod usbcore usb_common nvme i915 intel_gtt i2c_algo_bit
video drm_kms_helper drm i2c_core e
1000e ptp pps_core ipmi_poweroff ipmi_msghandler button
[    8.434632] CPU: 4 PID: 632 Comm: kworker/u288:1 Tainted: G      D
       4.0.0-rc4-00347-gb87444a2 #5
[    8.451480] task: ffff881f98271d70 ti: ffff881f982b8000 task.ti:
ffff881f982b8000
[    8.458942] RIP: 0010:[<ffffffff8108d801>]  [<ffffffff8108d801>]
kthread_data+0x11/0x20
[    8.466946] RSP: 0000:ffff881f982bb9d0  EFLAGS: 00010096
[    8.472247] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 000000000000000f
[    8.479367] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff881f98271d70
[    8.486486] RBP: ffff881f982bb9e8 R08: 0000000000000000 R09: ffffffff8106d480
[    8.493605] R10: ffff883f7f897f20 R11: ffffea007e60a800 R12: 0000000000014300
[    8.500726] R13: ffff883f7f894300 R14: ffff881f98271d70 R15: 0000000000000004
[    8.507846] FS:  0000000000000000(0000) GS:ffff883f7f880000(0000)
knlGS:0000000000000000
[    8.515918] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.521651] CR2: 0000000000000028 CR3: 000001007ea0c000 CR4: 00000000001406e0
[    8.528772] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    8.535890] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    8.543011] Stack:
[    8.545020]  ffffffff81088785 ffff881f982bb9e8 ffff883f7f894300
ffff881f982bba38
[    8.552467]  ffffffff8153cfae 0000000000000000 ffff881f98271d70
0000000000000001
[    8.559917]  ffff881f982bbfd8 ffff881f982728d0 ffff881f982bb228
ffff887f66dc8000
[    8.567367] Call Trace:
[    8.569814]  [<ffffffff81088785>] ? wq_worker_sleeping+0x15/0xa0
[    8.575820]  [<ffffffff8153cfae>] __schedule+0x64e/0x9b0
[    8.581122]  [<ffffffff8153d347>] schedule+0x37/0x90
[    8.586094]  [<ffffffff81072a63>] do_exit+0x773/0xa90
[    8.591146]  [<ffffffff81017a08>] oops_end+0xa8/0x120
[    8.596209]  [<ffffffff8105c97d>] no_context+0x11d/0x390
[    8.601510]  [<ffffffff8105cc6d>] __bad_area_nosemaphore+0x7d/0x210
[    8.607763]  [<ffffffff8105ce13>] bad_area_nosemaphore+0x13/0x20
[    8.613758]  [<ffffffff8105d2aa>] __do_page_fault+0x25a/0x4a0
[    8.619491]  [<ffffffff8105d512>] do_page_fault+0x22/0x30
[    8.624888]  [<ffffffff815433c8>] page_fault+0x28/0x30
[    8.630016]  [<ffffffffa025e614>] ? nvme_dev_list_remove+0x24/0xa0 [nvme]
[    8.636789]  [<ffffffffa025e60d>] ? nvme_dev_list_remove+0x1d/0xa0 [nvme]
[    8.643563]  [<ffffffffa026005e>] nvme_dev_shutdown+0x1e/0x430 [nvme]
[    8.649988]  [<ffffffff810a65e1>] ? put_prev_entity+0x31/0x350
[    8.655810]  [<ffffffff810a9943>] ? pick_next_task_fair+0x103/0x4e0
[    8.662063]  [<ffffffff81012605>] ? __switch_to+0x175/0x5c0
[    8.667624]  [<ffffffffa0262a8e>] nvme_reset_failed_dev+0x1e/0x100 [nvme]
[    8.674395]  [<ffffffffa025e1cf>] nvme_reset_workfn+0xf/0x20 [nvme]
[    8.680649]  [<ffffffff810872fe>] process_one_work+0x14e/0x400
[    8.686470]  [<ffffffff8108822b>] worker_thread+0x5b/0x530
[    8.691944]  [<ffffffff810881d0>] ? rescuer_thread+0x3a0/0x3a0
[    8.697763]  [<ffffffff8108d359>] kthread+0xc9/0xe0
[    8.702633]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
[    8.708279]  [<ffffffff81541158>] ret_from_fork+0x58/0x90
[    8.713666]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
[    8.724701] Code: 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3
66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 e8 08 00 00 55
48 89 e5 5d <48> 8b 40 d8 c3 6
6 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[    8.744638] RIP  [<ffffffff8108d801>] kthread_data+0x11/0x20
[    8.750303]  RSP <ffff881f982bb9d0>
[    8.753787] CR2: ffffffffffffffd8
[    8.757095] ---[ end trace 70831a936042aa42 ]---

Eventually an RCU stall message is printed and the host is unusable until reset.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: 4.0.0-rc4 NVMe NULL pointer dereference and hang
       [not found] <E1Ya2zQ-00012y-Gh@feisty.vs19.net>
@ 2015-03-23 14:45   ` Keith Busch
  0 siblings, 0 replies; 3+ messages in thread
From: Keith Busch @ 2015-03-23 14:45 UTC (permalink / raw)
  To: steven; +Cc: linux-nvme, linux-kernel

On Sun, 22 Mar 2015, Steven Noonan wrote:
> This happens on boot, and then eventually results in an RCU stall.
>
> [    8.047533] nvme 0000:05:00.0: Device not ready; aborting initialisation
>
> Note that the above is expected with this hardware (long story).
> Although 3.19.x prints the above and then continues gracefully, 4.0-rc
> breaks immediately after the above message:

Thanks for the notice. I CC'ed the linux-nvme mailing list.

Since your device failed to become ready (this is expected, you say? ok,
I won't ask. :)), it triggered recovery action that assumed it's list
head was initialized once before. It's a one-line fix:

---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 3b43897..ab7c847 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -3007,6 +3007,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  	}
  	get_device(dev->device);

+	INIT_LIST_HEAD(&dev->node);
  	INIT_WORK(&dev->probe_work, nvme_async_probe);
  	schedule_work(&dev->probe_work);
  	return 0;
--

I'll coerce someone to merge this for rc6 today.

> [    8.054306] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [    8.062155] IP: [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
> [    8.069043] PGD 0
> [    8.071067] Oops: 0002 [#1] SMP
> [    8.074332] Modules linked in: ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common nvme i915 intel_gtt i2c_algo_bit video drm_kms_helper drm i2c_core e1000e ptp pps_core ipmi_poweroff ipmi_msghandler button
> [    8.094244] CPU: 4 PID: 632 Comm: kworker/u288:1 Not tainted 4.0.0-rc4-00347-gb87444a2 #5
> [    8.109878] Workqueue: nvme nvme_reset_workfn [nvme]
> [    8.114852] task: ffff881f98271d70 ti: ffff881f982b8000 task.ti: ffff881f982b8000
> [    8.122321] RIP: 0010:[<ffffffffa025e614>]  [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
> [    8.131624] RSP: 0000:ffff881f982bbd18  EFLAGS: 00010246
> [    8.136930] RAX: 0000000000000000 RBX: ffff883f63f84800 RCX: ffff88bf66e6a418
> [    8.144052] RDX: 0000000000000000 RSI: 0000000000000120 RDI: ffffffffa0269848
> [    8.151171] RBP: ffff881f982bbd28 R08: ffff881f982b8000 R09: 0000000000000001
> [    8.158288] R10: 0000000000000086 R11: 0000000000000020 R12: ffff883f63f84800
> [    8.165411] R13: ffff88bf66e6a400 R14: ffff88df627ff900 R15: 0000000000001000
> [    8.172530] FS:  0000000000000000(0000) GS:ffff883f7f880000(0000) knlGS:0000000000000000
> [    8.180600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    8.186337] CR2: 0000000000000008 CR3: 000001007ea0c000 CR4: 00000000001406e0
> [    8.193458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    8.200574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    8.207693] Stack:
> [    8.209705]  ffff881f982bbd28 ffff883f63f84978 ffff881f982bbdc8 ffffffffa026005e
> [    8.217150]  ffff883f7f894300 00000000ffff8de0 ffff881f982bbd98 ffffffff810a65e1
> [    8.224600]  ffff881f982bbdd8 ffffffff810a9943 ffff881f982bbd98 ffff881f982bbdd0
> [    8.232049] Call Trace:
> [    8.234500]  [<ffffffffa026005e>] nvme_dev_shutdown+0x1e/0x430 [nvme]
> [    8.240943]  [<ffffffff810a65e1>] ? put_prev_entity+0x31/0x350
> [    8.246772]  [<ffffffff810a9943>] ? pick_next_task_fair+0x103/0x4e0
> [    8.253046]  [<ffffffff81012605>] ? __switch_to+0x175/0x5c0
> [    8.258607]  [<ffffffffa0262a8e>] nvme_reset_failed_dev+0x1e/0x100 [nvme]
> [    8.265378]  [<ffffffffa025e1cf>] nvme_reset_workfn+0xf/0x20 [nvme]
> [    8.271649]  [<ffffffff810872fe>] process_one_work+0x14e/0x400
> [    8.277472]  [<ffffffff8108822b>] worker_thread+0x5b/0x530
> [    8.282943]  [<ffffffff810881d0>] ? rescuer_thread+0x3a0/0x3a0
> [    8.288778]  [<ffffffff8108d359>] kthread+0xc9/0xe0
> [    8.293649]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
> [    8.299322]  [<ffffffff81541158>] ret_from_fork+0x58/0x90
> [    8.304711]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
> [    8.310357] Code: 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 c7 c7 48 98 26 a0 48 83 ec 08 e8 c3 23 2e e1 48 8b 13 48 8b 43 08 <48> 89 42 08 48 89 10 48 89 1b 48 81 3d 77 ae 00 00 a0 94 26 a0
> [    8.330295] RIP  [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
> [    8.337258]  RSP <ffff881f982bbd18>
> [    8.340739] CR2: 0000000000000008
> [    8.344056] ---[ end trace 70831a936042aa41 ]---

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* 4.0.0-rc4 NVMe NULL pointer dereference and hang
@ 2015-03-23 14:45   ` Keith Busch
  0 siblings, 0 replies; 3+ messages in thread
From: Keith Busch @ 2015-03-23 14:45 UTC (permalink / raw)


On Sun, 22 Mar 2015, Steven Noonan wrote:
> This happens on boot, and then eventually results in an RCU stall.
>
> [    8.047533] nvme 0000:05:00.0: Device not ready; aborting initialisation
>
> Note that the above is expected with this hardware (long story).
> Although 3.19.x prints the above and then continues gracefully, 4.0-rc
> breaks immediately after the above message:

Thanks for the notice. I CC'ed the linux-nvme mailing list.

Since your device failed to become ready (this is expected, you say? ok,
I won't ask. :)), it triggered recovery action that assumed it's list
head was initialized once before. It's a one-line fix:

---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 3b43897..ab7c847 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -3007,6 +3007,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  	}
  	get_device(dev->device);

+	INIT_LIST_HEAD(&dev->node);
  	INIT_WORK(&dev->probe_work, nvme_async_probe);
  	schedule_work(&dev->probe_work);
  	return 0;
--

I'll coerce someone to merge this for rc6 today.

> [    8.054306] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [    8.062155] IP: [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
> [    8.069043] PGD 0
> [    8.071067] Oops: 0002 [#1] SMP
> [    8.074332] Modules linked in: ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common nvme i915 intel_gtt i2c_algo_bit video drm_kms_helper drm i2c_core e1000e ptp pps_core ipmi_poweroff ipmi_msghandler button
> [    8.094244] CPU: 4 PID: 632 Comm: kworker/u288:1 Not tainted 4.0.0-rc4-00347-gb87444a2 #5
> [    8.109878] Workqueue: nvme nvme_reset_workfn [nvme]
> [    8.114852] task: ffff881f98271d70 ti: ffff881f982b8000 task.ti: ffff881f982b8000
> [    8.122321] RIP: 0010:[<ffffffffa025e614>]  [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
> [    8.131624] RSP: 0000:ffff881f982bbd18  EFLAGS: 00010246
> [    8.136930] RAX: 0000000000000000 RBX: ffff883f63f84800 RCX: ffff88bf66e6a418
> [    8.144052] RDX: 0000000000000000 RSI: 0000000000000120 RDI: ffffffffa0269848
> [    8.151171] RBP: ffff881f982bbd28 R08: ffff881f982b8000 R09: 0000000000000001
> [    8.158288] R10: 0000000000000086 R11: 0000000000000020 R12: ffff883f63f84800
> [    8.165411] R13: ffff88bf66e6a400 R14: ffff88df627ff900 R15: 0000000000001000
> [    8.172530] FS:  0000000000000000(0000) GS:ffff883f7f880000(0000) knlGS:0000000000000000
> [    8.180600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    8.186337] CR2: 0000000000000008 CR3: 000001007ea0c000 CR4: 00000000001406e0
> [    8.193458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    8.200574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    8.207693] Stack:
> [    8.209705]  ffff881f982bbd28 ffff883f63f84978 ffff881f982bbdc8 ffffffffa026005e
> [    8.217150]  ffff883f7f894300 00000000ffff8de0 ffff881f982bbd98 ffffffff810a65e1
> [    8.224600]  ffff881f982bbdd8 ffffffff810a9943 ffff881f982bbd98 ffff881f982bbdd0
> [    8.232049] Call Trace:
> [    8.234500]  [<ffffffffa026005e>] nvme_dev_shutdown+0x1e/0x430 [nvme]
> [    8.240943]  [<ffffffff810a65e1>] ? put_prev_entity+0x31/0x350
> [    8.246772]  [<ffffffff810a9943>] ? pick_next_task_fair+0x103/0x4e0
> [    8.253046]  [<ffffffff81012605>] ? __switch_to+0x175/0x5c0
> [    8.258607]  [<ffffffffa0262a8e>] nvme_reset_failed_dev+0x1e/0x100 [nvme]
> [    8.265378]  [<ffffffffa025e1cf>] nvme_reset_workfn+0xf/0x20 [nvme]
> [    8.271649]  [<ffffffff810872fe>] process_one_work+0x14e/0x400
> [    8.277472]  [<ffffffff8108822b>] worker_thread+0x5b/0x530
> [    8.282943]  [<ffffffff810881d0>] ? rescuer_thread+0x3a0/0x3a0
> [    8.288778]  [<ffffffff8108d359>] kthread+0xc9/0xe0
> [    8.293649]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
> [    8.299322]  [<ffffffff81541158>] ret_from_fork+0x58/0x90
> [    8.304711]  [<ffffffff8108d290>] ? kthread_stop+0x100/0x100
> [    8.310357] Code: 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 c7 c7 48 98 26 a0 48 83 ec 08 e8 c3 23 2e e1 48 8b 13 48 8b 43 08 <48> 89 42 08 48 89 10 48 89 1b 48 81 3d 77 ae 00 00 a0 94 26 a0
> [    8.330295] RIP  [<ffffffffa025e614>] nvme_dev_list_remove+0x24/0xa0 [nvme]
> [    8.337258]  RSP <ffff881f982bbd18>
> [    8.340739] CR2: 0000000000000008
> [    8.344056] ---[ end trace 70831a936042aa41 ]---

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-03-23 14:45 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-23  2:20 4.0.0-rc4 NVMe NULL pointer dereference and hang Steven Noonan
     [not found] <E1Ya2zQ-00012y-Gh@feisty.vs19.net>
2015-03-23 14:45 ` Keith Busch
2015-03-23 14:45   ` Keith Busch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.